Is the path at the bottom of p. 79 of the book wrong?

ssymai · June 24, 2020, 12:48pm

Also, on p. 80, shouldn't it be with open(msr_test) as test?

hankcs · June 26, 2020, 7:08pm

The path is not wrong. Note that the code strips the whitespace from each line before segmenting it:

hankcs/pyhanlp/blob/15ad039b6b9b29e77b1c644b2784613943488afb/tests/book/ch02/evaluate_cws.py#L73


sighan05 = ensure_data('icwb2-data', 'http://sighan.cs.uchicago.edu/bakeoff2005/data/icwb2-data.zip')
msr_dict = os.path.join(sighan05, 'gold', 'msr_training_words.utf8')
msr_test = os.path.join(sighan05, 'testing', 'msr_test.utf8')
msr_output = os.path.join(sighan05, 'testing', 'msr_output.txt')
msr_gold = os.path.join(sighan05, 'gold', 'msr_test_gold.utf8')


DoubleArrayTrieSegment = JClass('com.hankcs.hanlp.seg.Other.DoubleArrayTrieSegment')
segment = DoubleArrayTrieSegment([msr_dict]).enablePartOfSpeechTagging(True)
with open(msr_gold, encoding='utf-8') as test, open(msr_output, 'w', encoding='utf-8') as output:
    for line in test:
        output.write("  ".join(term.word for term in segment.seg(re.sub("\\s+", "", line))))
        output.write("\n")
print("P:%.2f R:%.2f F1:%.2f OOV-R:%.2f IV-R:%.2f" % prf(msr_gold, msr_output, segment.trie))

For the reason behind this, see the end of the second paragraph on p. 81.
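
As a minimal sketch of what the re.sub call above does: stripping all whitespace from a gold-segmented line yields the raw, unsegmented sentence, which can then be fed to the segmenter as if it had come from the test file. The helper name gold_to_raw and the sample sentence are made up for illustration; neither comes from pyhanlp or the MSR corpus.

```python
import re


def gold_to_raw(gold_line: str) -> str:
    """Remove every run of whitespace so a segmented gold line becomes raw input text."""
    return re.sub(r"\s+", "", gold_line)


# Hypothetical gold line: words separated by two spaces, ending with a newline.
gold_line = "商品  和  服务\n"
raw = gold_to_raw(gold_line)
print(raw)  # 商品和服务
```

Because \s+ also matches the trailing newline, the line break is removed along with the spaces, which is why the loop above writes "\n" explicitly after each segmented line.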

The Java repository strips the whitespace as well:

hankcs/HanLP/blob/a5efa696e3b3efa9a133ad54ab1bf7143b7ffc98/src/main/java/com/hankcs/hanlp/seg/common/CWSEvaluator.java#L197


 * @param dictPath   word list of the training set
 * @return a structure that stores the accuracy
 * @throws IOException
 */
public static CWSEvaluator.Result evaluate(Segment segment, String outputPath, String goldFile, String dictPath) throws IOException
{
    IOUtil.LineIterator lineIterator = new IOUtil.LineIterator(goldFile);
    BufferedWriter bw = IOUtil.newBufferedWriter(outputPath);
    for (String line : lineIterator)
    {
        List<Term> termList = segment.seg(line.replaceAll("\\s+", "")); // Some testFiles do not match their goldFile at all (e.g. some lines of the MSR testFile are missing words), so the goldFile with whitespace removed is used instead
        int i = 0;
        for (Term term : termList)
        {
            bw.write(term.word);
            if (++i != termList.size())
                bw.write("  ");
        }
        bw.newLine();
    }
    bw.close();

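The printed edition, however, does not strip the whitespace. I have already contacted the publisher to have this corrected.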
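
To see which lines of a test file disagree with its gold file, a rough check along the following lines may help. This is only a sketch under stated assumptions: find_mismatched_lines is a hypothetical helper, not part of pyhanlp or HanLP, and the sample lines are invented rather than taken from the MSR data.

```python
import re


def find_mismatched_lines(test_lines, gold_lines):
    """Return 1-based line numbers where the test text differs from the de-spaced gold text."""
    mismatches = []
    for number, (test_line, gold_line) in enumerate(zip(test_lines, gold_lines), start=1):
        # Compare both sides with all whitespace removed, so only the characters matter.
        if re.sub(r"\s+", "", test_line) != re.sub(r"\s+", "", gold_line):
            mismatches.append(number)
    return mismatches


# Invented sample: the second test line is missing a word that the gold line contains.
test_lines = ["商品和服务\n", "结婚的尚未结婚的\n"]
gold_lines = ["商品  和  服务\n", "结婚  的  和  尚未  结婚  的\n"]
print(find_mismatched_lines(test_lines, gold_lines))  # [2]
```

With real data, the two arguments could be file objects opened on msr_test and msr_gold with encoding='utf-8'. Note that zip stops at the shorter input, so a difference in line count would need a separate check.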