A question about how 1.x and 2.0 differ when tokenizing time expressions

AliBug · February 6, 2020, 11:09am

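Example sentence: 神州六号于2005年10月12日9:00把两名宇航员送上太空。

1.x recognizes 2005年10月12日9:00 as a single time token.

In the 2.0 online demo, "12" and "日" are split into two tokens, and "9:00" is tagged "CD" (cardinal number).

Running 2.0.0a35 offline gives a result that again differs from the online demo: "12日" is a single token, while "9:00" is tagged "PU" (punctuation).

How can I get 2.0 to segment a time expression as a single token, the way 1.x does?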

# Offline code
import hanlp

tokenizer = hanlp.load('CTB6_CONVSEG')
tagger = hanlp.load('CTB5_POS_RNN_FASTTEXT_ZH')

pipeline = hanlp.pipeline() \
    .append(hanlp.utils.rules.split_sentence, output_key='sentences') \
    .append(tokenizer, output_key='tokens') \
    .append(tagger, output_key='part_of_speech_tags')

text = '神州六号于2005年10月12日9:00把两名宇航员送上太空。'

print(pipeline(text))

Another example: 2019年全国导游资格考试考试结果原定于2020年2月21日公布。

For this sentence, the offline version tags the year, month, and day parts of the time expressions correctly.

hankcs · February 7, 2020, 3:31pm

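This has to be fixed at the corpus level, by making the annotation consistent.
Comparing online and offline results is meaningless; see the post "HanLP 2.0 beta online demo is now live" (HanLP2.0测试版在线演示已上线).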
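
Until the annotation is unified, one possible stopgap is a rule-based post-processing step that glues adjacent tokens back together when they jointly spell out a date or time. The sketch below is only an illustration and not part of HanLP: `TIME_PATTERN` and `merge_time_tokens` are hypothetical names, the regex covers only simple numeric forms such as 2005年10月12日9:00, and the input token list is a hand-written stand-in for tokenizer output.

```python
import re

# Hypothetical pattern: one or more consecutive year / month / day / clock units.
# It only covers numeric forms (e.g. 2005年, 10月, 12日, 9:00); extend as needed.
TIME_PATTERN = re.compile(r'(?:\d{2,4}年|\d{1,2}月|\d{1,2}[日号]|\d{1,2}[:：]\d{2})+')


def merge_time_tokens(tokens):
    """Merge adjacent tokens that together form a date/time expression.

    A match is merged only if it starts and ends exactly on token
    boundaries; otherwise the original tokens are left untouched.
    """
    text = ''.join(tokens)

    # Character offset of every token boundary, mapped to its token index.
    boundaries = [0]
    for tok in tokens:
        boundaries.append(boundaries[-1] + len(tok))
    index_of = {offset: i for i, offset in enumerate(boundaries)}

    merged, i = [], 0
    for match in TIME_PATTERN.finditer(text):
        start, end = match.span()
        if start not in index_of or end not in index_of:
            continue  # match cuts through a token, so skip it
        merged.extend(tokens[i:index_of[start]])  # tokens before the match
        merged.append(text[start:end])            # the merged time token
        i = index_of[end]
    merged.extend(tokens[i:])                     # tokens after the last match
    return merged


# Hand-written tokens imitating the split described above ("12" and "日" apart).
tokens = ['神州', '六号', '于', '2005年', '10月', '12', '日', '9:00',
          '把', '两', '名', '宇航员', '送', '上', '太空', '。']
print(merge_time_tokens(tokens))
```

Because the merge runs on the token list before tagging, the tagger would then see the whole expression as one token. Whether it labels that token as a temporal noun still depends on the tagging model and its training corpus.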

fancy · February 19, 2020, 12:00pm

Hello, does pyhanlp support pipelines?

hankcs · February 19, 2020, 2:31pm

github.com

hankcs/HanLP/blob/69329f954cae23fcca0b62761f5f969f3e457009/src/main/java/com/hankcs/hanlp/seg/SegmentPipeline.java#L24


import com.hankcs.hanlp.corpus.document.sentence.word.Word;
import com.hankcs.hanlp.corpus.tag.Nature;
import com.hankcs.hanlp.seg.common.Term;
import com.hankcs.hanlp.tokenizer.pipe.Pipe;


import java.util.*;


/**
 * @author hankcs
 */
public class SegmentPipeline extends Segment implements Pipe<String, List<Term>>, List<Pipe<List<IWord>, List<IWord>>>
{
    Pipe<String, List<IWord>> first;
    Pipe<List<IWord>, List<Term>> last;
    List<Pipe<List<IWord>, List<IWord>>> pipeList;


    private SegmentPipeline(Pipe<String, List<IWord>> first, Pipe<List<IWord>, List<Term>> last)
    {
        this.first = first;
        this.last = last;
        pipeList = new ArrayList<Pipe<List<IWord>, List<IWord>>>();

hankcs · April 29, 2021, 2:42pm

I noticed this thread has been referenced elsewhere. The issue is actually handled well in 2.1: https://hanlp.hankcs.com/?sentence=神州六号于2005年10月12日9%3A00把两名宇航员送上太空。

So I'm closing the thread.
