How to merge the model's tokenization results against a custom dictionary as the last step of a HanLP 2.0 pipeline?

I want a custom dictionary in the pipeline to take effect *after* the model, post-merging the model's tokenization results, the way HanLP 1.x does.

from hanlp.common.trie import Trie

import hanlp

tokenizer = hanlp.load('PKU_NAME_MERGED_SIX_MONTHS_CONVSEG')
text = '我国新能源汽车产量突破2亿台产量,是不是聪明人?NLP统计模型没有加规则,聪明人知道自己加。英文、数字、自定义词典统统都是规则。'
print(tokenizer(text))

trie = Trie()
trie.update({'自定义': 'custom', '词典': 'dict', '聪明人': 'smart', '国新能源': 'stock'})

def split_sents(text: str, trie: Trie):
    # Pre-split the text on dictionary hits; return the non-dictionary
    # spans, their offsets, and the matched dictionary words.
    words = trie.parse_longest(text)
    sents = []
    pre_start = 0
    offsets = []
    for word, value, start, end in words:
        if pre_start != start:
            sents.append(text[pre_start: start])
            offsets.append(pre_start)
        pre_start = end
    if pre_start != len(text):
        sents.append(text[pre_start:])
        offsets.append(pre_start)
    return sents, offsets, words

print(split_sents(text, trie))

def merge_parts(parts, offsets, words):
    # Re-interleave model output and dictionary matches by their offsets.
    items = [(i, p) for (i, p) in zip(offsets, parts)]
    # items += [(start, [word]) for (word, value, start, end) in words]
    # In case you need the tag, use the following line instead
    items += [(start, [(word, value)]) for (word, value, start, end) in words]
    return [each for x in sorted(items) for each in x[1]]

tokenizer = hanlp.pipeline() \
    .append(split_sents, output_key=('parts', 'offsets', 'words'), trie=trie) \
    .append(tokenizer, input_key='parts', output_key='tokens') \
    .append(merge_parts, input_key=('tokens', 'offsets', 'words'), output_key='merged')

print(tokenizer(text))

Output:
"merged": [
    "我",
    ["国新能源", "stock"],
    "汽车",
    "产量",
    "突破",
    "2亿",
    "台",
    "产量",
    ",",
    "是",
    "不",
    "是",
    ["聪明人", "smart"],


Expected:
The entry "国新能源" should not take effect here. How can I merge against the dictionary *after* the model has segmented, and do it "efficiently", Trie-style?
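For reference, a minimal sketch of the kind of post-merge I have in mind (plain Python, no HanLP API involved; `merge_by_dict` is a name I made up): run forward longest matching over the model's *token list*, merging a run of adjacent tokens only when their concatenation is a dictionary entry.

```python
def merge_by_dict(tokens, dictionary):
    """Merge adjacent model tokens whose concatenation is a dictionary
    entry, using forward longest matching over whole tokens only."""
    # The longest dictionary entry (in characters) bounds the lookahead.
    max_len = max((len(w) for w in dictionary), default=0)
    merged = []
    i = 0
    while i < len(tokens):
        # Default: keep the single model token unchanged.
        best_end, best_word = i + 1, tokens[i]
        concat = tokens[i]
        for j in range(i + 1, len(tokens)):
            concat += tokens[j]
            if len(concat) > max_len:
                break  # no longer entry can match
            if concat in dictionary:
                best_end, best_word = j + 1, concat
        merged.append(best_word)
        i = best_end
    return merged

print(merge_by_dict(['自', '定义', '词典'], {'自定义', '词典'}))
# → ['自定义', '词典']
print(merge_by_dict(['我国', '新能源', '汽车'], {'国新能源'}))
# → ['我国', '新能源', '汽车']
```

Because matching only happens on whole-token boundaries, "国新能源" can never fire once the model has already produced "我国 / 新能源". Storing the dictionary in a real trie instead of a set would let the inner loop stop as soon as the concatenation leaves the trie, rather than relying on the length cap.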

There are many higher-priority tasks on the 2.x roadmap. A feature like this shouldn't be hard for anyone to write after reading Chapter 2 of 《自然语言处理入门》 or the already-public 1.x code.

OK, reading Chapter 2 now.