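Hi Dr. He,
Long time no see. I recently switched to a Lenovo Legion Y9000X 2021 laptop with an NVIDIA GeForce RTX 2060 Max-Q GPU, so I installed HanLP 2.1 to try it out. I noticed that the parse_longest() function in trie.py returns only the value along with its begin/end offsets, not the matched key: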
    def parse_longest(self, text: Sequence[str]) -> List[Tuple[int, int, Any]]:
        """Longest-prefix-matching which tries to match the longest keyword sequentially from the head of the text till
        its tail. By definition, the matches won't overlap with each other.

        Args:
            text: A piece of text. In HanLP's design, it doesn't really matter whether this is a str or a list of str.
                The trie will transit on either types properly, which means a list of str simply defines a list of
                transition criteria while a str defines each criterion as a character.

        Returns:
            A tuple of ``(begin, end, value)``.
        """
        found = []
        i = 0
        while i < len(text):
            state = self.transit(text[i])
            if state:
                to = i + 1
                end = to
                value = state._value
                for to in range(i + 1, len(text)):
                    state = state.transit(text[to])
                    if not state:
                        break
                    if state._value is not None:
                        value = state._value
                        end = to + 1
                if value is not None:
                    found.append((i, end, value))
                    i = end - 1
            i += 1
        return found
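In my applied research on recognizing the names of goods and services on invoices, a large number of proper names for goods and services have to be recognized through a user-defined dictionary. After word segmentation and part-of-speech tagging, I build a syntax tree and a semantic graph, and then write algorithms that extract the goods and services names from the segmentation and tagging results. I therefore need both the key and the value (word / part of speech) to be returned. I took the corresponding function from HanLP 2.0 and renamed it to keep the two apart: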
    # Added by Jean for returning key and value together
    def parse_longest2(self, text: Sequence[str]) -> List[Tuple[Union[str, Sequence[str]], Any, int, int]]:
        found = []
        i = 0
        while i < len(text):
            state = self.transit(text[i])
            if state:
                to = i + 1
                end = to
                value = state._value
                for to in range(i + 1, len(text)):
                    state = state.transit(text[to])
                    if not state:
                        break
                    if state._value is not None:
                        value = state._value
                        end = to + 1
                if value is not None:
                    found.append((text[i:end], value, i, end))
                    i = end - 1
            i += 1
        return found
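As a stopgap that needs no change to the library, the key can probably also be recovered from the existing output by slicing the input with the returned offsets. Below is a minimal sketch, assuming parse_longest() keeps returning (begin, end, value) tuples as in the code quoted above. The helper name attach_keys and the sample matches are hypothetical, made up for illustration, and not part of HanLP:

```python
from typing import Any, List, Sequence, Tuple, Union

Key = Union[str, Sequence[str]]


def attach_keys(text: Key, matches: List[Tuple[int, int, Any]]) -> List[Tuple[Key, Any, int, int]]:
    """Convert (begin, end, value) matches into (key, value, begin, end).

    Hypothetical helper: the key is simply the slice text[begin:end], so it
    works for both a str and a list of tokens.
    """
    return [(text[begin:end], value, begin, end) for begin, end, value in matches]


# Hand-written matches in the (begin, end, value) shape, standing in for
# the result of trie.parse_longest(text); the "n" tags are sample data.
text = "购买办公用品和打印机"
matches = [(2, 6, "n"), (7, 10, "n")]
print(attach_keys(text, matches))
```

This yields the same tuple layout as parse_longest2 above, at the cost of one extra pass over the matches.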
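I think many people have similar needs, so this is a fairly general requirement. I hope a later HanLP 2.1 release can merge this code and provide such support. Thanks!
For the past six months I have mainly been working on invoice transaction network analysis, so I put off testing HanLP 2.1. Some APIs differ slightly between HanLP 2.0 and 2.1, and it took me two days of tracking down errors before the test examples for GPU-parallel word segmentation, part-of-speech tagging, and so on ran successfully. Not too bad. 8-)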