How do I map the tokens produced by HanLP tokenization back to offsets in the original string?

Sequence labeling is one case where this is needed: the ground-truth labels have to be aligned with the HanLP tokenization.

See the end of the documentation:

Hello, thanks for the reply. I want to use this in a multi-task pipeline, but it doesn't seem to work.
My code:

import hanlp
HanLP = hanlp.load(hanlp.pretrained.mtl.UD_ONTONOTES_TOK_POS_LEM_FEA_NER_SRL_DEP_SDP_CON_MMINILMV2L6)
# Sets the flag on the pipeline-level config; the tok output below still has no offsets
HanLP.config.output_spans = True
result = HanLP(['round trip fares from baltimore to philadelphia less than 1000 dollars round trip',
                'give me all flights from new york city to las vegas that arrive on a sunday'])
print(result['tok'][0])

Output (no offsets):

['round', 'trip', 'fares', 'from', 'baltimore', 'to', 'philadelphia', 'less', 'than', '1000', 'dollars', 'round', 'trip']
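When only plain token strings come back, as in the output above, the offsets can often be recovered by scanning the original string from left to right. The sketch below is not a HanLP feature; `token_offsets` is a hypothetical helper, and it assumes every token appears verbatim in the text and in order, which may not hold if the tokenizer normalizes characters.

```python
def token_offsets(text, tokens):
    """Return (token, start, end) triples, with end exclusive.

    Assumption: each token is a verbatim substring of text and the
    tokens appear in their original order.
    """
    spans = []
    cursor = 0  # position just past the previously matched token
    for token in tokens:
        start = text.find(token, cursor)
        if start < 0:
            raise ValueError(f"token {token!r} not found after position {cursor}")
        end = start + len(token)
        spans.append((token, start, end))
        cursor = end
    return spans


text = "round trip fares from baltimore to philadelphia"
tokens = ["round", "trip", "fares", "from", "baltimore", "to", "philadelphia"]
spans = token_offsets(text, tokens)
for token, start, end in spans:
    print(f"{token}[{start}:{end}]")
```

Searching from `cursor` rather than from the start of the string keeps repeated tokens, such as the second "round trip" in the example sentence, from matching their first occurrence.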

You need to follow the MTL tutorial:

Thanks! Here is the correct code:

import hanlp
HanLP = hanlp.load(hanlp.pretrained.mtl.UD_ONTONOTES_TOK_POS_LEM_FEA_NER_SRL_DEP_SDP_CON_MMINILMV2L6)
# Set output_spans on the tokenization task's own config, not on the pipeline config
HanLP['tok'].config.output_spans = True

After that, just call HanLP as usual and the token offsets show up in the output :grinning:
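For the sequence labeling use case from the original question, token offsets can then be lined up with character-level ground-truth spans. The following is a minimal sketch, assuming each token arrives as a `[token, start, end]` triple with an exclusive end offset (worth verifying against your HanLP version). The helper `spans_to_bio`, the labels `fromloc` and `toloc`, and the hand-written spans are all illustrative and not part of HanLP.

```python
def spans_to_bio(token_spans, entities):
    """Convert character-level entity spans into per-token BIO tags.

    token_spans: list of (token, start, end), end exclusive
    entities:    list of (start, end, label), end exclusive
    Tokens that only partly overlap an entity are left as "O".
    """
    tags = ["O"] * len(token_spans)
    for ent_start, ent_end, label in entities:
        inside = False  # becomes True once the first token of this entity is tagged
        for i, (_, tok_start, tok_end) in enumerate(token_spans):
            if tok_start >= ent_start and tok_end <= ent_end:
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return tags


text = "flights from new york city to las vegas"
# Hand-written spans in the assumed [token, start, end] shape
token_spans = [
    ("flights", 0, 7), ("from", 8, 12), ("new", 13, 16), ("york", 17, 21),
    ("city", 22, 26), ("to", 27, 29), ("las", 30, 33), ("vegas", 34, 39),
]
# Hypothetical ground truth: character spans with slot labels
entities = [(13, 26, "fromloc"), (30, 39, "toloc")]

tags = spans_to_bio(token_spans, entities)
for (token, _, _), tag in zip(token_spans, tags):
    print(token, tag)
```

Requiring a token to lie fully inside an entity is the conservative choice; if the gold boundaries and the tokenization disagree, those tokens stay "O" and can be inspected separately.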