[HanLP 2.1.0-alpha.48 Release] Span Outputs and Fast Mirrors

We have released a minor update of HanLP with the following new features.

  • Tokenizers can now be configured to output the span of each token, which makes integration with Elasticsearch possible for v2.1.
    • Note that the span ["HanLPv2.1", 6, 15] now aligns with the input text even when the text contains spaces.
import hanlp

HanLP = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_BASE_ZH)
# Make the fine-grained tokenizer return [token, start, end] instead of plain tokens
HanLP['tok/fine'].config.output_spans = True
doc = HanLP(['2021年 HanLPv2.1 为生产环境带来次世代最先进的多语种NLP技术。', '阿婆主来到北京立方庭参观自然语义科技公司。'], tasks='tok')
print(doc)

{
  "tok/fine": [
    [["2021年", 0, 5], ["HanLPv2.1", 6, 15], ["为", 16, 17], ["生产", 17, 19], ["环境", 19, 21], ["带来", 21, 23], ["次", 23, 24], ["世代", 24, 26], ["最", 26, 27], ["先进", 27, 29], ["的", 29, 30], ["多", 30, 31], ["语种", 31, 33], ["NLP", 33, 36], ["技术", 36, 38], ["。", 38, 39]],
    [["阿婆主", 0, 3], ["来到", 3, 5], ["北京", 5, 7], ["立方庭", 7, 10], ["参观", 10, 12], ["自然", 12, 14], ["语义", 14, 16], ["科技", 16, 18], ["公司", 18, 20], ["。", 20, 21]]
  ]
}
  • Improved apostrophe tokenization for English.

  • An updated Semantic Textual Similarity model for Chinese.

  • Mirrors of xlm-roberta-base and bert-base-japanese-char for faster loading.
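For the Elasticsearch use case mentioned in the first item, the span triples can be reshaped into offset records. The following is only a rough sketch: `to_offset_tokens` is a hypothetical helper, not part of HanLP, and the field names simply mirror those that Elasticsearch's `_analyze` API reports for its own tokens. How the records are actually fed into Elasticsearch is not covered here and depends on your setup.

```python
def to_offset_tokens(spans):
    """Hypothetical helper: turn [token, start, end] triples into offset records."""
    return [
        {"token": token, "start_offset": start, "end_offset": end, "position": i}
        for i, (token, start, end) in enumerate(spans)
    ]

# The first three spans of the second sentence, copied from the output above
spans = [["阿婆主", 0, 3], ["来到", 3, 5], ["北京", 5, 7]]
for record in to_offset_tokens(spans):
    print(record)
```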

An upgrade is highly recommended.

HanLP['tok/fine'].config.output_spans = True
Could someone explain what this command does?

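It marks the position of each segmented token in the text. The option is off by default, so the positions are not shown when you run HanLP; that line of code turns the feature on.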
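To see what these positions mean in practice, here is a small sketch that does not need HanLP installed. It takes the first sentence and a few spans from the announcement's sample output and checks that slicing the text with each span gives back the token. Judging from that output, the offsets are character indices with an exclusive end; `check_spans` is a name made up for this illustration.

```python
# Text and spans copied from the sample output in the announcement
text = '2021年 HanLPv2.1 为生产环境带来次世代最先进的多语种NLP技术。'
spans = [["2021年", 0, 5], ["HanLPv2.1", 6, 15], ["为", 16, 17], ["生产", 17, 19]]

def check_spans(text, spans):
    """Return True if every [token, start, end] span slices back to its token."""
    return all(text[start:end] == token for token, start, end in spans)

print(check_spans(text, spans))
```

Because the spaces around "HanLPv2.1" are counted, the span starts at 6 rather than 5, which is the alignment the announcement refers to.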