[HanLP 2.1.0-alpha.48 Release] Span Outputs and Fast Mirrors

hankcs · June 3, 2021, 3:47am

We have released a minor update of HanLP that brings the following features.

Tokenizers can now be configured to output the span of each token, which makes integration with Elasticsearch possible for v2.1.
- Note that a span such as ["HanLPv2.1", 6, 15] now aligns with the input text even when the text contains spaces.

import hanlp

# Load the multi-task model, then enable span output on the fine-grained tokenizer.
HanLP = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_BASE_ZH)
HanLP['tok/fine'].config.output_spans = True
doc = HanLP(['2021年 HanLPv2.1 为生产环境带来次世代最先进的多语种NLP技术。', '阿婆主来到北京立方庭参观自然语义科技公司。'], tasks='tok')
print(doc)

{
  "tok/fine": [
    [["2021年", 0, 5], ["HanLPv2.1", 6, 15], ["为", 16, 17], ["生产", 17, 19], ["环境", 19, 21], ["带来", 21, 23], ["次", 23, 24], ["世代", 24, 26], ["最", 26, 27], ["先进", 27, 29], ["的", 29, 30], ["多", 30, 31], ["语种", 31, 33], ["NLP", 33, 36], ["技术", 36, 38], ["。", 38, 39]],
    [["阿婆主", 0, 3], ["来到", 3, 5], ["北京", 5, 7], ["立方庭", 7, 10], ["参观", 10, 12], ["自然", 12, 14], ["语义", 14, 16], ["科技", 16, 18], ["公司", 18, 20], ["。", 20, 21]]
  ]
}
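To see what "aligns with the input text" means in practice, here is a minimal sketch that needs no model: it copies the first sentence and its spans from the output above and checks that slicing the text with each [start, end) pair gives back the token. The helper names `spans_align` and `to_offset_records` are hypothetical and not part of HanLP. The `token` / `start_offset` / `end_offset` / `position` field names mirror what Elasticsearch's `_analyze` API returns, which is only an assumption about how one might shape the data for such an integration.

```python
# First input sentence and its spans, copied from the printed output above.
text = '2021年 HanLPv2.1 为生产环境带来次世代最先进的多语种NLP技术。'
spans = [
    ["2021年", 0, 5], ["HanLPv2.1", 6, 15], ["为", 16, 17], ["生产", 17, 19],
    ["环境", 19, 21], ["带来", 21, 23], ["次", 23, 24], ["世代", 24, 26],
    ["最", 26, 27], ["先进", 27, 29], ["的", 29, 30], ["多", 30, 31],
    ["语种", 31, 33], ["NLP", 33, 36], ["技术", 36, 38], ["。", 38, 39],
]


def spans_align(text, spans):
    """Return True if text[start:end] reproduces every token exactly."""
    return all(text[start:end] == token for token, start, end in spans)


def to_offset_records(spans):
    """Hypothetical helper: turn [token, start, end] triples into offset dicts."""
    return [
        {"token": token, "start_offset": start, "end_offset": end, "position": i}
        for i, (token, start, end) in enumerate(spans)
    ]


print(spans_align(text, spans))      # True
print(to_offset_records(spans)[1])   # the record for "HanLPv2.1"
```

The spaces at offsets 5 and 15 belong to no token, which is why the second span starts at 6 rather than 5.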

Improved apostrophe tokenization for English.
An updated Semantic Textual Similarity model for Chinese.
Mirrors of xlm-roberta-base and bert-base-japanese-char for faster loading.

An upgrade is highly recommended.

Godlikemandyy · June 18, 2021, 5:34am

HanLP['tok/fine'].config.output_spans = True
Could someone explain what this command does?

littlebigus · September 22, 2021, 8:28am

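It marks the position of each segmented token in the text. It is off by default, meaning the spans are not shown when you run HanLP; that line of code turns the feature on.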