HanLP model licensing question

RogerDavidClarke · May 6, 2020, 12:03pm

Hello,

My apologies, I don’t speak Chinese, so hopefully someone will be able to answer my question. I understand that HanLP is released under the Apache licence. My question is: does the licence cover only the Python package, or also all of the models (tokeniser and tagger) that I can download?

For example:
import hanlp
tokenizer = hanlp.load('PKU_NAME_MERGED_SIX_MONTHS_CONVSEG')
tagger = hanlp.load(hanlp.pretrained.pos.CTB5_POS_RNN_FASTTEXT_ZH)

Thanks,
Roger

hankcs · May 7, 2020, 12:40am

Hi Roger,

Thank you for asking. It’s an open question: I’m not sure whether a model trained on a corpus must inherit the licence of that corpus. Stanford University is in exactly the same situation as us. Their Stanza documentation says:

The copyright and licensing status of machine learning models is not very clear (to us). We list in the table below the Treebank License of the underlying data from which each language pack (set of machine learning models for a treebank) was trained. To the extent that The Trustees of Leland Stanford Junior University have ownership and rights over these language packs, all these Stanza language packs are made available under the Open Data Commons Attribution License v1.0.

We have the research licence for the corpora but the licence doesn’t permit commercial use. It’s better to assume the Apache Licence doesn’t apply to the models.

RogerDavidClarke · May 7, 2020, 8:33am

Thanks for the detailed response. Much appreciated.

gary02 · May 31, 2023, 6:40am

I’d like to ask which licence the current fine_electra_small_20220615_231803 model is under, and which corpus it was trained on. Thanks! @hankcs

gary02 · May 31, 2023, 7:10am

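The "tok" page of the HanLP Documentation mentions that fine_electra_small_20220615_231803 was trained on fine-grained CWS corpora, but I couldn’t find any information about those CWS corpora. @hankcs (the mention above probably didn’t go through)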

hankcs · June 2, 2023, 7:25pm

Released more accurate coarse-grained and fine-grained Chinese word segmentation models (HanLP)

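Initially, through the survey "[Poll] Is HanLP 2.1’s fine-grained standard not fine enough?", we found that 75% of users were satisfied with HanLP’s fine-grained segmentation standard, while the remaining 25% all felt that it was somewhat coarse. In addition, AMR and SRL users reported that the corresponding models handled negation poorly: "AMR parses dates incorrectly · Issue #1740 · hankcs/HanLP · GitHub" and "SRL handling of negation components". After investigation, both turned out to be related to the standard used by the fine-grained model.

To address this, we organised people to proofread our internal corpus of 100 million characters. Using the CTB and MSR standards as the references for fine-grained and coarse-grained segmentation respectively, we corrected the samples in the corpus that did not conform. On the corrected corpus we trained and released small-size tokenization models, with a gratifying improvement in accuracy:

Version   Coarse-grained F1   Fine-grained F1
Old       96.92               97.42
New       98.30               98.11

From now on, HanLP’s segmentation standard will take this update as its foundation. The MTL model is being trained and is expected to be updated together with the online RESTful service in about a week. Finally, thank you to all the users who pursue the most professional and most advanced NLP technology together with HanLP.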
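As a quick check on the F1 figures quoted above, here is a minimal Python sketch that computes the absolute gain and the relative reduction in error for both models. Treating `100 - F1` as the "error" is an assumption made for illustration, and the `summarize` helper is hypothetical; neither comes from the HanLP post.

```python
# F1 scores (in percent) as quoted in the post above.
scores = {
    "coarse": {"old": 96.92, "new": 98.30},
    "fine": {"old": 97.42, "new": 98.11},
}


def summarize(old: float, new: float) -> tuple[float, float]:
    """Return (absolute gain in F1 points, relative error reduction in percent).

    Assumption: "error" is taken to be 100 - F1.
    """
    gain = new - old
    old_error = 100.0 - old
    new_error = 100.0 - new
    reduction = (old_error - new_error) / old_error * 100.0
    return round(gain, 2), round(reduction, 1)


for name, s in scores.items():
    gain, reduction = summarize(s["old"], s["new"])
    print(f"{name}: +{gain} F1 points, {reduction}% lower error")
```

Under that assumption, the coarse-grained model gains 1.38 points (about 45% lower error) and the fine-grained model gains 0.69 points (about 27% lower error).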

Since we are the authors of the corpus described above, we have now set the licence of this model to Apache License 2.0, so there is no copyright dispute.
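For anyone tracking this in their own project, the thread’s conclusions could be recorded in a small lookup like the sketch below. The dictionary, the default note, and the `licence_note` helper are all hypothetical names made up for illustration; this is not an official HanLP registry and not legal advice. The only entry marked Apache-2.0 is the one model confirmed in this thread, and every other identifier falls back to the conservative assumption given earlier in the thread.

```python
# Licence notes taken from this forum thread only (assumption: nothing has
# changed since). Not an official HanLP registry and not legal advice.
MODEL_LICENCE_NOTES = {
    # Confirmed by hankcs in this thread: trained on HanLP's own corpus.
    "fine_electra_small_20220615_231803": "Apache-2.0",
}

# Conservative default for models not listed above, following the earlier
# reply: research-licensed corpora, so assume Apache-2.0 does not apply.
DEFAULT_NOTE = "unclear: assume Apache-2.0 does not apply to this model"


def licence_note(model_id: str) -> str:
    """Return the licence note recorded for a model identifier."""
    return MODEL_LICENCE_NOTES.get(model_id, DEFAULT_NOTE)


print(licence_note("fine_electra_small_20220615_231803"))
print(licence_note("PKU_NAME_MERGED_SIX_MONTHS_CONVSEG"))
```

Defaulting to "unclear" keeps the lookup on the safe side: a model only counts as Apache-licensed once someone has confirmed it explicitly.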