How do I train on a custom corpus in HanLP 2.0?

Hello Dr. He. Using the default corpora, my entity recognition results for a particular domain are not very good, and adding a custom dictionary only helps a little. Does HanLP 2.0 support training on a custom corpus? Searching the forum, all the questions I found were about corpus annotation, but how do I actually train on a custom corpus? Is there an interface that loads it directly, and what does that look like in Python?


Of course it does; the whole framework is modular. For an example, see:

For the corpus format, see: https://hanlp.hankcs.com/docs/api/hanlp/datasets/index.html
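For illustration, a tiny custom NER corpus in the tab-separated `token\ttag` layout (one token and its BIO tag per line, sentences separated by blank lines) might look like the sketch below. The loader is plain Python written just to show the structure, and the commented-out `fit` call at the end is an assumption about the training API, not a verified signature; consult the datasets page above for the exact format and interface your task expects.

```python
import os
import tempfile

# A two-sentence toy corpus: one token and its BIO tag per line,
# blank line between sentences.
SAMPLE = """\
阿	B-ORG
里	I-ORG
巴	I-ORG
巴	I-ORG
总	O
部	O

在	O
杭	B-LOC
州	I-LOC
"""

def load_tsv_ner(path):
    """Parse a token<TAB>tag file into a list of (tokens, tags) sentences."""
    sentences, tokens, tags = [], [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:  # blank line ends the current sentence
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            token, tag = line.split('\t')
            tokens.append(token)
            tags.append(tag)
    if tokens:  # flush the last sentence if the file has no trailing blank line
        sentences.append((tokens, tags))
    return sentences

path = os.path.join(tempfile.mkdtemp(), 'trn.tsv')
with open(path, 'w', encoding='utf-8') as f:
    f.write(SAMPLE)

sentences = load_tsv_ner(path)
print(len(sentences))  # 2
print(sentences[1])    # (['在', '杭', '州'], ['O', 'B-LOC', 'I-LOC'])

# Hypothetical training call (names are illustrative -- check the docs):
# recognizer = TransformerNamedEntityRecognizer()
# recognizer.fit(trn_data=path, dev_data=path, save_dir='ner_model',
#                transformer='bert-base-chinese')
```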


I have the same scenario. In that demo, the script runs fine when only the NER task is kept, but as soon as I add `crf=True` to the NER task it throws an exception:

```
Traceback (most recent call last):
  File "/Users/onion/Library/Application Support/IntelliJIdea2019.3/python/helpers/pydev/pydevd.py", line 1434, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/Users/onion/Library/Application Support/IntelliJIdea2019.3/python/helpers/pydev/pydev_imps/pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/onion/Work/NLP/HanlpV2/HanLP/plugins/hanlp_demo/hanlp_demo/zh/train/ner_demov4.py", line 33, in <module>
    mtl.fit(
  File "/Users/onion/Work/NLP/HanlpV2/HanLP/hanlp/components/mtl/multi_task_learning.py", line 641, in fit
    return super().fit(**merge_locals_kwargs(locals(), kwargs, excludes=('self', 'kwargs', '__class__', 'tasks')),
  File "/Users/onion/Work/NLP/HanlpV2/HanLP/hanlp/common/torch_component.py", line 276, in fit
    criterion = self.build_criterion(**merge_dict(config, trn=trn))
  File "/Users/onion/Work/NLP/HanlpV2/HanLP/hanlp/components/mtl/multi_task_learning.py", line 265, in build_criterion
    return dict((k, v.build_criterion(decoder=self.model.decoders[k], **kwargs)) for k, v in self.tasks.items())
  File "/Users/onion/Work/NLP/HanlpV2/HanLP/hanlp/components/mtl/multi_task_learning.py", line 265, in <genexpr>
    return dict((k, v.build_criterion(decoder=self.model.decoders[k], **kwargs)) for k, v in self.tasks.items())
  File "/Users/onion/Work/NLP/HanlpV2/HanLP/hanlp/components/taggers/tagger.py", line 33, in build_criterion
    model = self.model
AttributeError: 'TaggingNamedEntityRecognition' object has no attribute 'model'
```

It looks like `TaggingNamedEntityRecognition` reads the `model` attribute before the model has actually been built?
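The ordering problem described above can be reproduced with a plain-Python sketch (made-up class names, not HanLP's actual code): `build_criterion()` reads `self.model`, but `fit()` only assigns that attribute after the criterion is built, so the first access raises the same `AttributeError` as in the traceback.

```python
class Tagger:
    def build_criterion(self, **kwargs):
        # Fails if fit() has not assigned self.model yet.
        model = self.model
        return f'criterion for {model}'

    def fit(self):
        # Bug: the criterion is built BEFORE the model attribute exists.
        criterion = self.build_criterion()
        self.model = 'decoder'
        return criterion

try:
    Tagger().fit()
except AttributeError as e:
    print(e)  # 'Tagger' object has no attribute 'model'
```

Reordering `fit()` so the model is assigned before `build_criterion()` is called (or passing the model in explicitly) makes the sketch run through, which matches the diagnosis above.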

Thanks for the report; this has been fixed:

👍 I was still digging into the exception thrown from `decode` when I saw you had already fixed it here. Very quick turnaround!