Making NER task models take an external dictionary into account

dengxc1220 · December 8, 2019, 3:57am

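Thanks to hankcs for this book, which makes it easy for us beginners to learn through hands-on practice!
My question: in an NER task, can a trained model such as CRF be used as a tokenizer to segment text, and can it take a domain-specific dictionary into account while segmenting?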

hankcs · December 8, 2019, 5:49am

You're welcome. CRF and the other models all have lexical analysis interfaces, and every tokenizer in HanLP supports custom dictionaries.

lingvisa · January 8, 2020, 5:46pm

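Is there an example showing a custom user dictionary improving segmentation? Are the entries in a user dictionary used as features when training the model, or are they applied directly during segmentation?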

lingvisa · January 9, 2020, 6:20pm

"Every tokenizer in HanLP supports custom dictionaries": is custom dictionary support still under development in version 2.0? pynlp seems to have it.

hankcs · January 10, 2020, 7:33am

https://github.com/hankcs/HanLP/blob/b9d02cf4ca07bcfa88c55fe25b03e1c478387984/tests/demo/zh/demo_cws_trie.py#L13


# Date: 2019-12-28 21:25
from hanlp.common.trie import Trie

import hanlp

tokenizer = hanlp.load('PKU_NAME_MERGED_SIX_MONTHS_CONVSEG')
text = 'NLP统计模型没有加规则，聪明人知道自己加。英文、数字、自定义词典统统都是规则。'
print(tokenizer(text))

trie = Trie()
trie.update({'自定义': 'custom', '词典': 'dict', '聪明人': 'smart'})


def split_sents(text: str, trie: Trie):
    words = trie.parse_longest(text)
    sents = []
    pre_start = 0
    offsets = []
    for word, value, start, end in words:
        if pre_start != start:
            sents.append(text[pre_start: start])

lingvisa · January 19, 2020, 8:22am

I ran this example, but the result I got was None:
In [37]: from hanlp.common.trie import Trie
    ...:
    ...: import hanlp
    ...:
    ...: tokenizer = hanlp.load('PKU_NAME_MERGED_SIX_MONTHS_CONVSEG')
    ...: text = 'NLP统计模型没有加规则，聪明人知道自己加。英文、数字、自定义词典统统都是规则。'
    ...: print(tokenizer(text))
    ...:
    ...: trie = Trie()
    ...: trie.update({'自定义': 'custom', '词典': 'dict', '聪明人': 'smart'})
    ...:
    ...:
    ...: def split_sents(text: str, trie: Trie):
    ...:     words = trie.parse_longest(text)
    ...:     sents = []
    ...:     pre_start = 0
    ...:     offsets = []
    ...:     for word, value, start, end in words:
    ...:         if pre_start != start:
    ...:             sents.append(text[pre_start: start])
    ...:

['NLP', '统计', '模型', '没有', '加', '规则', '，', '聪明人', '知道', '自己', '加', '。', '英文', '、', '数字', '、', '自定义', '词典', '统统', '都', '是', '规则', '。']

In [38]: print(split_sents(text, trie))
None

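Is this function meant to demonstrate splitting sentences with a custom dictionary? How does this dictionary work together with HanLP's own default segmentation algorithm?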

hankcs · January 19, 2020, 4:16pm

https://github.com/hankcs/HanLP/blob/b9d02cf4ca07bcfa88c55fe25b03e1c478387984/tests/demo/zh/demo_cws_trie.py#L43




def merge_parts(parts, offsets, words):
    items = [(i, p) for (i, p) in zip(offsets, parts)]
    items += [(start, [word]) for (word, value, start, end) in words]
    # In case you need the tag, use the following line instead
    # items += [(start, [(word, value)]) for (word, value, start, end) in words]
    return [each for x in sorted(items) for each in x[1]]


tokenizer = hanlp.pipeline() \
    .append(split_sents, output_key=('parts', 'offsets', 'words'), trie=trie) \
    .append(tokenizer, input_key='parts', output_key='tokens') \
    .append(merge_parts, input_key=('tokens', 'offsets', 'words'), output_key='merged')

print(tokenizer(text))

How it feels to watch a user test your product for the first time
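The earlier excerpt of split_sents stops before any return statement, which is why calling it on its own printed None. The sketch below shows the whole split, tokenize, merge idea in plain Python so it can be run without downloading a model. It is only an illustration under assumptions: parse_longest here is a hand-written greedy longest match standing in for Trie.parse_longest, the tail of split_sents is reconstructed to fit the three output keys used in the pipeline above, and toy_tokenizer is a placeholder that splits text into characters.

```python
from typing import Callable, Dict, List, Tuple

Match = Tuple[str, str, int, int]  # (word, value, start, end)


def parse_longest(text: str, dictionary: Dict[str, str]) -> List[Match]:
    """Greedy left-to-right longest match (stand-in for Trie.parse_longest)."""
    matches: List[Match] = []
    max_len = max(map(len, dictionary), default=0)
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + size]
            if word in dictionary:
                matches.append((word, dictionary[word], i, i + size))
                i += size
                break
        else:  # no dictionary word starts at position i
            i += 1
    return matches


def split_sents(text: str, dictionary: Dict[str, str]):
    """Cut the text at dictionary words; return the gaps, their offsets, and the words."""
    words = parse_longest(text, dictionary)
    parts, offsets, pre_start = [], [], 0
    for word, value, start, end in words:
        if pre_start != start:
            parts.append(text[pre_start:start])
            offsets.append(pre_start)
        pre_start = end
    if pre_start != len(text):  # trailing text after the last dictionary word
        parts.append(text[pre_start:])
        offsets.append(pre_start)
    return parts, offsets, words


def merge_parts(parts: List[List[str]], offsets: List[int], words: List[Match]) -> List[str]:
    """Interleave tokenized gaps and dictionary words by their start offset."""
    items = list(zip(offsets, parts))
    items += [(start, [word]) for word, value, start, end in words]
    return [token for _, tokens in sorted(items, key=lambda item: item[0]) for token in tokens]


def tokenize_with_dict(text: str, dictionary: Dict[str, str],
                       tokenizer: Callable[[str], List[str]]) -> List[str]:
    parts, offsets, words = split_sents(text, dictionary)
    tokens = [tokenizer(part) for part in parts]
    return merge_parts(tokens, offsets, words)


toy_tokenizer = list  # placeholder for a real model: one token per character
custom = {'自定义': 'custom', '词典': 'dict', '聪明人': 'smart'}
result = tokenize_with_dict('聪明人用自定义词典', custom, toy_tokenizer)
print(result)
```

The dictionary words are never handed to the statistical tokenizer at all; only the text between them is, which is how the dictionary and the model cooperate in this design.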

bigbugbang · September 26, 2020, 7:17am

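May I ask whether named entity recognition can also use a mounted custom dictionary? When I tested with an off-the-shelf model, some names were not recognized. If a custom dictionary can be mounted, is it added with this same method? Your example mounts the dictionary as key-value pairs, so for named entities, what should the key be and what should the value be?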

lan2720 · April 27, 2021, 8:50am

Same question: can custom NER be added in version 2.0?

littlebigus · September 21, 2021, 7:13am

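Hello, does hanlp 2.1 still support loading a custom dictionary in this way? I tried this code but ran into a problem: words = trie.parse_longest(text) returns a set of (begin, end, value), so the line for word, value, start, end in words: no longer works. I changed it to for start, end, value in words:, but then word is no longer returned.
How should this code be modified to run under 2.1?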
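If the matches really do come back as (begin, end, value), the surface word is not lost, because it can be sliced out of the original text. A minimal sketch, assuming that tuple shape as described in the post; the spans set below is hand-written sample data rather than real library output, and with_surface_forms is a hypothetical helper name.

```python
from typing import Iterable, List, Tuple


def with_surface_forms(text: str,
                       spans: Iterable[Tuple[int, int, str]]) -> List[Tuple[str, str, int, int]]:
    """Convert (begin, end, value) spans into (word, value, begin, end) tuples."""
    # sorted() also gives a stable left-to-right order when spans is a set
    return [(text[begin:end], value, begin, end) for begin, end, value in sorted(spans)]


text = '自定义词典统统都是规则'
spans = {(0, 3, 'custom'), (3, 5, 'dict')}  # stand-in for the parse_longest result
words = with_surface_forms(text, spans)
print(words)
```

The resulting tuples have the same (word, value, start, end) layout that the older loop unpacked, so the rest of that code can stay unchanged.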

littlebigus · September 21, 2021, 8:07am

I figured it out: hankcs changed the trie.py file (see the changes to the trie file). My problem came from what I saw in this thread: [https://bbs.hankcs.com/t/topic/2953/11](https://bbs.hankcs.com/t/topic/2953/11)

hankcs · September 21, 2021, 1:05pm

Version 2.1 added dict_whitelist, which simplifies this process:

https://hanlp.hankcs.com/docs/api/hanlp/components/mtl/tasks/ner/tag_ner.html?highlight=dict_whitelist

https://github.com/hankcs/HanLP/blob/4eaf7eef5eb43a241841fc5fa77fe4561b6eb78d/plugins/hanlp_demo/hanlp_demo/zh/demo_ner_dict.py#L7


# -*- coding:utf-8 -*-
# Author: hankcs
# Date: 2021-04-29 11:06
import hanlp
HanLP = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_BASE_ZH)
HanLP['ner/msra'].dict_whitelist = {'午饭后': 'TIME'}
doc = HanLP('2021年测试高血压是138，时间是午饭后2点45，低血压是44', tasks='ner/msra')
doc.pretty_print()
print(doc['ner/msra'])
# See https://hanlp.hankcs.com/docs/api/hanlp/components/mtl/tasks/ner/tag_ner.html
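To make the idea of a whitelist concrete, here is a conceptual sketch of overlaying dictionary entities on model predictions. It is not HanLP's implementation. It assumes that entries are matched against runs of whole tokens and that a matched entry replaces any overlapping prediction; apply_whitelist and the sample tokens are hypothetical.

```python
from typing import Dict, List, Tuple

Entity = Tuple[str, str, int, int]  # (text, label, token_begin, token_end)


def apply_whitelist(tokens: List[str], predicted: List[Entity],
                    whitelist: Dict[str, str]) -> List[Entity]:
    """Overlay whitelist entities on predicted entities (conceptual sketch)."""
    forced: List[Entity] = []
    i = 0
    while i < len(tokens):
        # Try the longest run of whole tokens starting at i first.
        for j in range(len(tokens), i, -1):
            surface = ''.join(tokens[i:j])
            if surface in whitelist:
                forced.append((surface, whitelist[surface], i, j))
                i = j
                break
        else:  # nothing in the whitelist starts at token i
            i += 1
    # Keep only predictions that do not overlap a forced entity.
    kept = [p for p in predicted
            if all(p[3] <= f[2] or p[2] >= f[3] for f in forced)]
    return sorted(kept + forced, key=lambda entity: entity[2])


tokens = ['时间', '是', '午饭', '后', '2点', '45']
predicted = [('2点45', 'TIME', 4, 6)]  # pretend model output
entities = apply_whitelist(tokens, predicted, {'午饭后': 'TIME'})
print(entities)
```

Under the token-run assumption, an entry that does not line up with token boundaries never matches: with the single token '午饭后2点', the entry '午饭后' would be skipped.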

littlebigus · September 22, 2021, 2:24am

Wow, this design is great; the process is much clearer now.

zhubohan19981126 · September 23, 2021, 1:44am

A lifesaver for newbies. Thank you, teacher!

littlebigus · October 21, 2021, 5:54am

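A question for the teacher: can a user dictionary be loaded in this way for tokenization in hanlp 2.0? I set HanLP['tok/fine'].dict_whitelist = self_dict directly, but it has no effect. What should I do?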

songsh · March 17, 2022, 7:28am

Some words have no effect even after being added.

shguan2018 · September 11, 2024, 4:10am

How do I add custom entity words like 一年内, 两年内, 三年内 ("within one/two/three years")? Regular expressions are not supported.
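One possible workaround, assuming the whitelist accepts a plain dict of string to label as in the {'午饭后': 'TIME'} demo above, is to enumerate the pattern into explicit entries. A second option is to run a regular expression over the raw text as a separate post-processing step. Both are sketched below; the 'DURATION' label and the function names are placeholders, not names confirmed by the thread.

```python
import re
from typing import Dict, List, Tuple

CN_NUMERALS = '一二两三四五六七八九十'


def build_year_whitelist(label: str = 'DURATION') -> Dict[str, str]:
    """Enumerate 'N年内' entries so a plain dictionary can stand in for a regex."""
    return {f'{numeral}年内': label for numeral in CN_NUMERALS}


YEAR_PATTERN = re.compile(f'[{CN_NUMERALS}]+年内')


def find_year_spans(text: str, label: str = 'DURATION') -> List[Tuple[str, str, int, int]]:
    """Regex post-processing: return (text, label, char_begin, char_end) matches."""
    return [(m.group(), label, m.start(), m.end()) for m in YEAR_PATTERN.finditer(text)]


whitelist = build_year_whitelist()
spans = find_year_spans('一年内两年内，三年内')
print(len(whitelist), spans)
```

Enumeration only covers the forms that are listed, whereas the regex step also catches multi-character numerals such as 十二年内, at the cost of producing character offsets that still have to be reconciled with the model's output.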

shguan2018 · September 13, 2024, 1:14pm

NER fine-tuning now succeeds, but loading the model raises an error.

shguan2018 · September 13, 2024, 1:16pm

https://github.com/hankcs/HanLP/blob/doc-zh/plugins/hanlp_demo/hanlp_demo/zh/train/finetune_ner.py

# -*- coding:utf-8 -*-
# Author: hankcs
# Date: 2023-10-18 18:49
import os

import hanlp
from hanlp.components.ner.transformer_ner import TransformerNamedEntityRecognizer
from tests import cdroot

cdroot()

your_training_corpus = 'data/ner/finetune/word_to_iobes.tsv'
your_development_corpus = your_training_corpus  # Use a different one in reality
save_dir = 'data/ner/finetune/model'

if not os.path.exists(your_training_corpus):
    os.makedirs(os.path.dirname(your_training_corpus), exist_ok=True)
    with open(your_training_corpus, 'w') as out:
        out.write(
'''训练\tB-NLP

This file has been truncated.

shguan2018 · September 19, 2024, 11:49am

Finally got it working, with ner.load().