Lemmatization results are unsatisfactory

I parsed the following sentence in Demos | Demo:

Leeyay made a whooshing, breathy sound that I thought meant they were either frustrated or sad. “Yes,” they agreed.

The result was as follows:

I don't know why no lemmas are shown for the first half of the sentence.

I also tried the RESTful APIs locally. Here is the "lem" part of the result of print(HanLP("Leeyay made a whooshing, breathy sound that I thought meant they were either frustrated or sad. “Yes,” they agreed.")):

  "lem": [
    ["leeyay", "made", "a", "whooshing", ",", "breathy", "sound", "that", "i", "thought", "meant", "they", "be", "either", "frustrated", "or", "sad", "."],
    ["“", "yes", ",", "”", "they", "agreed", "."]
  ],

Almost all of them are wrong.
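To make the problem concrete, here is a rough check of which lemmas are anything more than a lowercased copy of the token. The `lemmas` list is copied from the output above; the `tokens` list is an assumption, reconstructed from the sentence rather than taken from the API response, and `changed_lemmas` is just a helper name made up for this sketch.

```python
# Assumption: tokens reconstructed by hand from the input sentence,
# not copied from the actual "tok" output of the API.
tokens = [
    ["Leeyay", "made", "a", "whooshing", ",", "breathy", "sound", "that", "I",
     "thought", "meant", "they", "were", "either", "frustrated", "or", "sad", "."],
    ["“", "Yes", ",", "”", "they", "agreed", "."],
]

# "lem" values copied from the output quoted above.
lemmas = [
    ["leeyay", "made", "a", "whooshing", ",", "breathy", "sound", "that", "i",
     "thought", "meant", "they", "be", "either", "frustrated", "or", "sad", "."],
    ["“", "yes", ",", "”", "they", "agreed", "."],
]


def changed_lemmas(tokens, lemmas):
    """Return (token, lemma) pairs where the lemma is not just the lowercased token."""
    return [
        (tok, lem)
        for sent_toks, sent_lems in zip(tokens, lemmas)
        for tok, lem in zip(sent_toks, sent_lems)
        if tok.lower() != lem
    ]


print(changed_lemmas(tokens, lemmas))
# → [('were', 'be')]
```

Under that assumption, only "were" is actually reduced to a base form; inflected forms such as "made", "thought", "meant" and "agreed" come back unchanged apart from lowercasing.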

The model used online is too small. You could try the offline model UD_ONTONOTES_TOK_POS_LEM_FEA_NER_SRL_DEP_SDP_CON_XLMR_BASE:

import hanlp
from hanlp_common.document import Document

HanLP = hanlp.load(hanlp.pretrained.mtl.UD_ONTONOTES_TOK_POS_LEM_FEA_NER_SRL_DEP_SDP_CON_XLMR_BASE)
doc: Document = HanLP(
    ['Leeyay made a whooshing, breathy sound that I thought meant they were either frustrated or sad.',
     '“Yes,” they agreed.'],
    tasks='ud'
)
doc.pretty_print()

Output:

Dep Tree        	Token     	Relation  	Lemma     	PoS  
────────────────	──────────	──────────	──────────	─────
             ┌─►	Leeyay    	nsubj     	Leeyay    	PROPN
┌┬───────────┴──	made      	root      	make      	VERB 
││       ┌─────►	a         	det       	a         	DET  
││       │┌─►┌──	whooshing 	amod      	whooshing 	ADJ  
││       ││  └─►	,         	punct     	,         	PUNCT
││       ││  ┌─►	breathy   	amod      	breathy   	ADJ  
│└─►┌────┼┴──┴──	sound     	obj       	sound     	NOUN 
│   │    │  ┌──►	that      	nsubj     	that      	PRON 
│   │    │  │┌─►	I         	nsubj     	I         	PRON 
│   │    └─►└┴──	thought   	acl:relcl 	thought   	VERB 
│   └─►┌────────	meant     	acl:relcl 	mean      	VERB 
│      │   ┌───►	they      	nsubj     	they      	PRON 
│      │   │┌──►	were      	cop       	be        	AUX  
│      │   ││┌─►	either    	cc:preconj	either    	CCONJ
│      └─►┌┴┴┴──	frustrated	ccomp     	frustrated	ADJ  
│         │  ┌─►	or        	cc        	or        	CCONJ
│         └─►└──	sad       	conj      	sad       	ADJ  
└──────────────►	.         	punct     	.         	PUNCT

Dep Tr	Token 	Relat	Lemma	PoS  
──────	──────	─────	─────	─────
┌─►┌──	“Yes  	ccomp	“yes 	INTJ 
│  └─►	,”    	punct	,”   	PUNCT
│  ┌─►	they  	nsubj	they 	PRON 
└──┼──	agreed	root 	agree	VERB 
   └─►	.     	punct	.    	PUNCT

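Admittedly, there are still errors. Using a neural model for lemmatization may be overkill. On the other hand, a lemmatization model that covers more than 100 languages is not as accurate as a monolingual one.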
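As a rough illustration of the non-neural alternative, lemmatization for a single language can be approximated with a lookup keyed on the lowercased token and its POS tag. This is only a sketch: the `IRREGULAR` table and the `lemmatize` function are hypothetical names invented here, the handful of entries is hand-picked from the example sentence, and none of this is part of HanLP.

```python
# Hypothetical exception table for illustration only; a real system would
# load a full lexicon for the target language instead of these few entries.
IRREGULAR = {
    ("made", "VERB"): "make",
    ("thought", "VERB"): "think",
    ("meant", "VERB"): "mean",
    ("agreed", "VERB"): "agree",
    ("were", "AUX"): "be",
    ("i", "PRON"): "I",
}


def lemmatize(token, pos):
    """Look up (lowercased token, POS); fall back to the surface form."""
    key = (token.lower(), pos)
    if key in IRREGULAR:
        return IRREGULAR[key]
    # Proper nouns keep their casing; everything else is lowercased.
    return token if pos == "PROPN" else token.lower()


tagged = [("Leeyay", "PROPN"), ("made", "VERB"), ("I", "PRON"),
          ("thought", "VERB"), ("were", "AUX"), ("agreed", "VERB")]
print([lemmatize(tok, pos) for tok, pos in tagged])
# → ['Leeyay', 'make', 'I', 'think', 'be', 'agree']
```

Keying on the POS tag matters because the same surface form can have different lemmas: "thought" tagged as VERB maps to "think", while "thought" tagged as NOUN falls through and stays "thought".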
