Text segmentation algorithms may work better for some tasks than others. For which NLP tasks does HanLP work well or best? Is there a general pattern? Or does it work well for all NLP tasks?
Segmentation itself is an NLP task, so I assume you’re asking about languages. Accuracy on each language is mostly impacted by two major factors.
- The size of the corpora used. We annotated the largest amount of Chinese corpora, so HanLP’s Chinese tokenizers could be the best in the world.
- Subword tokenization. The subword tokenization of BERT/miniLM in English and other Latin-script languages is not linguistically sound, so it’s difficult to recover from errors made at the subword tokenization stage.
In terms of algorithms, the differences are not statistically significant, given that all are powered by pretrained language models. You may see lots of “SoTA” papers every month, but none of them makes a significant impact on a larger dataset, not to mention that many are not reproducible.
The same story holds for other NLP tasks. HanLP is designed to deliver a good balance of accuracy and efficiency, so it’s always recommended to test these two metrics on your real-world data.
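To make the accuracy half of that measurable, here is a minimal sketch (not part of HanLP, and the `gold`/`pred` sentences are just made-up examples) of scoring any tokenizer against a small hand-annotated sample using word-span F1, the standard metric for Chinese word segmentation; efficiency can be measured alongside it with ordinary wall-clock timing:

```python
def spans(tokens):
    """Convert a token list to a set of (start, end) character spans."""
    out, pos = set(), 0
    for tok in tokens:
        out.add((pos, pos + len(tok)))
        pos += len(tok)
    return out

def seg_f1(gold, pred):
    """Word-span F1 between a gold and a predicted segmentation
    of the same sentence."""
    g, p = spans(gold), spans(pred)
    tp = len(g & p)  # spans the tokenizer got exactly right
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Example: a classic ambiguity, 商品和服务 ("goods and services"),
# where a hypothetical tokenizer wrongly groups 和服 ("kimono").
gold = ["商品", "和", "服务"]
pred = ["商品", "和服", "务"]
print(round(seg_f1(gold, pred), 3))  # 1 of 3 spans match: P = R = 1/3
```

Averaging this over a few hundred sentences from your own domain tells you far more than a leaderboard number.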
Thank you very much for your answer, which was clearer and more useful than my question! I have been told by people using HanLP that it works very well for text segmentation, and it is my first choice.
I would like to know whether there are genres of text on which it works best (e.g., formal language?) and genres on which it does not work as well (e.g., Weibo posts?), and whether there are follow-up NLP tasks that it supports well (e.g., sentiment analysis, part-of-speech tagging) and others that it supports less well (e.g.?). By a follow-up NLP task, I mean a task that needs segmentation as a first step.
Glad to hear that. Thank you for using HanLP.
First of all, NLP models indeed work best on genres (domains) closest to the corpora they are trained on.
The multilingual tokenizers of HanLP were trained on UD, which contains data from the English Web Treebank: English weblogs, newsgroups, emails, reviews, and question-answers. These sources should give the models good knowledge of those domains, while performance may decrease on other domains (e.g., Twitter).
The Chinese tokenizers (e.g., COARSE_ELECTRA_SMALL_ZH) were trained on the world’s largest corpus, consisting of 100 million characters across various domains, including formal newswire and informal social media (Weibo). Unless your domain is rare, HanLP should perform well in most cases. For those hard cases, you can easily integrate your domain knowledge into HanLP as a dictionary or as heuristics.
For a task “that needs segmentation as a first step” (POS tagging, chunking, parsing), there is no doubt that some tokenization must be performed one way or another.
By “support,” I guess you mean whether HanLP will improve downstream task accuracy when tokenization is optional. The answer depends on the actual task. Take the sentiment analysis you mentioned: document-level sentiment classifiers do not perform word-based tokenization at all, and performing it could at best marginally improve accuracy. However, for aspect-based sentiment analysis, integrating a dependency parse usually helps the model capture aspect-sentiment relations, so a tokenizer is recommended.
In summary, the answer depends on whether word-level structures help your task or not.
This is very clear and very impressive about HanLP - thank you!