TexSmart: A Text Understanding System for Fine-Grained NER and Enhanced...

This technical report introduces TexSmart, an NLU system that supports NER/FGNER and clustering. TexSmart is built on top of clustering of tokens. The authors first mine many is-a relations from the web and cluster them into thousands of categories. Each category is then manually given a hierarchical label. At test time, a mention and its context are taken as input, and their similarity to each cluster is computed to predict the mention’s fine-grained label.
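
To make the prediction step concrete, here is a minimal sketch of similarity-based labeling against clusters. It is not TexSmart's actual implementation: the toy 3-dimensional embeddings, the label names, the centroid representation of a cluster, the `alpha` mixing weight, and the choice of cosine similarity are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Hypothetical clusters: each manually assigned hierarchical label maps to
# the (toy) embeddings of the terms that were clustered together.
CLUSTERS = {
    "loc.city": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "person.athlete": [[0.1, 0.9, 0.1], [0.0, 0.8, 0.3]],
    "org.company": [[0.1, 0.2, 0.9], [0.2, 0.1, 0.8]],
}

def predict_label(mention_vec, context_vec, clusters=CLUSTERS, alpha=0.5):
    """Score every cluster against a mention-plus-context query vector.

    alpha is an assumed mixing weight between the mention embedding and
    the context embedding; the report may combine them differently.
    """
    query = [alpha * m + (1 - alpha) * c
             for m, c in zip(mention_vec, context_vec)]
    scores = {label: cosine(query, centroid(members))
              for label, members in clusters.items()}
    best = max(scores, key=scores.get)
    return best, scores

label, scores = predict_label([0.85, 0.15, 0.05], [0.7, 0.2, 0.1])
print(label)
```

Seen this way, the fine-grained label is simply whichever cluster lies nearest to the mention in context, so the quality of the predictions depends heavily on the clusters and their manual labels.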


  • The authors wrap a vanilla clustering technique in lots of fancy terms like semantic expansion. But once you read that section, it’s nothing but clustering.
  • Their fine-grained NER module is interesting to me. However, the most interesting part is not the technique but the hierarchical ontology. I understand their reason for manually labeling clusters instead of reusing the WordNet ontology, but I wonder whether their clusters are as intuitive as WordNet’s.
  • They should not sell their clustering as a “knowledge base”, because their clusters provide only “is-a” relations, whereas a KB usually offers many more kinds of relations.
  • The rest of their modules are not interesting to me, as most of them are outdated or underperform transformer models.
