A unified approach to sentence segmentation of punctuated text in many languages

This garbage paper applies a transformer to punctuation marks to classify each one as a sentence boundary or not, and claims some improvements on the authors' own evaluation data. Their model is essentially no different from the one in https://konvens.org/proceedings/2019/papers/KONVENS2019_paper_41.pdf, yet the authors neither cite it nor use it as a baseline. Instead, the baselines they use (NLTK, spaCy) are all rule-based and are not usable at all. Their statement “It is therefore unsurprising to find many segmentation errors in existing corpora” is simply wrong, because most treebanks are carefully sentence-segmented, which the authors do not mention.


  • I’m surprised to see such a low-quality paper at ACL 2021.
