Unsupervised multi-granular Chinese word segmentation and term discovery via...

This paper proposes an unsupervised multi-granular Chinese word segmentation method based on graph partitioning over adjacent characters. The authors build a sparse adjacency matrix of characters, defining the sub-diagonal weights as variants of n-gram statistics. Graph-partition criteria are then applied to group the characters into mutually disjoint components such that edges crossing component boundaries have low weights relative to edges within components. Finally, each candidate segmentation is fed to a fine-tuned BERT, which scores it so that incorrectly segmented strings can be filtered out.
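For concreteness, here is a minimal sketch of the segmentation stage as described, with bigram PMI standing in for the paper's n-gram edge weights and a simple threshold cut standing in for the graph-partition criterion. The function name and the threshold are illustrative assumptions, not the authors' code:

```python
# Hedged sketch, not the paper's implementation: adjacency weights from
# bigram PMI, then a trivial "partition" that cuts edges below a threshold.
import math
from collections import Counter

def pmi_segment(corpus, sentence, threshold=0.0):
    """Segment `sentence` by cutting between adjacent characters whose
    bigram PMI in `corpus` falls below `threshold` (a stand-in for the
    paper's graph-partition criterion)."""
    unigrams, bigrams = Counter(), Counter()
    for line in corpus:
        unigrams.update(line)
        bigrams.update(zip(line, line[1:]))
    n_uni = max(sum(unigrams.values()), 1)
    n_bi = max(sum(bigrams.values()), 1)

    def pmi(a, b):
        p_ab = bigrams[(a, b)] / n_bi
        if p_ab == 0.0:
            return float("-inf")           # unseen pair: always cut
        return math.log(p_ab / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))

    if not sentence:
        return []
    words, current = [], sentence[0]
    for a, b in zip(sentence, sentence[1:]):
        if pmi(a, b) >= threshold:
            current += b                   # strong edge: same component
        else:
            words.append(current)          # weak edge: cut here
            current = b
    words.append(current)
    return words
```

On this toy view, the multi-granular output would presumably correspond to varying the partition criterion, e.g., sweeping the threshold here: higher thresholds cut more edges and yield finer segments.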

Comments

  • The graph-partition idea is fairly novel.
  • The whole paper is a hodge-podge: the techniques are applied as-is, without considering how they relate to one another. To name a few issues, the segmenter and the BERT discriminator are not learned jointly; BERT itself can already produce connectivity statistics between adjacent characters (see the sketch after this list); and fine-tuning BERT requires a dictionary, yet the stated goal of the study is to construct exactly such a dictionary, a chicken-and-egg dilemma.
  • The experiments are not convincing, and the baselines are too weak. If BERT is used in the paper, BERT-based segmenters should appear among the baselines; and if the method is claimed to be unsupervised, it should be compared with unsupervised alternatives such as SentencePiece and mutual-information-based segmentation.
  • In conclusion, this is a poor paper. Don’t waste time reading it.
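To make the second comment concrete, the following hedged sketch (assuming the Hugging Face transformers API and the bert-base-chinese checkpoint; this is not the paper's method) shows how BERT's masked-LM head already yields connectivity-like statistics between adjacent characters, with low in-context probability hinting at a word boundary:

```python
# Hedged sketch: score each character's in-context MLM probability; a low
# score for character i suggests a plausible boundary at position i.
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese").eval()

def mlm_connectivity(sentence):
    """Return (token, P(token | context)) pairs; assumes the Chinese BERT
    tokenizer splits the sentence character by character."""
    enc = tokenizer(sentence, return_tensors="pt")
    ids = enc["input_ids"][0]
    scores = []
    for i in range(1, len(ids) - 1):       # skip [CLS] and [SEP]
        masked = enc["input_ids"].clone()
        masked[0, i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(input_ids=masked,
                           attention_mask=enc["attention_mask"]).logits
        probs = torch.softmax(logits[0, i], dim=-1)
        scores.append(probs[ids[i]].item())
    return list(zip(tokenizer.tokenize(sentence), scores))
```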

Rating

  • 5: Transformative: This paper is likely to change our field. It should be considered for a best paper award.
  • 4.5: Exciting: It changed my thinking on this topic. I would fight for it to be accepted.
  • 4: Strong: I learned a lot from it. I would like to see it accepted.
  • 3.5: Leaning positive: It can be accepted more or less in its current form. However, the work it describes is not particularly exciting and/or inspiring, so it will not be a big loss if people don’t see it in this conference.
  • 3: Ambivalent: It has merits (e.g., it reports state-of-the-art results, the idea is nice), but there are key weaknesses (e.g., I didn’t learn much from it, evaluation is not convincing, it describes incremental work). I believe it can significantly benefit from another round of revision, but I won’t object to accepting it if my co-reviewers are willing to champion it.
  • 2.5: Leaning negative: I am leaning towards rejection, but I can be persuaded if my co-reviewers think otherwise.
  • 2: Mediocre: I would rather not see it in the conference.
  • 1.5: Weak: I am pretty confident that it should be rejected.
  • 1: Poor: I would fight to have it rejected.
