Dice Loss for Data-imbalanced NLP Tasks

Throughout this paper, binary classification is used to demonstrate the properties of each loss.

Let X denote a set of training instances. Each instance x_i\in X is associated with a gold binary label y_i=[y_{i0},y_{i1}] denoting the ground-truth class that x_i belongs to, and p_i=[p_{i0},p_{i1}] denotes the predicted probabilities of the two classes, where y_{i0},y_{i1}\in \{0,1\}, p_{i0},p_{i1}\in[0,1] and p_{i1}+p_{i0}=1.

Thus, y_{i1} is a binary indicator that equals 1 only if x_i is a positive example, and y_{i0}=1-y_{i1}.

Cross Entropy Loss

The vanilla cross entropy (CE) loss over N training instances is given by:

\begin{equation} \text{CE} =-\frac{1}{N} \sum_{i}\sum_{j\in\{0,1\}}y_{ij}\log {p}_{ij} \label{eq1} \end{equation}

Its downsides are:

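  1. Every example contributes equally to the objective, regardless of its class, so nothing compensates for class imbalance.
  2. When negative examples vastly outnumber positive ones, the easily classified negatives dominate the training.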
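The CE formula and its second downside can be illustrated with a minimal sketch. The function name `cross_entropy`, the toy dataset of 99 easy negatives plus one hard positive, and the probability values are all assumptions made for illustration; they do not come from the paper.

```python
import math

def cross_entropy(labels, probs):
    """Mean binary CE. labels[i] is y_i1, probs[i] is p_i1 (so p_i0 = 1 - p_i1)."""
    total = 0.0
    for y1, p1 in zip(labels, probs):
        # With a one-hot label, only the log-probability of the true class counts.
        total += -math.log(p1) if y1 == 1 else -math.log(1.0 - p1)
    return total / len(labels)

# Hypothetical imbalanced batch: 99 easy negatives (p_i0 = 0.9)
# and one hard positive (p_i1 = 0.1).
labels = [0] * 99 + [1]
probs = [0.1] * 99 + [0.1]

negative_part = sum(-math.log(1.0 - p) for y, p in zip(labels, probs) if y == 0)
positive_part = sum(-math.log(p) for y, p in zip(labels, probs) if y == 1)

print("CE:", cross_entropy(labels, probs))
print("summed loss from negatives:", negative_part)
print("summed loss from positives:", positive_part)
```

In this toy batch each negative is individually well classified, yet their summed contribution exceeds that of the single misclassified positive.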

Dice Coefficient and Tversky Index

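The Dice coefficient (DSC) measures the similarity between two sets. Taking A as the set of examples predicted as positive and B as the set of gold positive examples, it equals the F1 score: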

\begin{equation} \begin{aligned} \text{DSC}(A,B) &= \frac{2|A\cap B|}{|A|+|B|}\\ \text{DSC} & =\frac{2\text{TP}}{2\text{TP}+\text{FN}+\text{FP}} = \frac{2 \frac{\text{TP}}{\text{TP}+\text{FN}} \frac{\text{TP}}{\text{TP}+\text{FP}} }{\frac{\text{TP}}{\text{TP}+\text{FN}}+\frac{\text{TP}}{\text{TP}+\text{FP}}}\\&= \frac{2 \text{Pre} \times \text{Rec}}{\text{Pre+Rec}}= F1 \end{aligned} \end{equation}
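The identity between the set form and F1 can be checked numerically. The sketch below assumes A is the set of predicted positives and B the set of gold positives; the example IDs and the helper names `dice_sets` and `f1_from_counts` are hypothetical.

```python
def dice_sets(a, b):
    """DSC(A, B) = 2|A ∩ B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b))

def f1_from_counts(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

predicted = {1, 2, 3, 4}  # A: IDs of examples predicted positive (made up)
gold = {3, 4, 5}          # B: IDs of gold positive examples (made up)

tp = len(predicted & gold)  # predicted positive and truly positive
fp = len(predicted - gold)  # predicted positive but truly negative
fn = len(gold - predicted)  # truly positive but missed

print(dice_sets(predicted, gold), f1_from_counts(tp, fp, fn))
```

Both values agree because |A| = TP + FP and |B| = TP + FN, so |A| + |B| = 2TP + FP + FN.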

When applied to an example x_i, its coefficient is given as:

\begin{equation} \text{DSC}(x_i) = \frac{2{p}_{i1} y_{i1}}{ {p}_{i1}+ y_{i1}} \label{dssafc} \end{equation}

This form is still inadequate: for a negative example (y_{i1}=0) the coefficient is 0 whatever p_{i1} is, so negative examples contribute nothing to training.

Adding a smoothing constant \gamma to both the numerator and the denominator fixes this:

\begin{equation} \text{DSC}(x_i) = \frac{2{p}_{i1} y_{i1}+\gamma}{ {p}_{i1}+ y_{i1}+\gamma} \end{equation}

The coefficient of a negative example then becomes \frac{\gamma}{ {p}_{i1}+\gamma}, which depends on p_{i1}, so negative examples now influence training.
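The effect of the smoothing constant on negative examples can be seen in a small sketch. The function name `dsc` and the choice \gamma=1 are assumptions for illustration only.

```python
def dsc(p1, y1, gamma=0.0):
    """Per-example Dice coefficient: (2 * p_i1 * y_i1 + gamma) / (p_i1 + y_i1 + gamma).

    With gamma = 0 this is the unsmoothed form, which is undefined
    when p_i1 = y_i1 = 0 (division by zero).
    """
    return (2 * p1 * y1 + gamma) / (p1 + y1 + gamma)

# Negative example (y_i1 = 0) at two different predicted probabilities.
for p1 in (0.2, 0.9):
    unsmoothed = dsc(p1, 0)           # always 0: the prediction has no effect
    smoothed = dsc(p1, 0, gamma=1.0)  # gamma / (p_i1 + gamma): varies with p_i1
    print(p1, unsmoothed, smoothed)
```

Without smoothing, both predictions score the same; with smoothing, the more confident (lower p_{i1}) negative prediction receives the higher coefficient.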

Squaring the terms in the denominator gives the following dice loss (DL), which is reported to converge faster:

\begin{equation} \text{DL}=\frac{1}{N} \sum_{i} \left[ 1 - \frac{2{p}_{i1} y_{i1}+\gamma}{{p}_{i1}^2 + y_{i1}^2+\gamma}\right] \end{equation}
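A minimal pure-Python sketch of this loss follows. The name `dice_loss`, the default \gamma=1, and the toy inputs are assumptions; a real training setup would compute the same expression on tensors.

```python
def dice_loss(labels, probs, gamma=1.0):
    """Mean over examples of 1 - (2*p*y + gamma) / (p^2 + y^2 + gamma).

    labels[i] is y_i1 and probs[i] is p_i1.
    """
    total = 0.0
    for y1, p1 in zip(labels, probs):
        total += 1.0 - (2 * p1 * y1 + gamma) / (p1 ** 2 + y1 ** 2 + gamma)
    return total / len(labels)

# Perfect predictions reach the minimum of zero; worse ones are penalised.
print(dice_loss([1, 0], [1.0, 0.0]))
print(dice_loss([1, 0], [0.3, 0.7]))
```

Note that the second example's negative term is nonzero only because \gamma > 0; with \gamma=0 the negative example would again drop out of the loss.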

Self-adjusting Dice Loss

The problem with Eq.4 is that easy negative examples can still dominate the training on an imbalanced dataset. To address this, the authors propose multiplying p_{i1} by a decaying factor (1-p_{i1}), which gives the following adaptive variant:

\begin{equation} \text{DSC}(x_i) = \frac{2(1-p_{i1}) p_{i1}\cdot y_{i1}+\gamma}{(1-p_{i1}) p_{i1}+y_{i1}+\gamma} \label{adjust} \end{equation}

For confident examples, p_{i1} is very close to 0 or 1, so the effective probability (1-p_{i1})p_{i1} is very small and these easy examples receive less weight.
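The behaviour of the decaying factor can be sketched as below. The helper names `effective_probability` and `self_adjusting_dsc`, the default \gamma=1, and the sample probabilities are assumptions for illustration.

```python
def effective_probability(p1):
    """The decayed probability (1 - p_i1) * p_i1 used in the adaptive variant."""
    return (1.0 - p1) * p1

def self_adjusting_dsc(p1, y1, gamma=1.0):
    """(2*(1-p)*p*y + gamma) / ((1-p)*p + y + gamma) for one example."""
    w = effective_probability(p1)
    return (2 * w * y1 + gamma) / (w + y1 + gamma)

# The effective probability peaks at p = 0.5 and vanishes near 0 and 1.
for p1 in (0.01, 0.5, 0.99):
    print(p1, effective_probability(p1), self_adjusting_dsc(p1, 0))
```

The effective probability is symmetric around 0.5, so confident predictions in either direction are damped equally, while uncertain ones keep the largest weight.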
