Self-Training with Weak Supervision

Prior self-training work only considers instances covered by weak rules, leaving out most of the valuable unlabeled data. This paper develops a weak-supervision framework that leverages all the available data through a teacher model that combines weighted weak rules.

The student model learns from both the labeled data and the teacher's predictions on unlabeled data:

\begin{multline} \label{eq:st} \min_\theta\ \mathbb{E}_{x_l, y_l \in D_L} [-\log\ p_\theta (y_l \mid x_l)] + \lambda \mathbb{E}_{x \in D_U} \mathbb{E}_{y \sim q_{\phi^*}(y \mid x)} [-\log\ p_\theta (y \mid x)] \end{multline}
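
A minimal PyTorch sketch of this objective, assuming the teacher distribution q_{\phi^*} is held fixed while the student is updated; all names are illustrative rather than taken from the paper's code:

```python
import torch
import torch.nn.functional as F

def student_loss(logits_l, labels_l, logits_u, teacher_probs_u, lam=1.0):
    # Supervised term: cross-entropy on the labeled set D_L.
    sup = F.cross_entropy(logits_l, labels_l)
    # Self-training term: expected cross-entropy under the fixed
    # teacher distribution q_{phi*}(y | x) over the unlabeled set D_U.
    log_p = F.log_softmax(logits_u, dim=-1)
    unsup = -(teacher_probs_u * log_p).sum(dim=-1).mean()
    return sup + lam * unsup
```

Note that the expectation over y \sim q_{\phi^*} reduces to a soft cross-entropy between the teacher's distribution and the student's log-probabilities.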

The student model's prediction is then combined with the other weak rules into an attention-weighted prediction that is used to train the teacher (a rule attention network, RAN):

\begin{equation} q_i = \frac{1}{Z_i} \bigg( \sum_{j: \ r^j \in R_i} a^j_i q^j_i + a_i^S p_\theta(y|x_i) + a^u_i u \bigg), \label{eq:ran-aggregation} \end{equation}

where the a's are attention scores computed from the dot product of a rule embedding and a sample embedding, and u is a uniform rule distribution that assigns equal probability to each of the K classes, u=[\frac{1}{K}, \dots, \frac{1}{K}].
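
A hedged sketch of this aggregation; it assumes sigmoid-activated dot-product attention, a separate learned embedding for the student, and a fixed weight of 1 for the uniform rule, none of which is pinned down by the summary above:

```python
import torch

def ran_aggregate(rule_probs, rule_embs, student_probs, student_emb,
                  sample_emb, a_uniform=1.0):
    # rule_probs: (R, K) predictions q_i^j of the R rules covering x_i
    # rule_embs:  (R, d) rule embeddings; sample_emb: (d,) embedding of x_i
    K = student_probs.shape[-1]
    u = torch.full((K,), 1.0 / K)                    # uniform distribution over K classes
    a_rules = torch.sigmoid(rule_embs @ sample_emb)  # a_i^j, one score per rule
    a_student = torch.sigmoid(student_emb @ sample_emb)  # a_i^S for the student-as-rule
    mix = ((a_rules.unsqueeze(-1) * rule_probs).sum(dim=0)
           + a_student * student_probs + a_uniform * u)
    return mix / mix.sum()                           # divide by Z_i to normalize
```

This aggregated prediction is then optimized as the sum of two terms (a code sketch follows the equation below):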

  1. the cross-entropy between the gold label y_i and q_i on the labeled data, and
  2. a minimum-entropy regularization of q_i on the unlabeled data, which pushes the weighted prediction to be confident and encourages the overlap (agreement) between rules to be large:
\begin{equation} \mathcal{L}^{RAN} = -\sum_{(x_i, y_i) \in D_L} y_i \log q_i - \sum_{x_i \in D_U} q_i \log q_i. \label{eq:ran-ssl-objective} \end{equation}
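
A minimal sketch of this objective, assuming q_i comes from the aggregation above, gold labels are one-hot vectors, and a small epsilon guards \log 0:

```python
import torch

def ran_loss(q_labeled, y_onehot, q_unlabeled, eps=1e-8):
    # Supervised term: -sum_i y_i . log q_i over D_L.
    ce = -(y_onehot * torch.log(q_labeled + eps)).sum(dim=-1).mean()
    # Minimum-entropy term: -sum_i q_i . log q_i over D_U; minimizing it
    # pushes each aggregated prediction toward a confident, low-entropy one.
    ent = -(q_unlabeled * torch.log(q_unlabeled + eps)).sum(dim=-1).mean()
    return ce + ent
```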

The teacher's prediction is then used to train the student again, closing the loop. Their method improves over several baselines by 0.2 to 10 points.

Comments

  • The use of minimum entropy is smart.
  • Snorkel is beaten by a large margin.
Rating
  • 5: Transformative: This paper is likely to change our field. It should be considered for a best paper award.
  • 4.5: Exciting: It changed my thinking on this topic. I would fight for it to be accepted.
  • 4: Strong: I learned a lot from it. I would like to see it accepted.
  • 3.5: Leaning positive: It can be accepted more or less in its current form. However, the work it describes is not particularly exciting and/or inspiring, so it will not be a big loss if people don’t see it in this conference.
  • 3: Ambivalent: It has merits (e.g., it reports state-of-the-art results, the idea is nice), but there are key weaknesses (e.g., I didn’t learn much from it, evaluation is not convincing, it describes incremental work). I believe it can significantly benefit from another round of revision, but I won’t object to accepting it if my co-reviewers are willing to champion it.
  • 2.5: Leaning negative: I am leaning towards rejection, but I can be persuaded if my co-reviewers think otherwise.
  • 2: Mediocre: I would rather not see it in the conference.
  • 1.5: Weak: I am pretty confident that it should be rejected.
  • 1: Poor: I would fight to have it rejected.
