SimCSE: Simple Contrastive Learning of Sentence Embeddings

1114 · May 20, 2021, 4:40am

Universal sentence embeddings are much more robust than supervised semantic textual similarity models because they are not bound to the domain of the training set. This paper proposes a surprisingly simple method, SimCSE, that learns such embeddings by matching the representations of the same sentence under different dropout masks.

\newcommand\mr[1]{\mathrm{#1}} \newcommand\mc[1]{\mathcal{#1}} \newcommand\mf[1]{\mathbf{#1}}
\begin{equation} \label{eq:unsup_objective}
\begin{aligned}
\ell_i = - \log \frac{e^{\mr{sim}(\mf{h}_i^{z_i}, \mf{h}_i^{z'_i}) / \tau}}{\sum_{j=1}^N e^{\mr{sim}(\mf{h}_i^{z_i}, \mf{h}_j^{z'_j}) / \tau}},
\end{aligned}
\end{equation}

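where each \mf{h} is a sentence embedding, z and z' are different dropout masks applied to the same sentence, \mr{sim} is cosine similarity, \tau is a temperature, and N is the batch size.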
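The unsupervised objective above can be sketched in a few lines of dependency-free Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `cosine` and `unsup_simcse_loss` are hypothetical, `tau=0.05` is an assumed default, and `h` and `h_prime` are assumed to hold the embeddings of the same batch from two forward passes with different dropout masks.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def unsup_simcse_loss(h, h_prime, tau=0.05):
    """Return the per-sentence losses ell_i.

    h[i] and h_prime[i] are two embeddings of sentence i (assumed to come
    from two dropout masks); every h_prime[j] with j != i acts as an
    in-batch negative for sentence i.
    """
    losses = []
    for i, h_i in enumerate(h):
        logits = [cosine(h_i, h_j) / tau for h_j in h_prime]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return losses


if __name__ == "__main__":
    view_a = [[1.0, 0.0], [0.0, 1.0]]
    view_b = [[0.9, 0.1], [0.2, 0.8]]
    print(unsup_simcse_loss(view_a, view_b))
```

The loss for a sentence is small when its two views are closer to each other than to the other sentences in the batch, and grows when some other sentence's view is the nearest one.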

The method can also be enhanced with supervised data, which supplies both positive and negative samples:

\begin{aligned} - \log \frac{e^{\mr{sim}(\mf{h}_i,\mf{h}_i^+ )/ \tau }}{\sum_{j=1}^N\left(e^{\mr{sim}(\mf{h}_i,\mf{h}_j^+)/\tau}+e^{\mr{sim}(\mf{h}_i,\mf{h}_j^-)/ \tau}\right)}. \end{aligned}

where + and - indicate positive and negative samples, respectively, drawn from entailment datasets.
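The supervised variant differs only in the denominator, which also sums over the negative of every sentence in the batch. A minimal sketch follows, again with hypothetical names (`cosine`, `sup_simcse_loss`), an assumed `tau=0.05`, and the assumption that sim is cosine similarity; it is not taken from the authors' code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sup_simcse_loss(h, h_pos, h_neg, tau=0.05):
    """Per-sentence supervised losses.

    h[i] is the anchor, h_pos[i] its positive and h_neg[i] its negative.
    The denominator covers the positives and negatives of all N sentences.
    """
    losses = []
    for i, h_i in enumerate(h):
        pos_logits = [cosine(h_i, p) / tau for p in h_pos]
        neg_logits = [cosine(h_i, n) / tau for n in h_neg]
        all_logits = pos_logits + neg_logits
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(all_logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in all_logits))
        losses.append(log_denominator - pos_logits[i])
    return losses


if __name__ == "__main__":
    anchors = [[1.0, 0.0]]
    positives = [[1.0, 0.0]]
    negatives = [[0.0, 1.0]]
    print(sup_simcse_loss(anchors, positives, negatives, tau=1.0))
```

Setting `h_neg` to an empty list drops the negative terms, leaving a contrastive loss over the positives alone.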

The objective can be justified theoretically: the authors derive a lower bound on it and relate that bound to an upper bound on the eigenvalues of the embedding matrix, arguing that optimizing the objective flattens the singular spectrum and thereby promotes uniformity.

Comments

It’s rare to read a DL paper with solid theoretical justifications.
I’m wondering if SimCSE is able to improve upon supervised models with a regression layer in fine-tuning settings.

Rating

5: Transformative: This paper is likely to change our field. It should be considered for a best paper award.
4.5: Exciting: It changed my thinking on this topic. I would fight for it to be accepted.
4: Strong: I learned a lot from it. I would like to see it accepted.
3.5: Leaning positive: It can be accepted more or less in its current form. However, the work it describes is not particularly exciting and/or inspiring, so it will not be a big loss if people don’t see it in this conference.
3: Ambivalent: It has merits (e.g., it reports state-of-the-art results, the idea is nice), but there are key weaknesses (e.g., I didn’t learn much from it, evaluation is not convincing, it describes incremental work). I believe it can significantly benefit from another round of revision, but I won’t object to accepting it if my co-reviewers are willing to champion it.
2.5: Leaning negative: I am leaning towards rejection, but I can be persuaded if my co-reviewers think otherwise.
2: Mediocre: I would rather not see it in the conference.
1.5: Weak: I am pretty confident that it should be rejected.
1: Poor: I would fight to have it rejected.


1114 · May 21, 2021, 1:58am

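The fine-tuning results on STS-B are promising for BERT-Large but not for RoBERTa-Large: starting from a SimCSE checkpoint raises both correlations by roughly one point for BERT-Large, while RoBERTa-Large drops slightly.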

model	pearson	spearmanr
roberta-large	92.13%	92.02%
w/ unsup-simcse	91.83%	91.78%
w/ sup-simcse	91.71%	91.80%
bert-large	90.10%	89.79%
w/ unsup-simcse	91.06%	90.74%
w/ sup-simcse	91.32%	90.99%
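To make the comparison easier to read, the differences from each baseline can be computed directly from the numbers in the table. The snippet below is just arithmetic on those reported scores; the dictionary layout and the `deltas` helper are hypothetical names chosen for illustration.

```python
# Scores copied from the table above as (Pearson %, Spearman %).
scores = {
    "roberta-large": {
        "baseline": (92.13, 92.02),
        "unsup-simcse": (91.83, 91.78),
        "sup-simcse": (91.71, 91.80),
    },
    "bert-large": {
        "baseline": (90.10, 89.79),
        "unsup-simcse": (91.06, 90.74),
        "sup-simcse": (91.32, 90.99),
    },
}


def deltas(model):
    """Difference of each SimCSE-initialized run from the plain baseline."""
    base = scores[model]["baseline"]
    return {
        name: tuple(round(value - b, 2) for value, b in zip(values, base))
        for name, values in scores[model].items()
        if name != "baseline"
    }


if __name__ == "__main__":
    for model in scores:
        print(model, deltas(model))
```

Positive values mean the SimCSE-initialized model beat the plain checkpoint after fine-tuning; negative values mean it fell short.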

Hyperparameters:

python3 examples/pytorch/text-classification/run_glue.py \
  --model_name_or_path princeton-nlp/sup-simcse-roberta-large \
  --task_name stsb \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 16 \
  --learning_rate 2e-5 \
  --num_train_epochs 10 \
  --overwrite_output_dir \
  --fp16 \
  --gradient_accumulation_steps 1 \
  --warmup_steps 214 \
  --output_dir data/model/stsb_simcse_ro_sup/