SimCSE: Simple Contrastive Learning of Sentence Embeddings

Universal sentence embeddings are more robust than supervised semantic textual similarity models because they are not bound to the domain of a particular training set. This paper proposes a surprisingly simple method, SimCSE, which learns such embeddings by matching representations of the same sentence encoded under different dropout masks.

\begin{equation}
\label{eq:unsup_objective}
\ell_i = - \log \frac{e^{\mathrm{sim}(\mathbf{h}_i^{z_i},\, \mathbf{h}_i^{z'_i}) / \tau}}{\sum_{j=1}^N e^{\mathrm{sim}(\mathbf{h}_i^{z_i},\, \mathbf{h}_j^{z'_j}) / \tau}},
\end{equation}

where the $\mathbf{h}$'s are sentence embeddings and the $z$'s are independently sampled dropout masks.
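
A minimal PyTorch sketch of this unsupervised objective (assuming a Hugging Face-style encoder whose internal dropout supplies the two masks; `unsup_simcse_loss` and `temperature` are illustrative names, not from the official code):

```python
import torch
import torch.nn.functional as F

def unsup_simcse_loss(encoder, batch, temperature=0.05):
    # Encode the same batch twice; distinct dropout masks give h^{z} and h^{z'}.
    h1 = encoder(**batch).last_hidden_state[:, 0]  # [CLS] pooling, shape (N, d)
    h2 = encoder(**batch).last_hidden_state[:, 0]  # second pass, new dropout mask
    # Pairwise cosine similarities sim(h_i^{z_i}, h_j^{z'_j}) / tau, shape (N, N).
    sim = F.cosine_similarity(h1.unsqueeze(1), h2.unsqueeze(0), dim=-1) / temperature
    # In-batch negatives: the positive for example i sits on the diagonal.
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)
```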

Their method can also be enhanced with supervised data, taking entailment pairs as positives and contradiction pairs as hard negatives:

\begin{equation}
\ell_i = - \log \frac{e^{\mathrm{sim}(\mathbf{h}_i,\, \mathbf{h}_i^{+}) / \tau}}{\sum_{j=1}^N \left( e^{\mathrm{sim}(\mathbf{h}_i,\, \mathbf{h}_j^{+}) / \tau} + e^{\mathrm{sim}(\mathbf{h}_i,\, \mathbf{h}_j^{-}) / \tau} \right)},
\end{equation}

where $+$ and $-$ indicate positive (entailment) and negative (contradiction) pairs from natural language inference datasets, respectively.
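
A corresponding sketch for the supervised variant, assuming the premise, entailment, and contradiction sentences have already been encoded into `(N, d)` tensors (function and variable names are mine, not from the official code):

```python
import torch
import torch.nn.functional as F

def sup_simcse_loss(h, h_pos, h_neg, temperature=0.05):
    # h: premises, h_pos: entailment hypotheses (positives),
    # h_neg: contradiction hypotheses (hard negatives); all of shape (N, d).
    sim_pos = F.cosine_similarity(h.unsqueeze(1), h_pos.unsqueeze(0), dim=-1) / temperature
    sim_neg = F.cosine_similarity(h.unsqueeze(1), h_neg.unsqueeze(0), dim=-1) / temperature
    # Denominator sums over all positives and hard negatives in the batch;
    # the numerator sim(h_i, h_i^+) is column i of sim_pos.
    logits = torch.cat([sim_pos, sim_neg], dim=1)      # (N, 2N)
    labels = torch.arange(h.size(0), device=h.device)
    return F.cross_entropy(logits, labels)
```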

This objective also has a theoretical justification: the uniformity term of the loss is lower-bounded (via Jensen's inequality) by the sum of the entries of the embedding Gram matrix, which in turn upper-bounds that matrix's largest eigenvalue. Minimizing the objective therefore pushes down the top eigenvalue, flattening the singular spectrum of the embedding space and promoting uniformity.
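
One informal way to observe this effect is to compare the normalized singular spectrum of the embedding matrix before and after SimCSE training; a small sketch (my own, not from the paper):

```python
import torch

def normalized_singular_spectrum(embeddings):
    # embeddings: (N, d) matrix of sentence embeddings.
    centered = embeddings - embeddings.mean(dim=0, keepdim=True)
    s = torch.linalg.svdvals(centered)   # singular values in descending order
    return s / s.sum()                   # a flatter spectrum indicates less anisotropy
```

Comparing the spectra of vanilla BERT embeddings and SimCSE embeddings on the same sentences should show SimCSE spreading mass away from the top few singular values.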

Comments

  • It’s rare to read a DL paper with solid theoretical justifications.
  • I’m wondering if SimCSE is able to improve upon supervised models with a regression layer in fine-tuning settings.
Rating
  • 5: Transformative: This paper is likely to change our field. It should be considered for a best paper award.
  • 4.5: Exciting: It changed my thinking on this topic. I would fight for it to be accepted.
  • 4: Strong: I learned a lot from it. I would like to see it accepted.
  • 3.5: Leaning positive: It can be accepted more or less in its current form. However, the work it describes is not particularly exciting and/or inspiring, so it will not be a big loss if people don’t see it in this conference.
  • 3: Ambivalent: It has merits (e.g., it reports state-of-the-art results, the idea is nice), but there are key weaknesses (e.g., I didn’t learn much from it, evaluation is not convincing, it describes incremental work). I believe it can significantly benefit from another round of revision, but I won’t object to accepting it if my co-reviewers are willing to champion it.
  • 2.5: Leaning negative: I am leaning towards rejection, but I can be persuaded if my co-reviewers think otherwise.
  • 2: Mediocre: I would rather not see it in the conference.
  • 1.5: Weak: I am pretty confident that it should be rejected.
  • 1: Poor: I would fight to have it rejected.


The fine-tuning results on STS-B are promising for BERT-Large but not for RoBERTa-Large.

| Model | Pearson | Spearman |
| --- | --- | --- |
| roberta-large | 92.13% | 92.02% |
| w/ unsup-simcse | 91.83% | 91.78% |
| w/ sup-simcse | 91.71% | 91.80% |
| bert-large | 90.10% | 89.79% |
| w/ unsup-simcse | 91.06% | 90.74% |
| w/ sup-simcse | 91.32% | 90.99% |

Hyperparameters:

python3 examples/pytorch/text-classification/run_glue.py \
  --model_name_or_path princeton-nlp/sup-simcse-roberta-large \
  --task_name stsb \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 16 \
  --learning_rate 2e-5 \
  --num_train_epochs 10 \
  --overwrite_output_dir \
  --fp16 \
  --gradient_accumulation_steps 1 \
  --warmup_steps 214 \
  --output_dir data/model/stsb_simcse_ro_sup/
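
For reference, the SimCSE checkpoint itself (without the regression head trained above) can score sentence pairs directly by cosine similarity; a rough sketch using the `transformers` API, where taking `pooler_output` as the sentence embedding is my assumption rather than the official pooling logic:

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "princeton-nlp/sup-simcse-roberta-large"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

sentences = ["A man is playing a guitar.", "Someone plays an instrument."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    # Assumed pooling choice; the SimCSE repo exposes its own pooling logic.
    emb = model(**inputs).pooler_output
score = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)
print(f"cosine similarity: {score.item():.4f}")
```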