DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings
Introduction
- What is the name of the DiffCSE paper?
- DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings
- MIT, Meta, Yoon Kim
- What are the main contributions of the DiffCSE paper?
- improved results on semantic textual similarity (STS) tasks
- a sentence embedding method that is aware of (sensitive to) edited sentences
Method
- What previous work showed invariance to certain augmentations to be harmful in contrastive learning of text embeddings?
- SimCSE: using an MLM to replace 15% of words to construct positive pairs works less well than dropout-based positives
- deletion/insertion-based augmentations also work less well
- The idea behind DiffCSE is to be sensitive to (rather than invariant to) certain transformations
- the encoder should be equivariant, but not invariant, to MLM-based augmentations
- DiffCSE generates augmented instances by masking out random tokens and sampling replacements from a masked language model (a rough sketch of this step is below)
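- A minimal sketch of this augmentation step, using an off-the-shelf Hugging Face masked LM as the generator. The checkpoint name and the 15% masking ratio are illustrative assumptions, not necessarily the paper's exact settings:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Hypothetical generator checkpoint; DiffCSE uses a small masked LM as the generator.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
generator = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased").eval()

def make_edited_sentence(sentence: str, mask_ratio: float = 0.15) -> str:
    """Mask a random subset of tokens and fill them by sampling from the generator."""
    enc = tokenizer(sentence, return_tensors="pt")
    input_ids = enc["input_ids"].clone()

    # Choose positions to mask (skip the special tokens at the ends).
    n_tokens = input_ids.size(1)
    candidates = torch.arange(1, n_tokens - 1)
    n_mask = max(1, int(mask_ratio * len(candidates)))
    masked_pos = candidates[torch.randperm(len(candidates))[:n_mask]]
    input_ids[0, masked_pos] = tokenizer.mask_token_id

    # Sample replacement tokens from the generator's distribution at the masked positions.
    with torch.no_grad():
        logits = generator(input_ids=input_ids, attention_mask=enc["attention_mask"]).logits
    probs = torch.softmax(logits[0, masked_pos], dim=-1)
    sampled = torch.multinomial(probs, num_samples=1).squeeze(-1)
    input_ids[0, masked_pos] = sampled

    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
```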
- What is the technical definition of equivariance?
- $f(T(x)) = T'(f(x))$ for some transformation $T'$ on the output space
- when $T'$ is the identity transformation, we say $f$ is trained to be invariant to $T$
- relaxing $T'$ to a broader class of transformations makes $f$ equivariant rather than merely invariant
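- Spelling the two cases out in the notation above:

```latex
% Notation from the bullets above: f = encoder, T = input transformation,
% T' = corresponding transformation on the embedding space.
\[ f(T(x)) = T'\!\left(f(x)\right) \qquad \text{(equivariance)} \]
\[ f(T(x)) = f(x) \qquad \text{(invariance: } T' = \mathrm{id}\text{)} \]
% DiffCSE wants f to be invariant to dropout noise but only
% equivariant, not invariant, to MLM-based edits.
```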
- DiffCSE contains a conditional discriminator that predicts the difference between the original and edited sentences
- note that there are 3 model components: encoder, generator, and discriminator
- the ELECTRA-style discriminator is conditioned on the encoder representation of the original sentence (sketched below)
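- A rough sketch of how such a conditional discriminator could look in PyTorch. The conditioning mechanism shown here (adding the sentence embedding to every token embedding of the edited sentence) and the module/checkpoint names are illustrative assumptions, not the released implementation:

```python
import torch.nn as nn
from transformers import AutoModel

class ConditionalRTDHead(nn.Module):
    """ELECTRA-style replaced-token-detection head conditioned on a sentence embedding."""

    def __init__(self, model_name: str = "distilbert-base-uncased"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        hidden = self.backbone.config.hidden_size
        # Binary classifier per token: original (0) vs. replaced (1).
        self.classifier = nn.Linear(hidden, 2)

    def forward(self, edited_input_ids, attention_mask, sentence_embedding):
        # Condition on the original sentence's embedding by adding it to every
        # token embedding of the edited sentence (one illustrative choice).
        token_embeds = self.backbone.get_input_embeddings()(edited_input_ids)
        token_embeds = token_embeds + sentence_embedding.unsqueeze(1)
        hidden_states = self.backbone(
            inputs_embeds=token_embeds, attention_mask=attention_mask
        ).last_hidden_state
        return self.classifier(hidden_states)  # [batch, seq_len, 2]
```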
- The DiffCSE fine-tuning objective combines the standard contrastive loss with a replaced token detection (RTD) loss (see the loss sketch below)
- the RTD loss is backpropagated through the encoder
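- In rough equation form the objective is $L = L_{\text{contrastive}} + \lambda \cdot L_{\text{RTD}}$. A minimal sketch of combining the two terms; the temperature, $\lambda$ value, and function names are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def diffcse_style_loss(z1, z2, rtd_logits, rtd_labels, temperature=0.05, lam=0.005):
    """Combine a SimCSE-style contrastive loss with a replaced-token-detection loss.

    z1, z2:      [batch, dim] embeddings of the same sentences under two dropout
                 masks (positive pairs; other in-batch sentences are negatives).
    rtd_logits:  [batch, seq_len, 2] per-token original-vs-replaced predictions
                 from the conditional discriminator.
    rtd_labels:  [batch, seq_len] with 0 = original token, 1 = replaced token.
    lam:         weighting coefficient (illustrative default, not the paper's value).
    """
    # SimCSE-style in-batch contrastive loss (cosine similarity / temperature).
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature
    targets = torch.arange(z1.size(0), device=z1.device)
    contrastive = F.cross_entropy(sim, targets)

    # Replaced-token-detection loss; gradients flow back into the encoder
    # through the conditioning sentence embedding.
    rtd = F.cross_entropy(rtd_logits.reshape(-1, 2), rtd_labels.reshape(-1))

    return contrastive + lam * rtd
```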
- The intuition behind DiffCSE is that it trains the encoded sentence representation to distinguish itself from edited versions of the sentence
Results
- DiffCSE results: high scores on STS tasks
- note that concurrent work PromptBERT gets better results (79.15 STS avg)
- DiffCSE SentEval results: achieves the highest average score
- DiffCSE ablations: using the next sentence for the RTD loss leads to worse results
- also, using CMLM instead of ELECTRA-style RTD leads to slightly worse results (but still better than SimCSE)
- DiffCSE augmentation ablations: MLM-based replacement performs best
- insertion variant: the ELECTRA-style discriminator instead has to predict which tokens were inserted
- DiffCSE uses a two-layer MLP pooler with batch normalization as its pooling method:
```python
import torch.nn as nn

class ProjectionMLP(nn.Module):
    """Two-layer MLP pooler with batch normalization used by DiffCSE."""

    def __init__(self, hidden_size):
        super().__init__()
        in_dim = hidden_size
        middle_dim = hidden_size * 2
        out_dim = hidden_size
        self.net = nn.Sequential(
            nn.Linear(in_dim, middle_dim, bias=False),
            nn.BatchNorm1d(middle_dim),
            nn.ReLU(inplace=True),
            nn.Linear(middle_dim, out_dim, bias=False),
            nn.BatchNorm1d(out_dim, affine=False),
        )

    def forward(self, x):
        # x: [batch, hidden_size] pooled sentence representations
        return self.net(x)
```
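- Presumably this projector is applied on top of the pooled ([CLS]) representation before the contrastive loss; a hypothetical usage:

```python
import torch

mlp = ProjectionMLP(hidden_size=768)
cls_embeddings = torch.randn(32, 768)   # stand-in for a batch of [CLS] vectors
z = mlp(cls_embeddings)                 # -> [32, 768] projected embeddings
```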
- Sentence embeddings from DiffCSE appear to have better alignment but not better uniformity (see the metric sketch below)
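- Alignment and uniformity here refer to the standard embedding-analysis metrics (lower is better for both); a small sketch of how they are typically computed, assuming L2-normalized embeddings:

```python
import torch

def alignment(x, y, alpha=2):
    """Expected distance between positive pairs; lower is better.
    x, y: [N, dim] L2-normalized embeddings of paired sentences."""
    return (x - y).norm(dim=1).pow(alpha).mean()

def uniformity(x, t=2):
    """Log of the average pairwise Gaussian potential; lower is better.
    x: [N, dim] L2-normalized embeddings."""
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()
```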
Conclusions
This work proposes an innovative method for unsupervised learning of sentence embeddings that achieves state-of-the-art results. It is slightly behind the STS results reported by the much simpler PromptBERT. However, PromptBERT does not report SentEval results, so it is possible that DiffCSE is more robust. Overall, the approach of including a difference-aware pretraining objective is interesting and could generalize to other applications. It is interesting that transferring lessons from self-supervised learning in the computer vision literature to NLP requires very careful treatment of the types of transformations applied. It is almost disappointing that, so far, the SimCSE dropout augmentation seems to work best.
Reference
@inproceedings{Chuang2022DiffCSEDC,
title={DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings},
author={Yung-Sung Chuang and Rumen Dangovski and Hongyin Luo and Yang Zhang and Shiyu Chang and Marin Solja{\v{c}}i{\'c} and Shang-Wen Li and Wen-tau Yih and Yoon Kim and James R. Glass},
year={2022}
}