Compressing Sentence Representation for Semantic Retrieval via Homomorphic Projective Distillation

Introduction

  • What is the name of the HPD paper?

    Compressing Sentence Representation for Semantic Retrieval via Homomorphic Projective Distillation

    Xuandong Zhao, Zhiguo Yu, Ming Wu, Lei Li

  • What are the main contributions of the HPD paper?
    • competitive performance with models that use much larger embedding vectors, while being 8 times faster
    • gains on STS benchmarks compared to similarly sized models
  • The HPD method distills information from a pretrained teacher model (the overall objective is sketched below)
    • note: the PCA projection is precomputed from teacher embeddings
    • the teacher model is not updated during training

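As a rough summary of the setup detailed in the Method notes below: the student encoder $f_\theta$ is trained against a frozen teacher $g$, with a contrastive term plus an MSE term on the PCA-compressed teacher embedding. The notation and the weight $\lambda$ are my own shorthand rather than the paper's:

$$\mathcal{L} \;=\; \mathcal{L}_{\text{contrastive}} \;+\; \lambda \,\big\lVert f_\theta(x) - \mathrm{PCA}\big(g(x)\big) \big\rVert_2^2$$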

Method

  • The HPD method uses the same contrastive objective as SimCSE (loss sketched below)
    • note: they train on NLI pairs, similar to sentence-transformers
    • in-batch negatives, with NLI contradictions serving as hard negatives

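A minimal sketch of a SimCSE-style supervised contrastive loss with in-batch negatives plus NLI hard negatives, assuming `(batch, dim)` tensors of student embeddings; the function name and the temperature value are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, hard_negative, temperature=0.05):
    """anchor, positive, hard_negative: (batch, dim) student embeddings for the
    premise, its entailment (positive) and its contradiction (hard negative)."""
    # Cosine similarity of every anchor against every positive in the batch.
    sim_pos = F.cosine_similarity(anchor.unsqueeze(1), positive.unsqueeze(0), dim=-1)
    # Cosine similarity of every anchor against every hard negative in the batch.
    sim_neg = F.cosine_similarity(anchor.unsqueeze(1), hard_negative.unsqueeze(0), dim=-1)
    logits = torch.cat([sim_pos, sim_neg], dim=1) / temperature   # (batch, 2*batch)
    # The matching positive for row i sits at column i of the first block.
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)
```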

  • The HPD method compresses the teacher embedding vectors using PCA
    • the PCA is fit on average-pooled teacher embeddings of m sampled sentences (see the sketch below)

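A minimal sketch of the PCA compression step, assuming a teacher dimension of 1024 and a target dimension of 128 (both illustrative); random vectors stand in for real teacher embeddings here.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
teacher_dim, target_dim = 1024, 128

# Stand-in for average-pooled embeddings of sampled sentences encoded by the
# frozen teacher (the paper fits PCA on ~100k randomly sampled sentences).
teacher_emb = rng.normal(size=(10_000, teacher_dim)).astype(np.float32)

pca = PCA(n_components=target_dim).fit(teacher_emb)   # fit once, then kept fixed
compressed = pca.transform(teacher_emb)               # (10000, 128) distillation targets
print(compressed.shape)
```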

  • In HPD the student model is guided using an MSE loss against teacher embeddings
    • the teacher embeddings are down-projected with the precomputed PCA (a code sketch follows this list)
  • In HPD the PCA projection is estimated from a randomly sampled subset of 100k sentences
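
A minimal sketch of the distillation term, assuming the PCA mean and components were precomputed as above and the teacher stays frozen; the variable names and the weight `lam` are illustrative, not the paper's.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_emb, teacher_emb, pca_components, pca_mean):
    """student_emb: (B, d_small); teacher_emb: (B, d_large) from the frozen teacher.
    pca_components: (d_small, d_large) and pca_mean: (d_large,) come from the
    precomputed PCA (e.g. sklearn's pca.components_ and pca.mean_)."""
    with torch.no_grad():
        # Apply the fixed PCA (mean-centering + projection) to the teacher output.
        target = (teacher_emb - pca_mean) @ pca_components.T   # (B, d_small)
    return F.mse_loss(student_emb, target)

# Total objective as I read the notes (lam is a hypothetical weight):
# loss = contrastive_loss(anchor, positive, hard_negative) \
#        + lam * distillation_loss(anchor_emb, teacher_anchor_emb, components, mean)
```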

  • HPD applies data augmentation using WordNet synonym substitution and back-translation
    • this expands the NLI training set to 3 million pairs
  • HPD explores projection and whitening as dimension-reduction baselines
    • projection adds one linear layer that down-projects to a lower dimension
    • whitening rotates the embeddings into the eigenbasis of their covariance and rescales each direction to unit variance (see the sketch below)
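
A minimal sketch of the whitening baseline as I understand it (in the style of BERT-whitening): rotate the embeddings into the eigenbasis of their covariance, rescale each direction to unit variance, and keep only the top-k directions to reduce the dimension. The dimensions and the random stand-in data are illustrative.

```python
import numpy as np

def whiten(embeddings, k=128):
    """Whitening with dimension reduction: rotate into the covariance eigenbasis,
    rescale each kept direction to unit variance, truncate to k dims."""
    mu = embeddings.mean(axis=0, keepdims=True)
    cov = np.cov(embeddings, rowvar=False)
    u, s, _ = np.linalg.svd(cov)          # eigenvectors / eigenvalues of the covariance
    w = u[:, :k] / np.sqrt(s[:k])         # keep top-k directions, scale to unit variance
    return (embeddings - mu) @ w          # (n, k) whitened embeddings

rng = np.random.default_rng(0)
fake_teacher_emb = rng.normal(size=(2000, 768))   # stand-in for real sentence embeddings
print(whiten(fake_teacher_emb).shape)             # (2000, 128)
```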

Results

  • HPD results: competitive performance with a vector size of only 128


  • HPD shows decent performance gains from its simple augmentation method


Conclusions

This paper presents a simple distillation method for sentence embedding models. The main idea is to train a small student model against the PCA-reduced embeddings of a larger, pretrained teacher model. It’s important to note that the teacher is fixed throughout: only the student is trained against it, so this approach could be applied post hoc to distill any larger sentence embedding model. An interesting ablation would be to distill the student against a teacher compressed with the projection or whitening dimension-reduction methods instead; that experiment would better convince me of the merits of using PCA for downsampling the teacher embeddings. I would also be interested in distilling the latest sentence embedding models like PromptBERT and DCPCSE. And, as the authors state, it would be interesting to explore the isotropy of the distilled vectors.

Reference

@misc{zhao2022compressing,
  doi       = {10.48550/ARXIV.2203.07687},
  url       = {https://arxiv.org/abs/2203.07687},
  author    = {Zhao, Xuandong and Yu, Zhiguo and Wu, Ming and Li, Lei},
  keywords  = {Computation and Language (cs.CL), Information Retrieval (cs.IR), FOS: Computer and information sciences},
  title     = {Compressing Sentence Representation for Semantic Retrieval via Homomorphic Projective Distillation},
  publisher = {arXiv},
  year      = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}
