GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation
Type: Paper
Introduction
- What is the name of the GPL paper?
- GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval
- Kexin Wang, Nandan Thakur, Nils Reimers, Iryna Gurevych
- What are the main contributions of the GPL paper?
- GPL technique for unsupervised domain adaptation
- outperform other methods on 6 datasets
- explore different pretraining approaches
- One problem with dense neural retrieval methods is that they do not generalize well out of domain
- for example, models trained on MS MARCO perform poorly on tasks related to COVID-19
- What are some pretraining tasks for dense retrievers?
- In-domain continued pretraining
- Inverse Cloze Task (ICT)
- Condenser
- SimCSE
- Contrastive Tension (CT)
- TSDAE
- Contrastive Tension (CT) (Carlsson et al., 2021) passes the input through two different models
- similar to the original asymmetric encoder setup from DPR
- Condenser (CD) (Gao and Callan, 2021) applies MLM on top of the CLS token embedding from the final layer and the other context embeddings from a previous layer
- the aim is to force the model to learn a meaningful CLS representation
- MoDIR uses Domain Adversarial Training (DAT) (Ganin et al., 2016) for unsupervised domain adaptation of dense retrievers
- MoDIR trains models by generating domain-invariant representations to attack a domain classifier
- Seems similar to Pavlos Adversarial Variational Domain Adaptive pretraining paper
- Might mess up the embedding space by forcing different domains to align
- Research shows that cross-encoders for dense retrieval are more robust to domain shifts
- however, due to efficiency reasons their application is limited to a reranking step
- One way to leverage cross-encoders for efficient retrieval is to use them to train bi-encoders
Method
- The GPL method consists of a query generation step, negative mining using a dense retriever, and pseudo-labeling with a cross-encoder
- To work, GPL simply needs paragraphs of in-domain text
- GPL performs query generation using T5 (see the sketch below)
- generate 3 queries for each in-domain passage
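A minimal sketch of this step (my own code, not the authors'); the checkpoint name is an assumption, any docT5query-style model should work:

```python
# Rough sketch of GPL's query-generation step; the checkpoint name is an assumption.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "BeIR/query-gen-msmarco-t5-base-v1"  # assumed docT5query-style model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def generate_queries(passage: str, n_queries: int = 3) -> list[str]:
    """Sample n synthetic queries for one in-domain passage."""
    inputs = tokenizer(passage, truncation=True, max_length=350, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,            # sampling gives diverse queries
        top_p=0.95,
        max_length=64,
        num_return_sequences=n_queries,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```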
- GPL mines negative samples using an existing dense retriever (see the sketch below)
- for each generated query, retrieve the 50 nearest passages and use them as hard negatives
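A sketch of the mining step with sentence-transformers; the retriever checkpoint and the toy corpus are assumptions:

```python
# Rough sketch of negative mining; retriever checkpoint and corpus are assumptions.
from sentence_transformers import SentenceTransformer, util

retriever = SentenceTransformer("msmarco-MiniLM-L-6-v3")  # assumed MS MARCO-trained retriever

corpus = ["in-domain passage 1 ...", "in-domain passage 2 ..."]  # target-domain passages
corpus_emb = retriever.encode(corpus, convert_to_tensor=True)

query = "a generated query"
query_emb = retriever.encode(query, convert_to_tensor=True)

# Top-50 most similar passages; everything except the source passage serves as a hard negative.
hits = util.semantic_search(query_emb, corpus_emb, top_k=50)[0]
negatives = [corpus[h["corpus_id"]] for h in hits]
```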
- GPL generates pseudo-labels using a cross-encoder (see the sketch below)
- the margin is computed between the cross-encoder score of the positive passage and that of the negative sample
- these margin scores are saved as pseudo-labels
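A sketch of the pseudo-labeling step; the cross-encoder checkpoint and the example texts are assumptions:

```python
# Rough sketch of pseudo-labeling with a cross-encoder; checkpoint name is an assumption.
from sentence_transformers import CrossEncoder

cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "a generated query"
positive = "the passage the query was generated from"
negative = "a mined hard-negative passage"

# Score both pairs and keep the margin as the pseudo-label for this (query, pos, neg) triple.
pos_score, neg_score = cross_encoder.predict([(query, positive), (query, negative)])
margin = pos_score - neg_score
```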
- In GPL, the MarginMSE loss trains the bi-encoder to mimic the margin predicted by the cross-encoder (see the sketch below)
- provides robustness to badly generated queries
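A minimal sketch of the MarginMSE idea in plain PyTorch (sentence-transformers also ships a MarginMSELoss, if I remember correctly); the dot-product scoring and shapes are assumptions:

```python
# Minimal MarginMSE sketch: the bi-encoder's score margin between positive and
# negative passages is regressed onto the cross-encoder's pseudo-label margin.
import torch
import torch.nn.functional as F

def margin_mse_loss(q_emb, pos_emb, neg_emb, teacher_margin):
    """q_emb, pos_emb, neg_emb: (batch, dim) bi-encoder embeddings;
    teacher_margin: (batch,) cross-encoder pseudo-labels."""
    student_margin = (q_emb * pos_emb).sum(-1) - (q_emb * neg_emb).sum(-1)  # dot-product scores
    return F.mse_loss(student_margin, teacher_margin)
```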
- The GPL bi-encoder model is DistilBERT with mean pooling (see the sketch below)
- uses docT5query for query generation
- uses msmarco-MiniLM-L-6-v3 as the dense retriever
- uses ms-marco-MiniLM-L-6-v2 as the cross-encoder
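A sketch of building that bi-encoder with sentence-transformers; the base checkpoint and max sequence length are assumptions, not the paper's exact settings:

```python
# Rough sketch of the DistilBERT + mean-pooling bi-encoder; hyperparameters are assumptions.
from sentence_transformers import SentenceTransformer, models

word_embedding = models.Transformer("distilbert-base-uncased", max_seq_length=350)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")
bi_encoder = SentenceTransformer(modules=[word_embedding, pooling])
```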
Results
- GPL results: best performance is obtained in combination with TSDAE pretraining
- GPL training curve: needs about 100k training steps
- One advantage of the GPL method is the fine-grained cross-encoder scores
Conclusions
It would be nice to see benchmarks against models with larger hidden sizes, as many practical use cases don't necessarily need to trade off performance for the smaller size of DistilBERT.
Reference
@article{Wang2021GPLGP,
title={GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval},
author={Kexin Wang and Nandan Thakur and Nils Reimers and Iryna Gurevych},
journal={ArXiv},
year={2021},
volume={abs/2112.07577}
}