UCTopic: Unsupervised Contrastive Learning for Phrase Representations
Introduction
This paper leverages a new contrastive pretraining objective along with cluster-assisted contrastive learning to generate high-quality phrase/entity embeddings.
What is the name of the UCTopic paper?
UCTopic: Unsupervised Contrastive Learning for Phrase Representations and Topic Mining
Jiacheng Li, Jingbo Shang, Julian McAuley, UCSD
UCTopic makes two assumptions about phrase semantics:
1) Phrase semantics are determined by their contexts
2) Phrases with the same mention (i.e., the same surface form) share the same semantics

- Why does UCTopic have to leverage clusters?
    - the small number of topics makes in-batch negatives unstable: with few topics, many in-batch "negatives" actually share the anchor phrase's topic
 
- What are the main contributions of the UCTopic paper?
    - an unsupervised contrastive learning method for phrases
    - a cluster-assisted negative sampling method
    - superior results on entity clustering and topic mining
 
 
Method
- UCTopic uses LUKE as a backbone model
    - a transformer-based model pretrained for high-quality entity embeddings
- When using LUKE as an encoder, pass in the sentence along with the left and right boundaries of the phrase of interest (a minimal encoding sketch follows below)
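
A minimal sketch of encoding a phrase in context with LUKE via HuggingFace Transformers; the public `studio-ousia/luke-base` checkpoint and the example sentence are illustrative stand-ins, not necessarily UCTopic's exact setup:

```python
# Minimal sketch (assumed setup): encode a phrase in context with LUKE.
import torch
from transformers import LukeModel, LukeTokenizer

tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")
model = LukeModel.from_pretrained("studio-ousia/luke-base")

text = "He lived on the east coast of the United States."
# Character-level (start, end) boundaries of the phrase of interest.
start = text.index("United States")
span = (start, start + len("United States"))

inputs = tokenizer(text, entity_spans=[span], return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per entity span; shape (batch=1, n_spans=1, hidden=768).
phrase_embedding = outputs.entity_last_hidden_state[0, 0]
print(phrase_embedding.shape)  # torch.Size([768])
```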
 
- LUKE is a RoBERTa-based model optimized for entity embeddings
- UCTopic contrastive objective: positive samples are different contexts of the same phrase
    - e.g., "He lived on the east coast of the United States" and "How much does it cost to fly to the United States?"
- mask out the phrase "United States" in each context and use the masked contexts as positive instances
- use in-batch negatives (a loss sketch follows below)
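
A minimal sketch of an InfoNCE-style contrastive loss with in-batch negatives, assuming `z1[i]` and `z2[i]` are embeddings of two different masked contexts of the same phrase (the function name and temperature value are illustrative):

```python
# Minimal sketch of an InfoNCE-style loss with in-batch negatives.
import torch
import torch.nn.functional as F

def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    # (batch, batch) similarity matrix: diagonal entries are positives,
    # off-diagonal entries are in-batch negatives.
    logits = z1 @ z2.T / temperature
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)
```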
 

- The UCTopic dataset is English Wikipedia with hyperlinks as phrases
    - Wikipedia hyperlinks are a good source of meaningful entities
 
- UCTopic finetuning is done with cluster-assisted negatives
    - compute clusters of phrase representations based on the pretrained UCTopic model (a sampling sketch follows below)
 

- The intuition behind cluster-assisted negative sampling in UCTopic is to mine more semantically different phrases as negative samples
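
A minimal sketch of cluster-assisted negative sampling as described above: cluster the phrase embeddings, then draw each phrase's negatives from other clusters. Function and variable names are illustrative, not UCTopic's actual implementation:

```python
# Minimal sketch: sample negatives for each phrase from *other* clusters.
import numpy as np
from sklearn.cluster import KMeans

def cluster_assisted_negatives(embeddings: np.ndarray, n_clusters: int,
                               n_negatives: int, seed: int = 0):
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    negatives = []
    for label in labels:
        # Candidate negatives: phrases assigned to a different cluster.
        candidates = np.flatnonzero(labels != label)
        negatives.append(rng.choice(candidates, size=n_negatives, replace=False))
    return labels, np.stack(negatives)
```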
Results
- UCTopic evaluation is done on entity clustering, showing that the model learns good contextual embeddings of entity phrases
    - identify Person, Location, and Organization entities
    - comparable performance to PhraseBERT
 

 - How is UCTopic applied to topic mining?
    
- mine 10k phrases from the dataset
- do K-means clustering on UCTopic phrase representations
    - choose the number of clusters using the silhouette score (see the sketch after this list)
 
- finetune UCTopic with CCL (cluster-assisted contrastive learning)
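
A minimal sketch of the cluster-count selection step, assuming `phrase_embeddings` is an (n_phrases, dim) array of UCTopic representations; the candidate range is an assumption:

```python
# Minimal sketch: pick the number of K-means clusters by silhouette score.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_num_clusters(phrase_embeddings: np.ndarray,
                        candidates=range(5, 51, 5)) -> int:
    best_k, best_score = None, -1.0
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(phrase_embeddings)
        score = silhouette_score(phrase_embeddings, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```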
 
- UCTopic can be used to generate more lexically diverse topic phrases
- How does UCTopic evaluate the results of topical phrase mining?
1) Topical Separation
- human evaluation: a phrase intrusion task (annotators pick the phrase that does not belong with the others)
 
2) Phrase Coherence
- based on annotator judgement
 
3) Phrase informativeness and diversity
- use TF-IDF to score phrase informativeness
- diversity: ratio of distinct words among top phrases (a sketch follows below)
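
A minimal sketch of the diversity metric as described (ratio of distinct words among a topic's top phrases); the function is purely illustrative:

```python
# Minimal sketch of the distinct-word-ratio diversity metric.
def distinct_word_ratio(top_phrases: list[str]) -> float:
    words = [w.lower() for phrase in top_phrases for w in phrase.split()]
    return len(set(words)) / len(words) if words else 0.0

# Repeated words ("new", "york") lower the ratio: 5 distinct / 7 total.
print(distinct_word_ratio(["new york", "new york city", "los angeles"]))  # ~0.71
```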
 
- The UCTopic precision-at-n metric measures whether the top phrases within a topic reflect that topic well (a sketch follows below)
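
A minimal sketch of precision@n, assuming `judgments[i]` records whether annotators judged the i-th ranked phrase as belonging to the topic; names and data are illustrative:

```python
# Minimal sketch: fraction of the top-n phrases judged on-topic.
def precision_at_n(judgments: list[bool], n: int) -> float:
    top = judgments[:n]
    return sum(top) / len(top) if top else 0.0

print(precision_at_n([True, True, False, True, False], n=3))  # 0.667 (2/3)
```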
Conclusions
It’s interesting to see that clustering on high-quality phrase representations gives good topics. Topic modeling is a niche but important application of short phrase representations. However, as evidenced by the convoluted evaluation approaches in this paper, evaluating topic mining methods is quite difficult. Overall I find the information retrieval results more compelling, and UCTopic outperforms a strong alternative phrase embedding method, PhraseBERT.
Reference
@article{li2022uctopic,
  title={UCTopic: Unsupervised Contrastive Learning for Phrase Representations and Topic Mining},
  author={Jiacheng Li and Jingbo Shang and Julian McAuley},
  year={2022},
  journal={arXiv preprint arXiv:2202.13469}
}