UCTopic: Unsupervised Contrastive Learning for Phrase Representations
This paper combines a new contrastive pretraining objective with cluster-assisted contrastive learning to produce high-quality phrase/entity embeddings.
What is the name of the UCTopic paper?
UCTopic: Unsupervised Contrastive Learning for Phrase Representations and Topic Mining
Jiacheng Li, Jingbo Shang, Julian McAuley, UCSD
UCTopic makes two assumptions about phrase semantics
1) Phrase semantics are determined by their contexts
2) Phrases that have the same mentions have the same semantics
- Why does UCTopic have to leverage clusters?
- with only a small number of topics, in-batch negatives often share the anchor's topic, which makes training unstable
- What are the main contributions of the UCTopic paper?
- an unsupervised contrastive learning method for phrase representations
- a cluster-assisted negative sampling method
- superior results on entity clustering and topic mining
- UCTopic uses LUKE as a backbone model
- LUKE is a transformer-based model pretrained to produce good entity embeddings
- When using LUKE as an encoder, pass in the sentence together with the left and right boundaries of the phrase of interest
- LUKE is a RoBERTa-based model that is optimized for entity embeddings
- UCTopic contrastive objective: positive samples are different contexts for the same phrase
- ex: “He lived on the east coast of the United States” and “How much does it cost to fly to the United States”.
- mask out the phrase “United States” in both sentences and use the two masked sentences as a positive pair
- use in-batch negatives
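The objective above can be sketched as an InfoNCE-style loss with in-batch negatives. This is a minimal illustration in plain Python, not the authors' implementation: the function name `in_batch_contrastive_loss`, the cosine similarity, and the temperature value are assumptions made for the example. Row i of `anchors` and row i of `positives` stand for the same phrase in two different contexts; every other row in the batch serves as a negative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def in_batch_contrastive_loss(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss; positives[j] with j != i act as negatives for anchors[i].

    The temperature of 0.1 is an illustrative assumption, not a value from the paper.
    """
    anchors = [normalize(v) for v in anchors]
    positives = [normalize(v) for v in positives]
    losses = []
    for i, a in enumerate(anchors):
        # Cosine similarity of anchor i to every candidate in the batch.
        logits = [dot(a, p) / temperature for p in positives]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

# Toy batch: two phrases, each embedded in two different contexts.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_contrastive_loss(anchors, [[0.9, 0.1], [0.1, 0.9]])
shuffled = in_batch_contrastive_loss(anchors, [[0.1, 0.9], [0.9, 0.1]])
```

The loss is low when each anchor is closest to its own positive (`aligned`) and high when the pairing is wrong (`shuffled`).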
- The UCTopic dataset is English Wikipedia with hyperlinks as phrases
- Wikipedia hyperlinks are a good source of meaningful entities
- UCTopic finetuning is done with cluster-assisted negatives
- compute clusters of phrase representations using the pretrained UCTopic model
- The intuition behind cluster-assisted negative sampling in UCTopic is to mine more different phrases as negative samples
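A rough sketch of that intuition: once every phrase has a cluster id, negatives for an anchor are drawn only from phrases in other clusters, so a negative is less likely to share the anchor's topic. The helper `sample_cluster_negatives` and the toy cluster ids are hypothetical and exist only for illustration; the paper's actual sampling procedure may differ in its details.

```python
import random

def sample_cluster_negatives(anchor_idx, cluster_ids, num_negatives, rng):
    """Pick negatives for one anchor from phrases assigned to other clusters."""
    anchor_cluster = cluster_ids[anchor_idx]
    # Only phrases outside the anchor's cluster are eligible as negatives.
    candidates = [i for i, c in enumerate(cluster_ids) if c != anchor_cluster]
    return rng.sample(candidates, min(num_negatives, len(candidates)))

# Toy cluster assignment for six phrases (assumed output of a clustering step).
cluster_ids = [0, 0, 1, 1, 2, 2]
rng = random.Random(0)
negatives = sample_cluster_negatives(0, cluster_ids, 3, rng)
```

Unlike in-batch negatives, this never selects phrase 1, which shares cluster 0 with the anchor.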
- UCTopic evaluation is done on entity clustering, showing that the model learns good contextual embeddings of entity phrases
- Identify Person, Location and Organization entities
- outperforms Phrase-BERT
- How is UCTopic applied to topic mining?
- mine 10k phrases from the dataset
- run K-means clustering on UCTopic phrase representations
- choose the number of clusters using the Silhouette Score
- finetune UCTopic with cluster-assisted contrastive learning (CCL)
- UCTopic can be used to generate more lexically diverse topic phrases
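The cluster-count selection step in the topic mining pipeline can be illustrated with a small pure-Python silhouette computation. This is a sketch under assumptions: the 2-D points stand in for phrase embeddings, and the two label lists are hypothetical K-means outputs for k = 2 and k = 3. In practice one would more likely rely on a library such as scikit-learn (`KMeans`, `silhouette_score`) rather than this hand-rolled version.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def silhouette_score(points, labels):
    """Mean silhouette coefficient; assumes at least two clusters are present."""
    scores = []
    for i, p in enumerate(points):
        dists = {}  # cluster label -> distances from p to that cluster's other members
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(euclidean(p, q))
        own = dists.pop(labels[i], [])
        if not own:  # singleton cluster: silhouette is defined as 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                            # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in dists.values())   # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy "phrase embeddings" forming three tight groups.
points = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]]
# Hypothetical cluster assignments for two candidate values of k.
candidates = {2: [0, 0, 1, 1, 1, 1], 3: [0, 0, 1, 1, 2, 2]}
best_k = max(candidates, key=lambda k: silhouette_score(points, candidates[k]))
```

Here the three-cluster assignment scores higher because it does not merge two well-separated groups.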
How does UCTopic evaluate the results of topical phrase mining?
1) Topical Separation
- human evaluation: phrase intrusion task
2) Phrase Coherence
- based on annotator judgement
3) Phrase informativeness and diversity
- informativeness: use TF-IDF
- diversity: ratio of distinct words among top phrases
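The diversity measure can be read as a simple distinct-word ratio. The sketch below assumes whitespace tokenization and lowercasing, which the notes do not specify; `distinct_word_ratio` and the example phrases are illustrative only.

```python
def distinct_word_ratio(phrases):
    """Fraction of words across the phrases that are distinct (1.0 = no repeats)."""
    words = [w.lower() for phrase in phrases for w in phrase.split()]
    return len(set(words)) / len(words) if words else 0.0

# A topic whose top phrases keep repeating one word scores lower.
repetitive = distinct_word_ratio(["pizza place", "pizza shop", "pizza restaurant"])
diverse = distinct_word_ratio(["thin crust", "wood oven", "friendly staff"])
```

A higher ratio means the top phrases of a topic reuse fewer words.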
- UCTopic's precision-at-n metric measures whether the top n phrases within a topic reflect that topic well
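As a minimal sketch, precision at n is the fraction of the top n phrases that annotators judged to fit the topic. The judgement list below is made up for illustration, and `precision_at_n` is a hypothetical helper rather than code from the paper.

```python
def precision_at_n(judgements, n):
    """Fraction of the top-n ranked phrases judged as belonging to the topic."""
    top = judgements[:n]
    return sum(top) / len(top) if top else 0.0

# Hypothetical annotator judgements for one topic's ranked phrases.
judgements = [True, True, False, True, False]
p_at_3 = precision_at_n(judgements, 3)
p_at_5 = precision_at_n(judgements, 5)
```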
It’s interesting to see that clustering on high-quality phrase representations yields good topics. Topic modeling is a niche but important application of short phrase representations. However, as the convoluted evaluation approaches in this paper show, evaluating topic mining methods is difficult. Overall, I find the entity clustering results more compelling: there, UCTopic outperforms a strong alternative phrase embedding method, Phrase-BERT.
title={UCTopic: Unsupervised Contrastive Learning for Phrase Representations and Topic Mining},
author={Jiacheng Li and Jingbo Shang and Julian McAuley},
journal={arXiv preprint arXiv:2202.13469}