CDLM: Cross-Document Language Modeling
Introduction
-
What is the name of the CDLM paper?
CDLM: Cross-Document Language Modeling
AI2 and Bar-Ilan University
- What are the main contributions of the CDLM paper?
- multi-document pretraining task
- dynamic global attention pattern
- SOTA on several multi-document tasks
- Why is dealing with multiple texts important in NLP?
- cross-document coreference resolution
- classifying relations between document pairs
- multi-hop question answering
Method
-
CDLM's two main ideas are
multi-document pretraining and dynamic Longformer global attention
-
CDLM can handle any number of documents that
fit into the Longformer context window (4096 tokens)
-
Longformer attention patterns
global + sliding window
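To make the global + sliding-window pattern concrete, here is a minimal sketch using the Hugging Face `LongformerModel`; which tokens get global attention (here only the first token) is chosen by the caller, not by the model.

```python
# Minimal sketch of Longformer's two attention patterns (Hugging Face Transformers):
# every token gets local sliding-window attention, and tokens flagged in
# `global_attention_mask` additionally attend to, and are attended by, the whole sequence.
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

inputs = tokenizer("An example document.", return_tensors="pt")
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1  # manually mark the first token as a global token

outputs = model(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    global_attention_mask=global_attention_mask,
)
print(outputs.last_hidden_state.shape)  # (1, seq_len, hidden_size)
```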
-
Longformer applies global attention to
manually specified tokens
- CDLM pretrains on document clusters from the
Multi-News dataset
- dataset originally intended for multi-document summarization
- CDLM pretraining input is
concatenated related documents
- use special document separator tokens
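A rough sketch of how such an input could be assembled; the `<doc-s>`/`</doc-s>` separator strings follow the paper, but treat the exact token names and the helper below as illustrative.

```python
# Sketch of a CDLM-style pretraining input: documents from one Multi-News
# cluster are wrapped in boundary tokens and concatenated into a single
# sequence that must fit in the 4096-token Longformer window.
from transformers import LongformerTokenizer

DOC_START, DOC_END = "<doc-s>", "</doc-s>"  # document separator tokens (per the paper)

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
tokenizer.add_special_tokens({"additional_special_tokens": [DOC_START, DOC_END]})

def build_cluster_input(documents, max_length=4096):
    """Concatenate a cluster of related documents into one CDLM input."""
    text = " ".join(f"{DOC_START} {doc} {DOC_END}" for doc in documents)
    return tokenizer(text, truncation=True, max_length=max_length, return_tensors="pt")

cluster = ["First article about an event ...", "Second article about the same event ..."]
batch = build_cluster_input(cluster)
print(batch["input_ids"].shape)
```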
-
CDLM pretraining: masked tokens are allowed to
attend to the full sequence via global attention
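A hedged sketch of that pretraining-time attention assignment (the masking rate and helper name are illustrative, not taken from the paper's code):

```python
# Sketch: during cross-document masked LM, positions chosen for masking are
# given global attention so they can gather evidence from every document in
# the cluster; all other positions keep only local sliding-window attention.
import torch

def mask_and_set_global_attention(input_ids, mask_token_id, mask_prob=0.15):
    """Randomly mask tokens and grant global attention at exactly those positions."""
    mask = torch.rand(input_ids.shape) < mask_prob
    labels = input_ids.clone()
    labels[~mask] = -100                 # compute MLM loss only on masked positions
    corrupted = input_ids.clone()
    corrupted[mask] = mask_token_id      # replace chosen tokens with <mask>
    global_attention_mask = mask.long()  # 1 = global attention at masked positions
    return corrupted, labels, global_attention_mask
```

These three tensors would then be fed to a `LongformerForMaskedLM`-style model as `input_ids`, `labels`, and `global_attention_mask`.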
-
CDLM ablations: Prefix CDLM uses
BigBird-style global attention (global attention over a fixed prefix of the sequence)
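For contrast with the dynamic scheme above, a sketch of the prefix-style global attention mask used in that ablation; the prefix length is an arbitrary placeholder.

```python
# Sketch: prefix CDLM assigns global attention to a fixed block of leading
# positions rather than dynamically to the masked tokens.
import torch

def prefix_global_attention_mask(input_ids, prefix_len=128):
    """Global attention on the first `prefix_len` positions, local elsewhere."""
    gmask = torch.zeros_like(input_ids)
    gmask[:, :prefix_len] = 1
    return gmask
```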
Results
-
CDLM results: strong results on
cross-document coreference resolution
- also performs well on document matching tasks
Conclusions
This paper explores a simple method for pretraining language models for cross-document understanding. It leverages the Longformer backbone model to make use of the extended context length.
Reference
@article{caciularu2021cdlm,
title = {CDLM: Cross-Document Language Modeling},
author = {Avi Caciularu and Arman Cohan and Iz Beltagy and Matthew E. Peters and Arie Cattan and Ido Dagan},
year = {2021},
journal = {arXiv preprint arXiv:2101.00406}
}