CDLM: Cross-Document Language Model


Introduction

  • What is the name of the CDLM paper?

    CDLM: Cross-Document Language Modeling (AI2)

  • What are the main contributions of the CDLM paper?
    • a multi-document pretraining objective
    • a dynamic global attention pattern
    • SOTA on some multi-document tasks
  • Why is dealing with multiple texts important in NLP?
    • cross-document coreference resolution
    • classifying relations between document pairs
    • multi-hop question answering

Method

  • CDLM's two main ideas are multi-document pretraining and a dynamic Longformer global attention pattern

  • CDLM can handle as many documents as fit into the Longformer context window (4,096 tokens)

  • Longformer attention pattern: global attention plus a sliding local window

  • Longformer applies global attention to manually specified tokens

  • CDLM pretrains on document clusters from the Multi-News dataset
    • a dataset originally built for multi-document summarization
  • The CDLM pretraining input is a concatenation of the related documents
    • special document separator tokens mark document boundaries

  • CDLM pretraining: masked tokens are given global attention, so each prediction can attend to the full cross-document sequence (see the sketch after this list)

  • CDLM ablations: the Prefix CDLM variant applies BigBird-style global attention to a fixed prefix of the sequence instead of to the masked tokens

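A minimal sketch of this pretraining setup, in PyTorch with Hugging Face `transformers`. It is not the authors' training code: the public `allenai/longformer-base-4096` checkpoint stands in for the CDLM initialization, the `<doc-s>`/`</doc-s>` separator strings and the 15% masking rate are assumptions based on the paper, and the example cluster is made up.

```python
import torch
from transformers import LongformerTokenizerFast, LongformerForMaskedLM

# Public Longformer checkpoint stands in for the CDLM starting point.
tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
model = LongformerForMaskedLM.from_pretrained("allenai/longformer-base-4096")

# Document-boundary separator tokens (token strings assumed here).
tokenizer.add_special_tokens({"additional_special_tokens": ["<doc-s>", "</doc-s>"]})
model.resize_token_embeddings(len(tokenizer))

# One cluster of related documents (e.g., a Multi-News cluster), concatenated
# into a single input that has to fit in the 4,096-token window.
docs = [
    "First news article describing the event ...",
    "Second article covering the same event ...",
]
text = " ".join(f"<doc-s> {d} </doc-s>" for d in docs)
enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

input_ids = enc["input_ids"]
labels = input_ids.clone()

# Randomly mask ~15% of the non-special tokens for the cross-document MLM objective.
special = torch.tensor(
    tokenizer.get_special_tokens_mask(input_ids[0].tolist(), already_has_special_tokens=True),
    dtype=torch.bool,
).unsqueeze(0)
masked = (torch.rand(input_ids.shape) < 0.15) & ~special
input_ids = input_ids.masked_fill(masked, tokenizer.mask_token_id)
labels[~masked] = -100  # compute the loss only on masked positions

# Dynamic global attention: every masked token attends to (and is attended by)
# the whole multi-document sequence, not just its local sliding window.
global_attention_mask = masked.long()

out = model(
    input_ids=input_ids,
    attention_mask=enc["attention_mask"],
    global_attention_mask=global_attention_mask,
    labels=labels,
)
print(float(out.loss))
```

Because the masked positions carry global attention, each prediction can draw on evidence from every document in the cluster rather than only its local window, which is what pushes the model toward cross-document alignment.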

Results

  • CDLM results: strong performance on cross-document coreference resolution

    • also performs well on document matching tasks

Conclusions

This paper explores a simple method for training language models for cross-document understanding. The authors leverage the Longformer backbone to make use of its extended context length.
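
For completeness, a hedged sketch of loading a released CDLM checkpoint for downstream use: the Hugging Face model id `biu-nlp/cdlm` is an assumption here, as is the choice of placing global attention on the CLS token.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Model id assumed; substitute whatever id the authors actually released under.
tokenizer = AutoTokenizer.from_pretrained("biu-nlp/cdlm")
model = AutoModel.from_pretrained("biu-nlp/cdlm")

# Same input format as pretraining: related documents wrapped in separator tokens.
text = "<doc-s> first document ... </doc-s> <doc-s> second document ... </doc-s>"
enc = tokenizer(text, return_tensors="pt")

# For downstream tasks, global attention is placed on task-relevant tokens
# (e.g., the CLS token or candidate mention spans) instead of masked tokens.
global_attention_mask = torch.zeros_like(enc["input_ids"])
global_attention_mask[:, 0] = 1  # CLS token

outputs = model(**enc, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)
```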

Reference

@article{caciularu2021cdlm,
  title   = {CDLM: Cross-Document Language Modeling},
  author  = {Avi Caciularu and Arman Cohan and Iz Beltagy and Matthew E. Peters and Arie Cattan and Ido Dagan},
  year    = {2021},
  journal = {arXiv preprint arXiv:2101.00406}
}
