Applying collocation segmentation to the ACL anthology reference corpus

  • Authors: Vidas Daudaravičius
  • Affiliations: Vytautas Magnus University/Vileikos, Lithuania
  • Venue: ACL '12 Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries
  • Year: 2012

Abstract

Collocation is a well-known linguistic phenomenon with a long history of research and use. In this study I employ collocation segmentation to extract terms from the large and complex ACL Anthology Reference Corpus (ACL ARC), and I also briefly examine and describe the history of the ACL. The results show that until 1986, the most significant terms were related to formal/rule-based methods. Starting in 1987, terms related to statistical methods, such as language model, similarity measure, and text classification, became more important. In 1990, the terms Penn Treebank, Mutual Information, statistical parsing, bilingual corpus, and dependency tree became the most important, showing that newly released language resources appeared together with many new research areas in computational linguistics. Although Penn Treebank was a significant term only temporarily in the early nineties, the corpus is still used by researchers today. The most recent significant terms are BLEU score and semantic role labeling. While machine translation is a significant term across the ACL ARC as a whole, it is not significant for any particular time period. This shows that a term can be significant globally while remaining insignificant locally.