Identifying comparable corpora using LDA

  • Authors:
  • Judita Preiss

  • Affiliations:
  • University of Sheffield, Sheffield, United Kingdom

  • Venue:
  • NAACL HLT '12 Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
  • Year:
  • 2012

Quantified Score

Hi-index 0.00

Visualization

Abstract

Parallel corpora have applications in many areas of Natural Language Processing, but are very expensive to produce. Much information can be gained from comparable texts, and we present an algorithm which, given any bodies of text in multiple languages, uses existing named entity recognition software and topic detection algorithm to generate pairs of comparable texts without requiring a parallel corpus training phase. We evaluate the system's performance firstly on data from the online newspaper domain, and secondly on Wikipedia cross-language links.