A study in language identification

  • Authors:
  • Rachel Mary Milne; Richard A. O'Keefe; Andrew Trotman

  • Affiliations:
  • University of Otago, Dunedin, New Zealand;University of Otago, Dunedin, New Zealand;University of Otago, Dunedin, New Zealand

  • Venue:
  • Proceedings of the Seventeenth Australasian Document Computing Symposium
  • Year:
  • 2012

Abstract

Language identification is the task of automatically determining the language in which a previously unseen document was written. We compared several prior methods on samples from the Wikipedia and EuroParl collections. Most of these methods work well. However, we find that these collections (and presumably other document collections) are heterogeneous in document size, and that short documents are systematically different from long ones: the techniques that work well on long documents are different from those that work well on short ones. We believe that algorithms will improve if document length is taken into account.