A study in language identification

Authors:
Rachel Mary Milne;Richard A. O'Keefe;Andrew Trotman
Affiliations:
University of Otago, Dunedin, New Zealand;University of Otago, Dunedin, New Zealand;University of Otago, Dunedin, New Zealand
Venue:
Proceedings of the Seventeenth Australasian Document Computing Symposium
Year:
2012

Abstract

Language identification is the task of automatically determining the language in which a previously unseen document was written. We compared several existing methods on samples from the Wikipedia and EuroParl collections. Most of these methods work well. However, we find that these collections (and presumably other document collections) are heterogeneous in document length, and that short documents are systematically different from long ones. We also find that the techniques that work well on long documents differ from those that work well on short ones. We believe that algorithms will improve if document length is taken into account.