Linguini: Language Identification for Multilingual Documents

Authors:
John M. Prager
Venue:
HICSS '99: Proceedings of the Thirty-Second Annual Hawaii International Conference on System Sciences, Volume 2
Year:
1999

Abstract

We present Linguini, a vector-space-based categorizer tailored for high-precision language identification. We show how its accuracy depends on the size of the input document, the set of languages under consideration, and the features used. We found that Linguini could identify the language of documents as short as 5-10% of the size of an average Web document with 100% accuracy. We also show how to determine whether a document is written in two or more languages, and in what proportions, without incurring any appreciable computational overhead beyond the monolingual analysis. The same approach can be applied to subject-categorization systems: when the system recommends two or more categories, it distinguishes documents that belong strongly to all of them from documents that really belong to none.
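
The abstract gives no implementation details, so the following is only a minimal sketch of what vector-space language identification can look like, not the paper's actual method. It assumes character-trigram count vectors as features and cosine similarity as the score. The tiny `SAMPLES` texts, the function names (`ngrams`, `cosine`, `identify`, `mixture`), and the grid search used to estimate a two-language proportion are all hypothetical choices made for illustration.

```python
from collections import Counter
from math import sqrt


def ngrams(text, n=3):
    """Count character n-grams of lowercased, whitespace-normalized text."""
    text = " " + " ".join(text.lower().split()) + " "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def norm(vec):
    """Euclidean length of a sparse vector stored as a dict."""
    return sqrt(sum(v * v for v in vec.values()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    na, nb = norm(a), norm(b)
    if na == 0 or nb == 0:
        return 0.0
    return sum(v * b.get(k, 0) for k, v in a.items()) / (na * nb)


def unit(vec):
    """Scale a sparse vector to unit length (assumes it is non-empty)."""
    length = norm(vec)
    return {k: v / length for k, v in vec.items()}


# Hypothetical toy training data; a real system would use large corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "fr": "le renard brun saute par dessus le chien paresseux et le chat dort dans la maison",
}
PROFILES = {lang: ngrams(text) for lang, text in SAMPLES.items()}


def identify(text):
    """Return (best_language, scores) using cosine similarity to each profile."""
    doc = ngrams(text)
    scores = {lang: cosine(doc, prof) for lang, prof in PROFILES.items()}
    return max(scores, key=scores.get), scores


def mixture(text, lang_a, lang_b, steps=20):
    """Grid-search the weight alpha of lang_a in a two-language blend.

    Illustrative assumption: the document is compared against
    alpha * profile_a + (1 - alpha) * profile_b for alpha in [0, 1].
    Returns (best_alpha, best_score).
    """
    doc = ngrams(text)
    pa, pb = unit(PROFILES[lang_a]), unit(PROFILES[lang_b])
    keys = set(pa) | set(pb)
    best_alpha, best_score = 0.0, -1.0
    for i in range(steps + 1):
        alpha = i / steps
        blend = {k: alpha * pa.get(k, 0.0) + (1 - alpha) * pb.get(k, 0.0) for k in keys}
        score = cosine(doc, blend)
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, best_score


if __name__ == "__main__":
    print(identify("the dog sleeps"))
    print(mixture("the cat sleeps in the house le chat dort dans la maison", "en", "fr"))
```

In this sketch the two-language estimate reuses the same profiles and document vector as the monolingual step, which loosely mirrors the abstract's point that multilingual analysis need not add much cost. With such small samples the scores are only indicative.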