Linguini: language identification for multilingual documents

Authors:
John M. Prager
Venue:
Journal of Management Information Systems - Special section: Exploring the outlands of the MIS discipline
Year:
1999

Abstract

Given the vast and still growing availability of electronic documents from around the world, it is becoming increasingly important for managers of the information systems that store these documents to sort or tag them, so that end users can most readily reach the documents of most interest and use to them. In our context, that means documents in a language they can understand. Linguini is a vector-space-based categorizer tailored for high-precision language identification. This paper determines the functional dependencies of Linguini's performance and demonstrates that it can identify the language of documents as short as 5 to 10 percent of the size of average Web documents with 100 percent accuracy. It also describes how to determine whether a document is written in two or more languages without incurring any appreciable extra computational overhead. The same approach applies equally to subject-categorization systems: when the system recommends two or more categories, it can distinguish a document that belongs strongly to all of them from one that really belongs to none.
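The general idea of vector-space language identification can be illustrated with a minimal sketch. It is not Linguini's implementation: the character-trigram features, the cosine scoring, the toy training sentences, and the names `ngram_vector`, `cosine`, `identify`, and `PROFILES` are all assumptions made for illustration, and the paper's multilingual test is not reproduced here.

```python
from collections import Counter
from math import sqrt

def ngram_vector(text, n=3):
    """Count overlapping character n-grams (assumed feature set, not Linguini's)."""
    s = " ".join(text.lower().split())  # normalize case and whitespace
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def cosine(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def identify(text, profiles):
    """Return (language, score) pairs, best match first."""
    doc = ngram_vector(text)
    scores = {lang: cosine(doc, prof) for lang, prof in profiles.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy training text (hypothetical); a real system would use large corpora.
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog "
               "and the cat is in the house with the children",
    "french": "le renard brun saute par dessus le chien paresseux "
              "et le chat est dans la maison avec les enfants",
}
PROFILES = {lang: ngram_vector(sample) for lang, sample in TRAINING.items()}

if __name__ == "__main__":
    for lang, score in identify("the dog is in the house", PROFILES):
        print(f"{lang}: {score:.3f}")
```

Each language is represented by one vector, and a document is assigned to the language whose vector lies closest in angle. Because `identify` returns the full ranking rather than only the winner, the gap between the top two scores is visible to the caller, which is the kind of information a multi-language or "none of the above" decision would need.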