The automatic identification of languages using linguistic recognition signals
This paper addresses the problem of automatically identifying the language of a given Web document. Previous experiments in language guessing focused on analyzing "coherent" text sentences, whereas this work was validated on texts taken from the Web, which often present harder problems. Our language "guessing" software uses a well-known n-gram based algorithm, complemented with heuristics and a new similarity measure. Both fast and robust, the software has been in use for the past two years as part of a crawler for a search engine. Experiments show that it achieves very high accuracy in discriminating between different languages on Web pages.
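The abstract does not spell out the algorithm, so the following is only a minimal sketch of one common n-gram approach to language guessing: ranked character n-gram profiles compared with an "out-of-place" rank distance. It is not the paper's implementation, and it does not reproduce the paper's heuristics or its new similarity measure. The function names, the `top_k` and `n_max` parameters, the missing-gram penalty, and the two tiny training samples are all assumptions made for illustration.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Map each of the top_k most frequent character n-grams (n = 1..n_max) to its rank."""
    # Normalize case and whitespace, and pad so word boundaries become n-grams too.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending frequency; break ties lexicographically so ranks are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: rank for rank, (gram, _) in enumerate(ranked[:top_k])}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    max_penalty = len(lang_profile)  # assumed penalty for an unseen n-gram
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def guess_language(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))


# Hypothetical toy training data; a real system would use far larger corpora.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog "
          "sleeps in the sun while the fox runs away through the fields",
    "pt": "a raposa castanha salta sobre o cão preguiçoso e depois o cão "
          "dorme ao sol enquanto a raposa foge pelos campos",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in TRAINING.items()}

if __name__ == "__main__":
    print(guess_language("the dog and the fox are in the fields", PROFILES))
```

With training samples this small the guesses on unseen text are unreliable. Web pages would also need markup stripping and character-encoding handling before profiling, which this sketch omits.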