Language identification in web pages

Authors:
Bruno Martins; Mário J. Silva
Affiliations:
Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal; Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal
Venue:
Proceedings of the 2005 ACM symposium on Applied computing
Year:
2005

Citing 10
Cited 20

The automatic identification of languages using linguistic recognition signals
The limits of Web metadata, and beyond

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Mining the Web's Link Structure

Computer
An Information-Theoretic Definition of Similarity

ICML '98 Proceedings of the Fifteenth International Conference on Machine Learning
An information-theoretic measure for document similarity

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval
Character N-Gram Tokenization for European Language Text Retrieval

Information Retrieval
Automatic language identification of written texts

Proceedings of the 2004 ACM symposium on Applied computing
Language determination: natural language processing from scanned document images

ANLC '94 Proceedings of the fourth conference on Applied natural language processing
An empirical study of smoothing techniques for language modeling

ACL '96 Proceedings of the 34th annual meeting on Association for Computational Linguistics
Using the structure of HTML documents to improve retrieval

USITS'97 Proceedings of the USENIX Symposium on Internet Technologies and Systems

Challenges and resources for evaluating geographical IR

Proceedings of the 2005 workshop on Geographic information retrieval
New specialist tools for medieval document XML markup

Proceedings of the 2007 ACM symposium on Applied computing
The Viúva Negra crawler: an experience report

Software—Practice & Experience
A Variant of N-Gram Based Language Classification

AI*IA '07 Proceedings of the 10th Congress of the Italian Association for Artificial Intelligence on AI*IA 2007: Artificial Intelligence and Human-Oriented Computing
Web page language identification based on URLs

Proceedings of the VLDB Endowment
Current research issues and trends in non-English Web searching

Information Retrieval
A user-centric approach to identifying best deployment strategies for language tools: the impact of content and access language on Web user behaviour and attitudes

Information Retrieval
Study of some distance measures for language and encoding identification

LD '06 Proceedings of the Workshop on Linguistic Distances
Arabic script language identifications using adaptive neural network

ACST '08 Proceedings of the Fourth IASTED International Conference on Advances in Computer Science and Technology
Language identification: the long and the short of the matter

HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Arabic script web page language identifications using decision tree neural networks

Pattern Recognition
Improve feature selection method of web page language identification using fuzzy ARTMAP

International Journal of Intelligent Information and Database Systems
Language identification in multi-lingual web-documents

NLDB'06 Proceedings of the 11th international conference on Applications of Natural Language to Information Systems
Semi-automatic creation and maintenance of web resources with webtopic

EWMF'05/KDO'05 Proceedings of the 2005 joint international conference on Semantics, Web and Mining
Drive-by language identification: a byproduct of applied prototype semantics

CICLing'10 Proceedings of the 11th international conference on Computational Linguistics and Intelligent Text Processing
Yet another language identifier

EACL '12 Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics
A Comprehensive Study of Techniques for URL-Based Web Page Language Classification

ACM Transactions on the Web (TWEB)
Microblog language identification: overcoming the limitations of short, unedited and idiomatic text

Language Resources and Evaluation
Benchmarking web accessibility evaluation tools: measuring the harm of sole reliance on automated tests

Proceedings of the 10th International Cross-Disciplinary Conference on Web Accessibility
Determining language variant in microblog messages

Proceedings of the 28th Annual ACM Symposium on Applied Computing

Abstract

This paper addresses the problem of automatically identifying the language of a given Web document. Previous experiments in language guessing focused on analyzing "coherent" text sentences, whereas this work was validated on texts from the Web, which often present harder problems. Our language "guessing" software uses a well-known n-gram based algorithm, complemented with heuristics and a new similarity measure. Fast and robust, the software has been in use for the past two years as part of a crawler for a search engine. Experiments show that it achieves very high accuracy in discriminating between languages on Web pages.
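
To make the idea of n-gram based language guessing concrete, here is a minimal sketch of one well-known approach: ranked character n-gram profiles compared with the "out-of-place" rank distance of Cavnar and Trenkle. This is an illustration under stated assumptions, not the authors' software. The abstract does not name the exact algorithm, and neither the heuristics nor the new similarity measure it mentions are reproduced here. The function names, the tiny training strings, and the parameters n_max and top_k are all hypothetical.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Return the top_k most frequent character n-grams (length 1..n_max), ranked."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # underscores mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Most frequent first; ties broken alphabetically so the ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams absent from lang_profile get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def guess_language(text, profiles):
    """Pick the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Toy training data (an assumption for illustration; real profiles need far more text).
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and this is the language of the web",
    "pt": "a rápida raposa castanha salta sobre o cão preguiçoso e esta é a língua da web",
}
PROFILES = {lang: ngram_profile(text) for lang, text in TRAINING.items()}

if __name__ == "__main__":
    print(guess_language("the lazy dog jumps over the fox", PROFILES))
```

A Web-oriented identifier would presumably also have to cope with markup, very short pages, and mixed-language content, which is where heuristics such as those mentioned in the abstract would come in. This sketch ignores all of that and works on plain text only.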