Language identification in web pages

  • Authors: Bruno Martins; Mário J. Silva

  • Affiliations: Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal (both authors)

  • Venue: Proceedings of the 2005 ACM Symposium on Applied Computing
  • Year: 2005

Abstract

This paper discusses the problem of automatically identifying the language of a given Web document. Previous experiments in language guessing focused on analyzing "coherent" text sentences, whereas this work was validated on texts from the Web, often presenting harder problems. Our language "guessing" software uses a well-known n-gram based algorithm, complemented with heuristics and a new similarity measure. Both fast and robust, the software has been in use for the past two years, as part of a crawler for a search engine. Experiments show that it achieves very high accuracy in discriminating different languages on Web pages.
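
The abstract does not name the n-gram algorithm or describe the heuristics and the new similarity measure, so the following is only an illustrative sketch of one widely used approach: ranked character n-gram profiles compared with an "out-of-place" distance, commonly attributed to Cavnar and Trenkle. The function names, the n-gram range (1 to 3), the profile size (300) and the tiny training samples are all assumptions made for illustration; none of them come from the paper.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Map the top_k most frequent character n-grams (n = 1..n_max) to their rank.

    n_max and top_k are illustrative defaults, not values from the paper.
    """
    # Lowercase and collapse whitespace, then pad so word boundaries form n-grams.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            if gram.strip():  # ignore n-grams made only of spaces
                counts[gram] += 1
    # Sort by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return {gram: rank for rank, gram in enumerate(ranked[:top_k])}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def guess_language(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc_profile = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]))


# Hypothetical, far-too-small training samples, used only to show the call pattern.
SAMPLES = {
    "en": "This paper discusses the problem of identifying the language of a document.",
    "pt": "Este artigo discute o problema de identificação da língua de um documento.",
}
profiles = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}
```

With profiles built this way, `guess_language(page_text, profiles)` returns the key of the closest profile. Realistic use would require much larger training text per language, and Web pages would first need markup stripped, a step this sketch leaves out.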