Web page language identification based on URLs

Authors:
Eda Baykan;Monika Henzinger;Ingmar Weber
Affiliations:
Ecole Polytechnique Fédérale de Lausanne, LTAA, Lausanne, Switzerland;Ecole Polytechnique Fédérale de Lausanne & Google, LTAA, Lausanne, Switzerland;Ecole Polytechnique Fédérale de Lausanne, LTAA, Lausanne, Switzerland
Venue:
Proceedings of the VLDB Endowment
Year:
2008

Abstract

Given only the URL of a web page, can we identify its language? This is the question that we examine in this paper. Such a language classifier is useful, for example, for crawlers of web search engines, which frequently try to satisfy certain language quotas. To determine the language of an uncrawled web page, a crawler normally has to download the page, which is wasteful if the page is not in the desired language. With URL-based language classifiers these redundant downloads can be avoided. We apply a variety of machine learning algorithms to the language identification task and evaluate their performance in extensive experiments for five languages: English, French, German, Spanish and Italian. Our best methods achieve an F-measure, averaged over all languages, of around .90 for both a random sample of 1,260 web pages from a large web crawl and for 25k pages from the ODP directory. For 5k pages of web search engine results we even achieve an F-measure of .96. The achieved recall for these collections is .93, .88 and .95, respectively. Two independent human evaluators performed considerably worse on the task, with an F-measure of .75 and a typical recall of a mere .67. Using only country-code top-level domains, such as .de or .fr, yields good precision but a typical recall below .60 and an F-measure of around .68.
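
The country-code baseline mentioned at the end of the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the mapping `CCTLD_TO_LANGUAGE`, the function names, and the example URLs are assumptions made for illustration, and the F-measure is assumed to be the balanced harmonic mean of precision and recall, which the abstract does not state explicitly.

```python
from urllib.parse import urlparse

# Hypothetical ccTLD-to-language table; the abstract names only .de and .fr,
# the remaining entries are assumptions for illustration.
CCTLD_TO_LANGUAGE = {
    "de": "German",
    "fr": "French",
    "es": "Spanish",
    "it": "Italian",
    "uk": "English",
}


def cctld_language(url):
    """Guess a language from the URL's top-level domain, or return None."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    return CCTLD_TO_LANGUAGE.get(tld)


def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall (assumed)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Example URLs are made up; a page under .com gets no prediction at all.
    for url in ("http://www.example.de/seite", "http://www.example.com/fr/page"):
        print(url, cctld_language(url))
```

A rule of this kind makes no prediction for pages hosted under generic domains such as .com or .org, whatever their language, which is consistent with the low recall the abstract reports for this baseline.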