A Comprehensive Study of Techniques for URL-Based Web Page Language Classification

Authors:
Eda Baykan;Monika Henzinger;Ingmar Weber
Affiliations:
Izmir University;University of Vienna;Yahoo! Research Barcelona
Venue:
ACM Transactions on the Web (TWEB)
Year:
2013

Abstract

Given only the URL of a Web page, can we identify its language? In this article we examine this question. URL-based language classification is useful when the content of the Web page is not available or when downloading the content would waste bandwidth and time. We built URL-based language classifiers for English, German, French, Spanish, and Italian by applying a variety of algorithms and features. As algorithms we used machine learning algorithms that are widely applied to text classification, as well as state-of-the-art algorithms for language identification of text. As features we used words, n-grams of various sizes, and custom-made features (our novel feature set). We compared our approaches with two baseline methods, namely classification by country code top-level domain and classification by the IP address of the hosting Web server. We trained and tested our classifiers in a 10-fold cross-validation setup on a dataset obtained from the Open Directory Project and from querying a commercial search engine. With the best-performing classifiers, the F1-measure ranged from 94 (English, the lowest) to 98 (German, the highest). We also evaluated the performance of our methods (i) on a set of Web pages written in Adobe Flash and (ii) as part of a language-focused crawler. In the first case, the content of the Web page is hard to extract; in the second case, downloading pages of the “wrong” language constitutes a waste of bandwidth. In both settings the best classifiers achieve high accuracy, with an F1-measure between 95 (for English) and 98 (for Italian) for the Adobe Flash pages and a precision between 90 (for Italian) and 97 (for French) for the language-focused crawler.
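
To make the country code top-level domain baseline and the word and n-gram features mentioned in the abstract more concrete, the following is a minimal Python sketch. It is not the authors' implementation. The ccTLD-to-language table, the function names, the tokenization rule (runs of ASCII letters), and the choice of trigrams taken within each token are all assumptions made here for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical ccTLD-to-language table (an assumption for illustration,
# not the mapping used in the article).
CCTLD_TO_LANGUAGE = {
    "de": "German",
    "fr": "French",
    "es": "Spanish",
    "it": "Italian",
    "uk": "English",
}


def cctld_baseline(url):
    """Guess a language from the URL's country code TLD, or return None."""
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    return CCTLD_TO_LANGUAGE.get(tld)


def url_tokens(url):
    """Split a URL into lowercase word tokens (runs of ASCII letters only)."""
    return re.findall(r"[a-z]+", url.lower())


def char_ngrams(url, n=3):
    """Count character n-grams inside each token of the URL."""
    counts = Counter()
    for token in url_tokens(url):
        for i in range(len(token) - n + 1):
            counts[token[i:i + n]] += 1
    return counts


if __name__ == "__main__":
    example = "http://www.beispiel.de/nachrichten/wetter.html"
    print(cctld_baseline(example))   # baseline guess from the ccTLD
    print(url_tokens(example))       # word features
    print(char_ngrams(example, 3))   # trigram features
```

The token and n-gram counts could then be passed to any standard text classifier. Generic domains such as .com give the ccTLD baseline nothing to work with, which is one reason features drawn from the rest of the URL are worth considering.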