Given only the URL of a Web page, can we identify its language? In this article we examine this question. URL-based language classification is useful when the content of the Web page is not available or when downloading the content would waste bandwidth and time. We built URL-based language classifiers for English, German, French, Spanish, and Italian by applying a variety of algorithms and features. As algorithms, we used machine learning methods that are widely applied to text classification, as well as state-of-the-art algorithms for language identification of text. As features, we used words, n-grams of various sizes, and custom-made features (our novel feature set). We compared our approaches with two baseline methods, namely classification by country code top-level domains and classification by the IP addresses of the hosting Web servers. We trained and tested our classifiers in a 10-fold cross-validation setup on a dataset obtained from the Open Directory Project and from querying a commercial search engine. With the best-performing classifiers, the F1-measure ranged from 94 for English (the lowest) to 98 for German (the highest). We also evaluated the performance of our methods (i) on a set of Web pages written in Adobe Flash and (ii) as part of a language-focused crawler. In the first case, the content of the Web page is hard to extract; in the second, downloading pages of the “wrong” language constitutes a waste of bandwidth. In both settings the best classifiers achieve high accuracy, with an F1-measure between 95 (for English) and 98 (for Italian) for the Adobe Flash pages and a precision between 90 (for Italian) and 97 (for French) for the language-focused crawler.
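To make the feature and baseline ideas concrete, the following Python sketch shows one way a country-code top-level-domain baseline and character n-gram features could be computed from a URL. It is an illustration under stated assumptions, not the authors' implementation: the `CCTLD_LANGUAGE` table, the English fallback for domains without a matching country code, the underscore padding of tokens, and all function names are hypothetical choices made for this example.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical, deliberately incomplete ccTLD-to-language table (assumption).
CCTLD_LANGUAGE = {
    "de": "German", "at": "German",
    "fr": "French",
    "es": "Spanish",
    "it": "Italian",
    "uk": "English", "us": "English",
}


def cctld_baseline(url, default="English"):
    """Guess the language from the last label of the host name.

    Falling back to English for unlisted domains is an assumption
    of this sketch.
    """
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return CCTLD_LANGUAGE.get(tld, default)


def url_tokens(url):
    """Split a URL into lowercase alphabetic tokens ("words")."""
    return re.findall(r"[a-z]+", url.lower())


def char_ngrams(url, n=3):
    """Count character n-grams within each token of the URL.

    Tokens are padded with "_" so that n-grams at the start and end
    of a token are distinguishable (a choice made for this sketch).
    """
    counts = Counter()
    for token in url_tokens(url):
        padded = f"_{token}_"
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


if __name__ == "__main__":
    example = "http://www.example.de/wetter"
    print(cctld_baseline(example))
    print(url_tokens(example))
    print(char_ngrams(example).most_common(5))
```

The token list and n-gram counts produced this way could serve as input features for any standard text classifier, while `cctld_baseline` corresponds to the simpler of the two baselines mentioned in the abstract.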