The Web has been recognized as an important part of our cultural heritage, and many nations have started archiving their national web spaces for future generations. A key data acquisition technology employed by these archiving projects is web crawling. Crawling culturally and/or linguistically specific resources from the borderless Web raises many challenging issues. In this paper, we propose language-specific web crawling and evaluate language-specific crawling strategies on a web crawling simulator.