Domain specific search in Indian languages

Authors:
Pattisapu Nikhil Priyatam;Srikanth Reddy Vaddepally;Vasudeva Varma
Affiliations:
IIIT-Hyderabad, Hyderabad, India;IIIT-Hyderabad, Hyderabad, India;IIIT-Hyderabad, Hyderabad, India
Venue:
Proceedings of the first workshop on Information and knowledge management for developing region
Year:
2012

Abstract

Focused crawling has a wide range of applications in Information Retrieval. It is a crucial part of building domain-specific search engines and personalized search tools, and of extending digital libraries. Whether it is Google Scholar for scholarly articles or Google News for news articles, domain-specific search is the most widely acclaimed application of focused crawling. Unfortunately, very few domain-specific search engines are available for Indian languages. Sandhan is one such project; it offers domain-specific search for the tourism and health domains across 10 major Indian languages. The amount of Indian language content on the web is small compared to that in other languages. When we restrict the search space to a specific domain (say, tourism), the probability of finding relevant pages drops further. Hence recall plays a major role in such a scenario. Because Indian language web pages tend to link to pages in other languages, usually English, traditional crawling methods would end up crawling a lot of unnecessary content even with well-chosen seeds. This means that to gain a little recall we need to sacrifice precision and a lot of resources. In this work we explore ways of gathering Indian language tourism and health pages from the web for Sandhan using a language- and domain-specific focused crawler. With this setup we crawl the web extensively for Indian language tourism and health pages. We evaluate the quality of our crawl with three metrics: precision, recall, and harvest ratio. Using our approach we save nearly 80% of resources (disk space, bandwidth, processing time) while maintaining a recall of 0.74 and 0.58 for the tourism and health domains respectively.
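
The abstract names three crawl-quality metrics without defining them. The sketch below shows one common way such metrics can be computed from sets of URLs. The definitions are assumptions for illustration and may differ from those used in the paper: precision is taken over the pages the crawler kept, recall over a ground-truth set of relevant pages, and harvest ratio over every page fetched. The function name `crawl_metrics` and its parameters are hypothetical.

```python
def crawl_metrics(fetched, kept, relevant):
    """Compute (precision, recall, harvest_ratio) for a crawl.

    Assumed definitions (not taken from the paper):
      fetched  -- set of all URLs the crawler downloaded
      kept     -- subset of fetched that the crawler stored as on-topic
      relevant -- ground-truth set of relevant URLs (e.g. from a reference crawl)
    """
    kept_relevant = kept & relevant
    # Precision: fraction of kept pages that are truly relevant.
    precision = len(kept_relevant) / len(kept) if kept else 0.0
    # Recall: fraction of all relevant pages that the crawler kept.
    recall = len(kept_relevant) / len(relevant) if relevant else 0.0
    # Harvest ratio: fraction of all fetched pages that are relevant.
    harvest_ratio = len(fetched & relevant) / len(fetched) if fetched else 0.0
    return precision, recall, harvest_ratio


# Toy example with made-up URL labels.
fetched = {"a", "b", "c", "d", "e"}
kept = {"a", "b", "c"}
relevant = {"a", "b", "d", "f"}
precision, recall, harvest_ratio = crawl_metrics(fetched, kept, relevant)
```

Under these assumed definitions, a crawler that follows many links to off-topic English pages fetches more pages for each relevant one, which lowers the harvest ratio and raises resource cost even when recall stays the same.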