Domain specific search in Indian languages

  • Authors:
  • Pattisapu Nikhil Priyatam;Srikanth Reddy Vaddepally;Vasudeva Varma

  • Affiliations:
  • IIIT-Hyderabad, Hyderabad, India;IIIT-Hyderabad, Hyderabad, India;IIIT-Hyderabad, Hyderabad, India

  • Venue:
  • Proceedings of the first workshop on Information and knowledge management for developing region
  • Year:
  • 2012

Abstract

Focused crawling has a wide range of applications in Information Retrieval. It is a crucial part of building domain-specific search engines, personalized search tools, and extended digital libraries. Whether it is Google Scholar for scholarly articles or Google News for news articles, domain-specific search is the most widely acclaimed application of focused crawling. Unfortunately, very few domain-specific search engines are available for Indian languages. Sandhan is one such project, offering domain-specific search for the tourism and health domains across 10 major Indian languages. The amount of Indian language content on the web is small compared to other languages. When we restrict the search space to a specific domain (say, tourism), the probability of finding relevant pages drops further. Hence recall plays a major role in such a scenario. Because Indian language web pages tend to link to pages in other languages, usually English, traditional crawling methods with well-chosen seeds end up crawling a lot of unnecessary content. This means that gaining a little recall requires sacrificing precision and a lot of resources. In this work we explore ways of gathering Indian language tourism and health pages from the web for Sandhan using a language- and domain-specific focused crawler. With this setup we crawl the web extensively for Indian language tourism and health pages. We evaluate the quality of our crawl using three metrics: precision, recall, and harvest ratio. Using our approach we save nearly 80% of resources (disk space, bandwidth, processing time) while maintaining a recall of 0.74 and 0.58 for the tourism and health domains respectively.
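
The abstract names three evaluation metrics without defining them. As a rough illustration only, the sketch below computes them from sets of URLs using commonly used definitions, which are an assumption here and may differ from the ones the authors apply: precision and recall are measured on the pages the crawler kept against a ground-truth set of relevant pages, and harvest ratio is the fraction of all fetched pages that are relevant. The function name `crawl_metrics` and its parameters are hypothetical and do not come from the paper.

```python
def crawl_metrics(fetched, kept, relevant):
    """Compute precision, recall and harvest ratio for a crawl.

    Assumed (not from the paper) inputs, all sets of URLs:
      fetched  - every page the crawler downloaded
      kept     - pages the crawler judged relevant and stored
      relevant - ground-truth relevant pages in the evaluated collection
    """
    true_positives = len(kept & relevant)

    # Precision: share of kept pages that are actually relevant.
    precision = true_positives / len(kept) if kept else 0.0

    # Recall: share of all relevant pages that the crawler kept.
    recall = true_positives / len(relevant) if relevant else 0.0

    # Harvest ratio: share of downloaded pages that are relevant,
    # a proxy for how much bandwidth was spent on useful content.
    harvest_ratio = len(fetched & relevant) / len(fetched) if fetched else 0.0

    return precision, recall, harvest_ratio


# Toy example with made-up page identifiers.
fetched = {"a", "b", "c", "d", "e"}
kept = {"a", "b", "c"}
relevant = {"a", "b", "d", "f"}
precision, recall, harvest_ratio = crawl_metrics(fetched, kept, relevant)
```

Under these assumed definitions, a crawler can raise recall by fetching more broadly, but its harvest ratio falls as the share of irrelevant downloads grows, which is the trade-off the abstract describes.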