WebKhoj: Indian language IR from multiple character encodings

Authors:
Prasad Pingali; Jagadeesh Jagarlamudi; Vasudeva Varma
Affiliations:
Language Technologies Research Centre, Hyderabad, India; Language Technologies Research Centre, Hyderabad, India; Language Technologies Research Centre, Hyderabad, India
Venue:
Proceedings of the 15th international conference on World Wide Web
Year:
2006

Abstract

Today, web search engines provide the easiest way to reach information on the web. Yet more than 95% of Indian language content on the web is not searchable because web pages use multiple character encodings. Most of these encodings are proprietary and therefore need some form of standardization before the content can be made accessible through a search engine. In this paper we present WebKhoj, a search engine capable of searching multi-script and multi-encoded Indian language content on the web. We describe a language-focused crawler and the transcoding processes used to make Indian language content accessible. Finally, we report some of the experiments we conducted, along with results on Indian language web content.
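
The abstract does not say how the transcoding step works, so the following is only an illustrative sketch of one common approach: converting text in a font-specific encoding to Unicode with a lookup table and greedy longest-match. The table `TOY_FONT_MAP` and the function `transcode` are hypothetical names invented for this example, and the mappings are made up; real proprietary encodings need their own per-font tables, and this should not be read as WebKhoj's actual method.

```python
# Hypothetical glyph-code -> Unicode table for an imaginary proprietary font
# encoding. Real tables are font-specific and are not given in the abstract.
TOY_FONT_MAP = {
    "k": "\u0915",   # DEVANAGARI LETTER KA
    "kh": "\u0916",  # DEVANAGARI LETTER KHA
    "j": "\u091C",   # DEVANAGARI LETTER JA
    "o": "\u094B",   # DEVANAGARI VOWEL SIGN O
}


def transcode(text, table):
    """Convert font-encoded text to Unicode by greedy longest-match lookup."""
    max_len = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so "kh" wins over "k".
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            # Unknown code: pass it through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(transcode("khoj", TOY_FONT_MAP))
```

Once every page is mapped to a single standard encoding in this way, a query typed in Unicode can match content that was originally published in any of the source encodings, which is the accessibility goal the abstract describes.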