Transliteration based search engine for multilingual information access

Authors:
Anand Arokia Raj; Harikrishna Maganti
Affiliations:
Bhrigus Software (I) Pvt Ltd, Hyderabad, India;Bhrigus Software (I) Pvt Ltd, Hyderabad, India
Venue:
CLIAWS3 '09 Proceedings of the Third International Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societies
Year:
2009

Abstract

Most Internet content in Indian languages exists in a variety of encodings, which makes it difficult to find through search engines. In the Indian context, the majority of web pages are either not searchable or not retrieved efficiently by search engines, for three reasons: (1) Websites are authored using multiple text encodings. (2) Although Indian languages share a common phonetic base, words common to several languages, such as loan words (borrowed from languages like Sanskrit, Urdu, or English), transliterated terms, and pronouns, cannot be searched across languages. (3) Query input is a further major problem: most users have little familiarity with typing in their native language and prefer to access information through English-based transliteration. This paper addresses all of these problems and presents a transliteration-based search engine (inSearch) capable of searching web content in 10 Indian languages across multiple scripts and encodings.
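
Point (2) can be illustrated with a small sketch. It is not the inSearch implementation, which the abstract does not describe. It assumes the text has already been converted to Unicode, and it relies on the fact that the Unicode blocks from Devanagari (U+0900) through Malayalam (U+0D7F) follow a parallel, ISCII-derived layout, so that corresponding letters usually sit at the same offset in each 128-code-point block. The function names (script_neutral_key, build_index, search) are hypothetical.

```python
# Illustrative sketch only; not the method used by inSearch.
INDIC_START = 0x0900  # first code point of the Devanagari block
INDIC_END = 0x0D7F    # last code point of the Malayalam block
BLOCK_SIZE = 0x80     # each script block in this range spans 128 code points


def script_neutral_key(word: str) -> tuple:
    """Replace each Indic character by its offset inside its script block."""
    key = []
    for ch in word:
        cp = ord(ch)
        if INDIC_START <= cp <= INDIC_END:
            key.append((cp - INDIC_START) % BLOCK_SIZE)
        else:
            key.append(ch)  # leave non-Indic characters unchanged
    return tuple(key)


def build_index(docs: dict) -> dict:
    """Build an inverted index from script-neutral keys to document ids."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(script_neutral_key(word), set()).add(doc_id)
    return index


def search(index: dict, query: str) -> set:
    """Return ids of documents containing the query word in any script."""
    return index.get(script_neutral_key(query), set())


docs = {
    "hindi_page": "\u0930\u093e\u092e",   # RA, vowel sign AA, MA in Devanagari
    "telugu_page": "\u0c30\u0c3e\u0c2e",  # the same three letters in Telugu
}
index = build_index(docs)
hits = search(index, "\u0bb0\u0bbe\u0bae")  # the same three letters in Tamil
```

Here a Tamil query matches pages written in Devanagari and Telugu because all three spellings reduce to the same offset sequence. Real content would need more care: some letters have no counterpart in every script, and spelling conventions differ between languages, so offset matching alone is only a first approximation.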