Query translation by text categorization

Authors:
Patrick Ruch
Affiliations:
SIM, University Hospital of Geneva, Geneva, Switzerland and LITH, Swiss Federal Institute of Technology, Lausanne, Switzerland
Venue:
COLING '04 Proceedings of the 20th international conference on Computational Linguistics
Year:
2004

Abstract

We report on the development of a cross-language information retrieval (CLIR) system that translates user queries by categorizing them into terms listed in a controlled vocabulary. Unlike typical automatic text categorization systems, which rely on data-intensive models induced from large training sets, our categorization tool applies data-independent classifiers: a vector-space engine and a pattern matcher are combined to improve the ranking of Medical Subject Headings (MeSH). The categorizer also benefits from the availability of large thesauri, in which variants of MeSH terms can be found. For evaluation, we use OHSUMED, an English collection of MedLine records. French OHSUMED queries (translated from the original English queries by domain experts) are mapped to French MeSH terms; the MeSH controlled vocabulary then serves as an interlingua to translate the French MeSH terms into English MeSH terms, which are finally used to query the OHSUMED document collection. The first part of the study focuses on the text-to-MeSH categorization task, using a set of MedLine abstracts as input documents to tune the categorization system. The second part compares a machine-translation-based CLIR system with the categorization-based system: the former yields a CLIR ratio close to 60%, while the latter achieves a ratio above 80%. A final experiment, which combines both approaches, achieves a ratio above 90%.
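
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the system evaluated in the paper: the three-entry vocabulary, the concept IDs (C1 to C3), the function names, the cosine scoring, and the equal weighting (alpha = 0.5) are all assumptions chosen for illustration. The sketch ranks French term variants against a French query with a simple vector-space score and an exact phrase matcher, combines the two scores, and then uses the shared concept ID as the interlingua to return the English terms.

```python
import math
import re
from collections import Counter

# Toy bilingual vocabulary keyed by a shared concept ID.
# Entries and IDs are illustrative assumptions, not real MeSH records.
VOCAB = {
    "C1": {"fr": ["hypertension", "pression artérielle élevée"], "en": "Hypertension"},
    "C2": {"fr": ["diabète", "diabète sucré"], "en": "Diabetes Mellitus"},
    "C3": {"fr": ["grossesse"], "en": "Pregnancy"},
}


def tokens(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token lists, using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def vector_score(query, variants):
    """Vector-space component: best cosine match over all term variants."""
    q = tokens(query)
    return max(cosine(q, tokens(v)) for v in variants)


def pattern_score(query, variants):
    """Pattern-matching component: 1.0 if any variant occurs as a whole phrase."""
    padded_query = " " + " ".join(tokens(query)) + " "
    return 1.0 if any(
        " " + " ".join(tokens(v)) + " " in padded_query for v in variants
    ) else 0.0


def translate_query(query, k=2, alpha=0.5):
    """Map a French query to at most k English terms via shared concept IDs."""
    scored = []
    for concept_id, entry in VOCAB.items():
        score = alpha * vector_score(query, entry["fr"]) + (
            1 - alpha
        ) * pattern_score(query, entry["fr"])
        if score > 0:
            scored.append((score, concept_id))
    # Highest combined score first; ties keep vocabulary order.
    scored.sort(key=lambda pair: -pair[0])
    return [VOCAB[concept_id]["en"] for _, concept_id in scored[:k]]


if __name__ == "__main__":
    print(translate_query("traitement de l'hypertension pendant la grossesse"))
```

For the example query, both the hypertension and pregnancy concepts match, so their English terms are returned and could then be submitted to an English-language search engine. A real system would replace the toy vocabulary with the full thesaurus and tune the weighting on held-out data.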