GIR with language modeling and DFR using Terrier

Authors: Rocio Guillén
Affiliations: California State University San Marcos, San Marcos, CA
Venue: CLEF'08 Proceedings of the 9th Cross-language evaluation forum conference on Evaluating systems for multilingual and multimodal information access
Year: 2008

Abstract

This paper reports on experiments in the Monolingual English, German and Portuguese collection tasks that extend those described in the CLEF2008 Working Notes. The experiments used the language modeling approach and the Divergence From Randomness (DFR) InL2 model as implemented in Terrier (TERabyte RetrIEveR) version 2.1. The purpose was twofold: 1) to compare the two approaches and determine their impact on retrieval performance, and 2) to compare the results with those of the first set of experiments and determine whether query expansion and the presence or absence of diacritic marks affect retrieval performance. The stopword list provided by Terrier was used to index all the collections. We removed diacritic marks from the German and Portuguese topics and collections before indexing and retrieval. Topics were processed automatically, and queries were built from the title and description fields. Query expansion used the 20 top-ranked documents and 40 terms; these parameter values were selected arbitrarily. Results show that the DFR InL2 model outperformed language modeling for all the languages. The new experiments with query expansion show an improvement in retrieval performance for all the languages. They also suggest that removing diacritic marks may have an impact for German and Portuguese.
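
The abstract does not say how the diacritic marks were removed. As an illustration only, the sketch below shows one common way to do it in Python with Unicode decomposition. The function name `strip_diacritics` is hypothetical, and the mapping of each accented letter to its bare base letter (for example "ü" to "u" rather than "ue") is an assumption, not the procedure reported in the paper.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with combining diacritic marks removed.

    Assumption: every accented letter maps to its bare base letter.
    Characters without a canonical decomposition, such as the German
    sharp s, pass through unchanged.
    """
    # NFD splits each accented letter into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep everything else.
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever remains into the usual composed form.
    return unicodedata.normalize("NFC", kept)


if __name__ == "__main__":
    for sample in ("München", "São Paulo", "informação geográfica"):
        print(sample, "->", strip_diacritics(sample))
```

For the comparison described in the abstract to be meaningful, the same normalization would have to be applied to both the topics and the collection, so that query terms and index terms still match after the marks are gone.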