Improved stable retrieval in noisy collections

  • Authors:
  • Gianni Amati;Alessandro Celi;Cesidio Di Nicola;Michele Flammini;Daniela Pavone

  • Affiliations:
  • Fondazione Ugo Bordoni, Rome, Italy and Department of Computer Science, University of L'Aquila, L'Aquila, Italy;Fondazione Ugo Bordoni, Rome, Italy and Department of Computer Science, University of L'Aquila, L'Aquila, Italy;Fondazione Ugo Bordoni, Rome, Italy and Department of Computer Science, University of L'Aquila, L'Aquila, Italy;Fondazione Ugo Bordoni, Rome, Italy and Department of Computer Science, University of L'Aquila, L'Aquila, Italy;Fondazione Ugo Bordoni, Rome, Italy and Department of Computer Science, University of L'Aquila, L'Aquila, Italy

  • Venue:
  • ICTIR'11 Proceedings of the Third international conference on Advances in information retrieval theory
  • Year:
  • 2011

Abstract

We consider the problem of retrieval on noisy text collections. This problem is paramount for retrieval on new social media collections such as Twitter, where typical messages are short while the vocabulary is very large and full of emoticon variants, term abbreviations, and other user jargon. In particular, we propose a new methodology that combines several effective techniques, some of them proposed in the OCR information retrieval literature, such as n-gram tokenization and approximate string matching algorithms, which need to be plugged into suitable IR models for retrieval and query reformulation. To evaluate the methodology we use the OCR-degraded collections of the TREC Confusion track. Unlike the solutions proposed by the TREC participants, which were tuned for specific collections and thus exhibit highly variable performance across the different degradation levels, our model is highly stable. In fact, with the same tuned parameters, it reaches the best or nearly the best performance simultaneously on all three Confusion collections (Original, Degrade 5% and Degrade 20%), with a 33% improvement in average MAP. Thus, it is a good candidate for a universal high-precision strategy when there is no a priori knowledge of the specific domain. Moreover, our parameters can be tuned specifically to obtain the best retrieval performance reported to date at all levels of collection degradation, and even on the clean collection, that is, the original collection without OCR errors.
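
The abstract names two generic building blocks, n-gram tokenization and approximate string matching, without giving details. The following is only a minimal illustrative sketch of those two ideas, not the authors' method: the function names (`char_ngrams`, `edit_distance`, `expand_query_term`), the choice of character trigrams, the use of Levenshtein distance, the distance threshold, and the toy vocabulary are all assumptions made for this example.

```python
from typing import List, Set


def char_ngrams(term: str, n: int = 3) -> List[str]:
    """Return the overlapping character n-grams of a term.

    Terms no longer than n are returned whole (an assumption of this sketch).
    """
    if len(term) <= n:
        return [term]
    return [term[i:i + n] for i in range(len(term) - n + 1)]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def expand_query_term(term: str, vocabulary: Set[str], max_dist: int = 1) -> List[str]:
    """Return the vocabulary terms within max_dist edits of the query term, sorted."""
    return sorted(v for v in vocabulary if edit_distance(term, v) <= max_dist)


# Hypothetical vocabulary containing OCR-style corruptions of clean terms.
vocabulary = {"retrieval", "retrieva1", "retneval", "noise", "n0ise"}

print(char_ngrams("noise"))
print(expand_query_term("retrieval", vocabulary))
```

In this sketch a query term is reformulated by adding close variants found in the collection vocabulary, so that a document containing a corrupted form such as `retrieva1` can still match. The linear scan over the vocabulary is for clarity only; a real system would need an index to find candidates efficiently.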