Improved stable retrieval in noisy collections

Authors:
Gianni Amati;Alessandro Celi;Cesidio Di Nicola;Michele Flammini;Daniela Pavone
Affiliations:
Fondazione Ugo Bordoni, Rome, Italy and Department of Computer Science, University of L'Aquila, L'Aquila, Italy;Fondazione Ugo Bordoni, Rome, Italy and Department of Computer Science, University of L'Aquila, L'Aquila, Italy;Fondazione Ugo Bordoni, Rome, Italy and Department of Computer Science, University of L'Aquila, L'Aquila, Italy;Fondazione Ugo Bordoni, Rome, Italy and Department of Computer Science, University of L'Aquila, L'Aquila, Italy;Fondazione Ugo Bordoni, Rome, Italy and Department of Computer Science, University of L'Aquila, L'Aquila, Italy
Venue:
ICTIR'11 Proceedings of the Third international conference on Advances in information retrieval theory
Year:
2011

Abstract

We consider the problem of retrieval on noisy text collections. This problem is central to retrieval on new social media collections such as Twitter, where typical messages are short while the vocabulary is very large, full of emoticon variants, abbreviated terms and other kinds of user jargon. In particular, we propose a new methodology that combines several effective techniques, some of them proposed in the OCR information retrieval literature, such as n-gram tokenization and approximate string matching algorithms, which must be plugged into suitable IR models and query reformulation. To evaluate the methodology we use the OCR-degraded collections of the TREC Confusion track. Unlike the solutions proposed by the TREC participants, which were tuned for specific collections and therefore show highly variable performance across the different degradation levels, our model is highly stable. In fact, with the same tuned parameters, it reaches the best or nearly best performance simultaneously on all three Confusion collections (Original, Degrade 5% and Degrade 20%), with a 33% improvement on the average MAP measure. It is therefore a good candidate for a universal high-precision strategy to be used when no a priori knowledge of the specific domain is available. Moreover, our parameters can be tuned specifically to obtain the best retrieval performance reported to date at all levels of collection degradation, and even on the clean collection, that is, the original collection without OCR errors.
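The abstract names n-gram tokenization and approximate string matching without giving details, so the following is only a minimal Python sketch of those two generic techniques, not the authors' method. The function names (`char_ngrams`, `edit_distance`, `expand_query_term`), the trigram size, the edit-distance threshold and the use of Levenshtein distance are all assumptions made for illustration.

```python
def char_ngrams(term, n=3):
    """Return the overlapping character n-grams of a term (no padding).

    n=3 is an illustrative default, not a value taken from the paper.
    """
    if len(term) <= n:
        return [term]
    return [term[i:i + n] for i in range(len(term) - n + 1)]


def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def expand_query_term(term, vocabulary, max_dist=1):
    """Hypothetical query reformulation step: collect vocabulary entries
    within max_dist edits of the query term, in sorted order."""
    return sorted(w for w in vocabulary if edit_distance(term, w) <= max_dist)


if __name__ == "__main__":
    noisy_vocabulary = {"retrieval", "retrieva1", "retrival", "model"}
    print(char_ngrams("retrieval"))
    print(expand_query_term("retrieval", noisy_vocabulary))
```

In this sketch, a query term such as "retrieval" is expanded with OCR-style variants like "retrieva1" found in the collection vocabulary. The linear scan over the vocabulary is chosen for clarity; a real system would need an index to do this at scale.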