We consider the problem of retrieval on noisy text collections. This is a central problem for retrieval on new social media collections, such as Twitter, where typical messages are short while the vocabulary is very large, full of emoticon variants, abbreviations, and other kinds of user jargon. In particular, we propose a new methodology that combines several effective techniques, some of them proposed in the OCR information retrieval literature, such as n-gram tokenization and approximate string matching algorithms, which need to be plugged into suitable IR models for retrieval and query reformulation. To evaluate the methodology we use the OCR-degraded collections of the TREC Confusion track. Unlike the solutions proposed by the TREC participants, which were tuned for specific collections and thus exhibit highly variable performance across the different degradation levels, our model is highly stable. In fact, with the same tuned parameters, it reaches the best or nearly best performance simultaneously on all three Confusion collections (Original, Degrade 5% and Degrade 20%), with a 33% improvement on the average MAP measure. Thus, it is a good candidate for a universal high-precision strategy to be used when there is no a priori knowledge of the specific domain. Moreover, our parameters can be tuned specifically to obtain the best retrieval performance to date at all levels of collection degradation, and even on the clean collection, that is, the original collection without the OCR errors.
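
As a rough illustration of two of the techniques named above, the following Python sketch shows character n-gram tokenization and approximate string matching by edit distance, used to expand a query term with noisy variants found in a vocabulary. It is not the authors' implementation: the function names (`char_ngrams`, `edit_distance`, `expand_query_term`), the n-gram length of 3, and the distance threshold of 1 are all assumptions chosen for the example, and the abstract does not state which matching algorithm or parameters the paper uses.

```python
def char_ngrams(term, n=3):
    """Return the overlapping character n-grams of a term.

    Terms no longer than n are returned whole (an assumption of this sketch).
    """
    if len(term) <= n:
        return [term]
    return [term[i:i + n] for i in range(len(term) - n + 1)]


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def expand_query_term(term, vocabulary, max_dist=1):
    """Hypothetical query reformulation step: collect vocabulary entries
    within max_dist edits of the query term, in sorted order."""
    return sorted(w for w in vocabulary if edit_distance(term, w) <= max_dist)


if __name__ == "__main__":
    # Toy vocabulary containing an OCR-style corruption of "retrieval".
    vocab = {"retrieval", "retrieva1", "retneval", "index"}
    print(char_ngrams("retrieval"))
    print(expand_query_term("retrieval", vocab))
```

In this toy setting the corrupted form `retrieva1` is one substitution away from `retrieval` and is picked up by the expansion, while `retneval` needs two edits and is excluded at the assumed threshold. A linear scan over the vocabulary is used only for brevity; a real system would need an index to make approximate matching practical on a large dictionary.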