The TREC-5 Confusion Track: Comparing Retrieval Methods for Scanned Text

Authors: Paul B. Kantor; Ellen M. Voorhees
Affiliations: Department of Library and Information Science, Rutgers University, 4 Huntington St., New Brunswick, NJ 08901, USA; National Institute of Standards and Technology (NIST), Gaithersburg, MD 20899, USA
Venue: Information Retrieval
Year: 2000

Abstract

A known-item search is an information retrieval task in which the system is asked to find a single target document in a large document set. The TREC-5 confusion track used a set of 49 known-item tasks to study the impact of data corruption on retrieval system performance. Two corrupted versions of a 55,600-document corpus, whose true content was known, were created by applying OCR to page images. The first version used the page images as scanned, giving an estimated character error rate of approximately 5%. The second used down-sampled page images, giving an estimated character error rate of approximately 20%. The true text and each corrupted version were then searched using the same set of 49 questions. In general, retrieval methods that attempted a probabilistic reconstruction of the original clean text fared better than methods that simply accepted corrupted versions of the query text.
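
To make the "character error rate" figures above concrete, here is a minimal Python sketch of one common way to compute such a rate: the edit (Levenshtein) distance between the true text and the OCR output, divided by the length of the true text. The abstract does not say how the track's 5% and 20% estimates were obtained, so this definition is an assumption, and the names `edit_distance` and `character_error_rate` are hypothetical.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions, and substitutions that turn reference into hypothesis."""
    # previous[j] holds the distance between the reference prefix seen
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete ref_char
                current[j - 1] + 1,      # insert hyp_char
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Assumed definition: edit distance normalized by the length of the
    true (reference) text. Not necessarily the track's own estimator."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    true_text = "information retrieval"
    ocr_text = "inforrnation retrieva1"  # made-up OCR confusions: m->rn, l->1
    print(f"{character_error_rate(true_text, ocr_text):.3f}")
```

Under this definition, a 5% rate corresponds to roughly one character edit per twenty characters of true text, and a 20% rate to roughly one edit per five characters.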