An approach for adding noise-tolerance to restricted-domain information retrieval

Authors:
Katia Vila;Josval Díaz;Antonio Fernández;Antonio Ferrández
Affiliations:
University of Matanzas, Department of Informatics, Matanzas, Cuba;University of Matanzas, Department of Informatics, Matanzas, Cuba;University of Matanzas, Department of Informatics, Matanzas, Cuba;University of Alicante, Department of Software and Computing Systems, Alicante, Spain
Venue:
NLDB'10 Proceedings of the Natural language processing and information systems, and 15th international conference on Applications of natural language to information systems
Year:
2010

Abstract

The corpora of Information Retrieval (IR) systems consist of text documents that often come from heterogeneous sources, such as Web sites or OCR (Optical Character Recognition) systems. Faithfully converting these sources into flat text files is not a trivial task, since noise is easily introduced through spelling or typesetting errors. If the corpus is large enough, redundancy helps control the effects of noise, because the same text often appears both with and without noise throughout the corpus. Conversely, noise becomes a serious problem in restricted-domain IR, where the corpus is usually small and has little or no redundancy. Noise therefore hinders the retrieval task in restricted domains, and erroneous results are likely. To overcome this situation, this paper presents an approach that uses restricted-domain resources, such as Knowledge Organization Systems (KOS), to add noise tolerance to existing IR systems. To show the suitability of the approach in a real restricted-domain case study, a set of experiments has been carried out in the agricultural domain.
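
The abstract does not spell out how KOS terms are matched against noisy text. As a rough illustration only, the sketch below assumes one plausible strategy: mapping each token to the closest term in a domain vocabulary by string similarity. The term list `KOS_TERMS`, the helper names `normalize_token` and `normalize_text`, and the `cutoff` value are all hypothetical and are not taken from the paper.

```python
from difflib import get_close_matches

# Hypothetical sample of agricultural terms; a real KOS would supply
# a much larger controlled vocabulary.
KOS_TERMS = ["irrigation", "fertilizer", "sugarcane", "pesticide", "harvest"]


def normalize_token(token, vocabulary=KOS_TERMS, cutoff=0.8):
    """Map a possibly noisy token to the closest KOS term, if any.

    Assumption: similarity is measured with difflib's ratio; the paper
    may use a different matching technique.
    """
    lowered = token.lower()
    if lowered in vocabulary:
        return lowered
    matches = get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
    # Leave the token untouched when no term is similar enough.
    return matches[0] if matches else token


def normalize_text(text, vocabulary=KOS_TERMS, cutoff=0.8):
    """Apply normalize_token to every whitespace-separated token."""
    return " ".join(normalize_token(t, vocabulary, cutoff) for t in text.split())
```

Under these assumptions, such a normalization step could run over documents before indexing or over queries before search, so that an OCR error such as "irrigati0n" is treated like the clean domain term. A high cutoff keeps ordinary words from being rewritten into domain terms by accident.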