An approach for adding noise-tolerance to restricted-domain information retrieval

  • Authors:
  • Katia Vila;Josval Díaz;Antonio Fernández;Antonio Ferrández

  • Affiliations:
  • University of Matanzas, Department of Informatics, Matanzas, Cuba;University of Matanzas, Department of Informatics, Matanzas, Cuba;University of Matanzas, Department of Informatics, Matanzas, Cuba;University of Alicante, Department of Software and Computing Systems, Alicante, Spain

  • Venue:
  • NLDB'10 Proceedings of the Natural language processing and information systems, and 15th international conference on Applications of natural language to information systems
  • Year:
  • 2010

Abstract

The corpora of Information Retrieval (IR) systems consist of text documents that often come from heterogeneous sources, such as Web sites or OCR (Optical Character Recognition) systems. Faithfully converting these sources into flat text files is not a trivial task, since spelling or typesetting errors can easily introduce noise. If the corpus is large enough, redundancy helps to control the effects of noise, because the same text often appears both with and without noise throughout the corpus. Conversely, noise becomes a serious problem in restricted-domain IR, where the corpus is usually small and has little or no redundancy. Noise therefore hinders the retrieval task in restricted domains, and erroneous results are likely to be obtained. To overcome this situation, this paper presents an approach that uses restricted-domain resources, such as Knowledge Organization Systems (KOS), to add noise-tolerance to existing IR systems. To show the suitability of the approach in a real restricted-domain case study, a set of experiments has been carried out for the agricultural domain.
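
The abstract does not describe how the KOS is actually used, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that a KOS can be reduced to a flat list of domain terms and that noisy tokens can be mapped back to those terms by approximate string matching. The vocabulary `KOS_TERMS`, the helpers `normalize_token` and `normalize_text`, and the `cutoff` value are all hypothetical and chosen for illustration.

```python
import difflib

# Hypothetical mini-vocabulary standing in for a KOS (e.g. a domain thesaurus).
# A real KOS would be much larger and would also carry relations between terms.
KOS_TERMS = ["maize", "wheat", "irrigation", "fertilizer", "sugarcane"]


def normalize_token(token, vocabulary=KOS_TERMS, cutoff=0.85):
    """Map a possibly noisy token to the closest KOS term.

    Returns the token unchanged when no term is similar enough.
    The cutoff is an assumed similarity threshold, not a tuned value.
    """
    lowered = token.lower()
    if lowered in vocabulary:
        return lowered
    # difflib ranks candidates by SequenceMatcher similarity ratio.
    matches = difflib.get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def normalize_text(text):
    """Normalize every whitespace-separated token of a document or query."""
    return " ".join(normalize_token(tok) for tok in text.split())


if __name__ == "__main__":
    print(normalize_text("fertilzer and irrigaton schedules"))
```

Under these assumptions the normalization could be applied to documents before indexing, to queries before searching, or to both, so that an OCR variant such as "irrigaton" is indexed under the same term as "irrigation". A stricter cutoff reduces the risk of rewriting a legitimate out-of-vocabulary word into a domain term.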