Information retrieval and large text structured corpora

  • Authors:
  • Fco. Mario Barcala;Miguel A. Molinero;Eva Domínguez

  • Affiliations:
  • Centro Ramón Piñeiro, Santiago de Compostela, Spain;Depto. de Informática, Universidade de Vigo, Ourense, Spain;Centro Ramón Piñeiro, Santiago de Compostela, Spain

  • Venue:
  • EUROCAST'05 Proceedings of the 10th international conference on Computer Aided Systems Theory
  • Year:
  • 2005

Abstract

Conventional Information Retrieval Systems (IRSs), also called text indexers, deal with plain text documents or documents with a very elementary structure. Systems of this kind resolve queries very efficiently, but they cannot take into account the tags that mark different sections, or at best their ability to do so is very limited. In contrast, the documents that make up a corpus nowadays often have a rich structure: they are encoded in XML (Extensible Markup Language)[1] or in some other format that can be converted to XML more or less easily. A classical IRS built over such a corpus therefore cannot exploit this structure to improve its results. In addition, several of these corpora are very large, comprising hundreds or thousands of documents which in turn contain millions or hundreds of millions of words. There is therefore a need for efficient and flexible IRSs that work with large structured corpora.
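
To make the contrast with plain text indexing concrete, the following is a minimal sketch of a structure-aware inverted index. It is not the system described in the paper. The function names `build_index` and `search`, the sample documents, and the choice to key postings by (tag, term) are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_index(docs):
    """Build a toy inverted index mapping (tag, term) -> set of document ids.

    Hypothetical helper: `docs` maps a document id to an XML string.
    """
    index = defaultdict(set)
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        for elem in root.iter():
            # itertext() includes text of nested elements, so a term is
            # indexed under its own tag and under every ancestor tag.
            text = " ".join(elem.itertext())
            for term in re.findall(r"\w+", text.lower()):
                index[(elem.tag, term)].add(doc_id)
    return index


def search(index, term, tag=None):
    """Return ids of documents containing `term`, optionally only inside `tag`."""
    term = term.lower()
    if tag is not None:
        return set(index.get((tag, term), set()))
    # No tag given: behave like a plain text indexer and ignore structure.
    hits = set()
    for (_, indexed_term), doc_ids in index.items():
        if indexed_term == term:
            hits |= doc_ids
    return hits


# Invented sample documents, for illustration only.
docs = {
    "d1": "<article><title>Galician corpus</title>"
          "<body>Tagged text about retrieval</body></article>",
    "d2": "<article><title>Retrieval systems</title>"
          "<body>Plain text about a corpus</body></article>",
}
index = build_index(docs)
print(search(index, "retrieval"))               # both documents match
print(search(index, "retrieval", tag="title"))  # only d2 matches
```

A query without a tag matches any document containing the term, as a conventional text indexer would, while a query restricted to `title` uses the markup to narrow the result. A sketch like this keeps everything in memory, so it says nothing about the efficiency concerns that corpora of hundreds of millions of words raise.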