LIA at INEX 2010 book track

  • Authors: Romain Deveaud, Florian Boudin, Patrice Bellot

  • Affiliation (all authors): Laboratoire Informatique d'Avignon, University of Avignon, CERI, Avignon Cedex 9

  • Venue: INEX'10 Proceedings of the 9th international conference on Initiative for the evaluation of XML retrieval: comparative evaluation of focused retrieval
  • Year: 2010

Abstract

In this paper we describe our participation and present our contributions in the INEX 2010 Book Track. Digitized books are now a common source of information on the Web; however, OCR sometimes introduces errors that can penalize Information Retrieval. We propose a method for correcting hyphenations in the books and analyse its impact on the Best Books for Reference task. The observed improvement is around 1%. This year we also experimented with different query expansion techniques. The first selects informative words from a Wikipedia page related to the topic. The second uses a dependency parser to detect phrases and enriches the query with them using a Markov Random Field model. We show that there is a significant improvement over the state-of-the-art when using a large weighted list of Wikipedia words, while hyphenation correction has an impact on their distribution over the book corpus.
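
The abstract does not specify how hyphenations are corrected. As an illustration only, the general idea can be sketched in Python under the assumption that a word split by an end-of-line hyphen is rejoined when the joined form appears in a known vocabulary. The function name `fix_hyphenation`, the `vocabulary` parameter, and the lookup rule are hypothetical and are not taken from the paper.

```python
import re

# Hypothetical sketch, not the authors' method: a word fragment followed by
# a hyphen, a line break, and another word fragment.
HYPHEN_BREAK = re.compile(r"(\w+)-\s*\n\s*(\w+)")


def fix_hyphenation(text, vocabulary):
    """Rejoin words split by an end-of-line hyphen.

    A split is repaired only when the joined form is in `vocabulary`
    (an assumed set of lowercase known words); otherwise the original
    text is left untouched, so genuine compounds keep their hyphen.
    """
    def join(match):
        joined = match.group(1) + match.group(2)
        if joined.lower() in vocabulary:
            return joined
        return match.group(0)

    return HYPHEN_BREAK.sub(join, text)


if __name__ == "__main__":
    page = "informa-\ntion retrieval and well-\nknown"
    print(fix_hyphenation(page, {"information", "retrieval"}))
```

The vocabulary check is one possible design choice: it avoids merging real hyphenated compounds that happen to break across lines, at the cost of missing words absent from the vocabulary.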