Adapting the tf idf vector-space model to domain specific information retrieval

  • Authors: Claire Fautsch; Jacques Savoy

  • Affiliations: University of Neuchatel, Switzerland; University of Neuchatel, Switzerland

  • Venue: Proceedings of the 2010 ACM Symposium on Applied Computing
  • Year: 2010

Abstract

The default implementation in Lucene, an open-source search engine, is the well-known vector-space model with tf idf weighting. The objective of this paper is to propose and evaluate additional techniques that can be adapted to this search model in order to meet the particular needs of domain-specific information retrieval (IR). We suggest certain specificity measures derived from either information theory or corpus-based linguistics. As an additional feature, we suggest accounting for the number of search terms that a query and the retrieved documents have in common. To integrate these methods, we design and implement four extensions to the classical tf idf model. We then evaluate the new IR models by applying them to four different domain-specific collections and comparing them to the results of a probabilistic retrieval model. The results indicate that the adapted vector-space models clearly outperform the baseline approach (tf idf) and that the performance levels obtained even surpass those of the Okapi model.
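
To make the baseline and the "terms in common" idea concrete, the following is a minimal Python sketch, not the authors' extensions and not Lucene's actual scoring formula. It assumes raw term counts for tf, `ln(N / df)` for idf, whitespace tokenization, and a simple multiplicative factor equal to the fraction of distinct query terms found in the document. The function name `tf_idf_scores` and the parameter `use_coord` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(query, docs, use_coord=True):
    """Score each document against the query with a basic tf idf sum.

    Assumptions (illustrative only): tf is the raw term count and
    idf is ln(N / df). When use_coord is True, each score is multiplied
    by the fraction of distinct query terms that occur in the document.
    """
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    query_terms = set(query.lower().split())
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Query terms that actually appear in this document.
        matched = [t for t in query_terms if tf[t] > 0]
        score = sum(tf[t] * math.log(n_docs / df[t]) for t in matched)
        if use_coord and query_terms:
            # Reward documents that share more query terms with the query.
            score *= len(matched) / len(query_terms)
        scores.append(score)
    return scores


docs = [
    "heart attack treatment",
    "heart heart heart surgery",
    "garden tools",
]
with_coord = tf_idf_scores("heart attack", docs, use_coord=True)
without_coord = tf_idf_scores("heart attack", docs, use_coord=False)
```

In this toy collection, the second document matches only one of the two query terms, so its score is halved when the overlap factor is enabled, while the first document, which matches both terms, keeps its full score.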