A unified approach for artificial intelligence and information retrieval

  • Authors: S K Wong; W Ziarko
  • Venue: ACM SIGIR Forum
  • Year: 1986

Abstract

In the past, several mathematical models for document retrieval systems have been developed [C82, S83, S83a, T76, WO84]. These models are used to formally represent the basic characteristics, functional components, and retrieval processes of document retrieval systems. The two basic categories of models that have been employed in information retrieval are the vector processing models and the Boolean retrieval models.

In the conventional vector space model (VSM), proposed by Salton [S71, S83], index terms are basic vectors in a vector space. Each document or query is represented as a linear combination of these basic term vectors. The retrieval operation consists of computing the cosine similarity between a given query vector and each document vector, and then ranking the documents accordingly. In this approach, the occurrence frequency of a term in a document is interpreted as the component of the document vector along the corresponding basic term vector.

The advantages of this model are that it is simple yet powerful. The vector operations can be performed efficiently enough to handle very large collections. Furthermore, it has been shown that its retrieval effectiveness is significantly higher than that of the Boolean retrieval models. However, this vector model has been incorporated into very few commercial systems.

In strict Boolean retrieval systems [BU81, P84], the user query normally consists of index terms connected by the Boolean operators AND, OR, and NOT. The advantage of Boolean connectives is that they give the user query a clearer structure. The major problem in such a system is that there is no provision for associating weights of importance with the terms assigned to either the documents or the queries. In other words, the representation is binary, indicating only the presence or absence of each index term. The output obtained in response to a query is not ranked in any order of presumed importance to the user. In most cases, the AND connectives tend to be too restrictive [BU81]. Most commercially available retrieval systems essentially conform to this model.

One of the challenges for researchers in information retrieval has been to achieve greater acceptance of the vector processing models in commercial systems. The main difficulty is the inability of vector processing systems to handle Boolean queries. In recent years, some progress has been made in expressing Boolean queries as vectors [S83a, S83b]. If attractive ways to achieve this are advanced, it would then be possible to modify existing systems to use vector processing techniques without a great deal of cost and effort.

Another problem with the conventional vector space model is that it assumes that term vectors are orthogonal. It is generally agreed that terms are correlated, so the model must be generalized to incorporate term correlations. A vector processing model termed the GVSM [WO84a, WO85] was proposed in response to this need. In the GVSM, queries are assumed to be presented as a list of terms and corresponding weights. Thus, no provision is made for processing Boolean queries. However, the premises of the model naturally lead to a scheme for handling Boolean queries. In this paper we present the details of this scheme. This result will help achieve the aim of integrating vector processing capabilities into existing systems that use Boolean retrieval models.
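
The contrast between the two model families can be illustrated with a minimal sketch. It is not the scheme presented in the paper, and it assumes orthogonal term vectors as in the conventional VSM, so it does not model the term correlations of the GVSM. The three-document collection, the term weights, and the function names `cosine`, `rank`, and `boolean_and` are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy collection: each document maps index terms to
# occurrence frequencies (its components along the term vectors).
DOCS = {
    "d1": {"vector": 3, "retrieval": 1},
    "d2": {"boolean": 2, "retrieval": 2},
    "d3": {"vector": 1, "boolean": 1, "retrieval": 1},
}

# Hypothetical weighted query in the same term space.
QUERY = {"vector": 1, "retrieval": 1}


def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors.

    Terms are treated as orthogonal, as in the conventional VSM.
    """
    dot = sum(weight * v.get(term, 0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return document ids ordered by decreasing cosine similarity."""
    scored = [(cosine(query, vec), doc_id) for doc_id, vec in docs.items()]
    # Sort by score descending, breaking ties by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


def boolean_and(terms, docs):
    """Strict Boolean AND: the unranked set of documents containing every term.

    Term frequencies are ignored; only presence or absence matters.
    """
    return {
        doc_id
        for doc_id, vec in docs.items()
        if all(vec.get(term, 0) > 0 for term in terms)
    }


if __name__ == "__main__":
    print(rank(QUERY, DOCS))
    print(boolean_and(["vector", "boolean"], DOCS))
```

In this sketch the vector model orders every document by its similarity to the query, whereas the strict AND query returns an unranked set and discards any document that lacks even one of the query terms.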