Geometric and topological approaches to semantic text retrieval

  • Authors:
  • Chung-Ping Kwong;Dandan Li

  • Affiliations:
  • The Chinese University of Hong Kong (Hong Kong);The Chinese University of Hong Kong (Hong Kong)

  • Venue:
  • Geometric and topological approaches to semantic text retrieval
  • Year:
  • 2007

Abstract

With the vast amount of textual information available today, designing effective and efficient retrieval methods has become both more important and more complex. The Basic Vector Space Model (BVSM) is well known in information retrieval. Unfortunately, it cannot retrieve all relevant documents because it relies on literal term matching. The Generalized Vector Space Model (GVSM) and Latent Semantic Indexing (LSI) are two well-known semantic retrieval methods, both of which assume some underlying latent semantic structure in the dataset. However, their assumptions about where that semantic structure lies are rather strong. Moreover, the performance of LSI can vary considerably across datasets, and it is not yet fully understood which characteristics of a dataset contribute to this difference, or why. The present thesis focuses on providing answers to these two questions. In the first part of this thesis, we present a new understanding of the latent semantic space of a dataset from the dual perspective, which relaxes the assumptions above and leads naturally to a unified kernel function for a class of vector space models. New semantic analysis methods based on the unified kernel function are developed, which combine the advantages of LSI and GVSM. We also show that the new methods are stable with respect to the choice of rank: even if the selected rank is quite far from the optimal one, retrieval performance does not degrade much. The experimental results of our methods on standard test sets are promising. In the second part of this thesis, we propose that the mathematical structure of simplexes can be attached to a term-document matrix in the vector space model (VSM) for information retrieval. The Q-analysis devised by R. H. Atkin may then be applied to analyze the topological structure of the simplexes and their corresponding dataset.
The experimental results of this analysis reveal a correlation between the effectiveness of LSI and the topological structure of the dataset. Using the information obtained from the topological analysis, we develop a new query expansion method. Experimental results show that our method can enhance the performance of VSM on datasets for which LSI is not effective. Finally, the notion of homology is introduced into the topological analysis of datasets, and its possible relation to word sense disambiguation is studied through a simple example.
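
To make the idea of attaching simplexes to a term-document matrix more concrete, the following is a minimal sketch of a textbook-style Q-analysis, not the procedure used in the thesis. It assumes a binary incidence matrix in which each document is a simplex whose vertices are the terms it contains; the thesis may construct the simplexes differently, for example from a weighted matrix or from the dual (term) view. Two simplexes are treated as q-near when they share at least q + 1 vertices, and q-connected components are the groups linked by chains of q-near simplexes. The function names and the toy matrix `TOY` are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical toy incidence matrix: rows are documents, columns are terms.
# TOY[i][j] == 1 means document i contains term j (binary weights assumed).
TOY = [
    [1, 1, 1, 0, 0],  # document 0: terms 0, 1, 2
    [0, 1, 1, 1, 0],  # document 1: terms 1, 2, 3
    [0, 0, 0, 1, 1],  # document 2: terms 3, 4
    [0, 0, 0, 0, 1],  # document 3: term 4
]


def shared_face_dims(incidence):
    """Return the matrix of shared-face dimensions between simplexes.

    Entry (i, k) is (number of terms common to documents i and k) - 1,
    so -1 means no common term. The diagonal entry (i, i) is the
    dimension of simplex i itself.
    """
    n = len(incidence)
    return [
        [sum(a & b for a, b in zip(incidence[i], incidence[k])) - 1
         for k in range(n)]
        for i in range(n)
    ]


def q_components(incidence, q):
    """Group simplexes of dimension >= q into q-connected components."""
    dims = shared_face_dims(incidence)
    nodes = [i for i in range(len(incidence)) if dims[i][i] >= q]
    parent = {i: i for i in nodes}

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair of simplexes that share a face of dimension >= q.
    for i, k in combinations(nodes, 2):
        if dims[i][k] >= q:
            parent[find(i)] = find(k)

    groups = {}
    for i in nodes:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def structure_vector(incidence):
    """Count q-connected components for q = top dimension down to 0."""
    top = max(sum(row) for row in incidence) - 1
    return [len(q_components(incidence, q)) for q in range(top, -1, -1)]


if __name__ == "__main__":
    print(q_components(TOY, 1))   # [[0, 1], [2]]
    print(structure_vector(TOY))  # [2, 2, 1]
```

In this toy case the documents form a single component at q = 0 but split apart at higher q, which is the kind of connectivity profile a structure vector summarizes. How such a profile relates to LSI effectiveness is the subject of the thesis and is not reproduced by this sketch.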