Techniques for improved LSI text retrieval

  • Authors:
  • William Grosky;Farshad Fotouhi;Hua Yan

  • Affiliations:
  • Wayne State University;Wayne State University;Wayne State University

  • Venue:
  • Techniques for improved LSI text retrieval
  • Year:
  • 2006

Abstract

This work identifies and studies four major issues in LSI (Latent Semantic Indexing) text retrieval: the multiplicity of standard query methods, alternative non-standard query methods, the issue of Generic Terms, and the lack of Structural Data. First, three commonly used standard query methods (versions A, B, and B') are identified, compared, analyzed, and tested. Both mathematical analysis and experimental results reveal that version B is a better choice than version A, and that versions B and B' are essentially equivalent provided that the Equivalency Principle is satisfied. This finding should eliminate the confusion and arbitrariness that arise when LSI researchers apply possibly incompatible query methods, and help restore the comparability of their work. Second, some novel non-standard query methods using the discovered technique of singular value rescaling (SVR) are proposed and studied. Test results in both the prototype experimental environments and the standardized TREC data sets confirm the effectiveness of SVR. The practical significance of this finding is that current information retrieval techniques may be significantly improved simply by adopting a novel query method that is computationally as efficient as the best standard query method. Third, this work studies the effects on LSI models of Generic Terms, a minority group of terms that are distributed relatively uniformly across all document topics. Generic Terms are characterized and defined, and an iterative algorithm is designed and implemented to identify them. Experimental results strongly suggest that identifying and excluding Generic Terms helps improve LSI text retrieval performance. Fourth, this work studies how to integrate Structural Data (loosely defined as sentence structure) into the LSI models.
Four major characteristics of Structural Data are identified: derivativity, maneuverability, language dependency, and updatability/downdatability. The qualifications of two candidate forms of Structural Data, namely word order and non-word-order syntax (both for the English language), are carefully studied. A complete series of procedures is developed to fully integrate Structural Data (in its most qualified form, word order data) into the LSI models. Experimental results strongly suggest that acquiring and integrating Structural Data helps improve LSI text retrieval performance.
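
For readers unfamiliar with how an LSI query is evaluated, the following is a minimal sketch of one common textbook formulation, not the authors' versions A, B, or B', whose definitions are not given in this abstract. It assumes a term-by-document matrix with terms in rows, a rank-k truncated SVD, and cosine similarity in the latent space. The function names and the `alpha` exponent are hypothetical: `alpha` only illustrates the general idea of reweighting latent dimensions by a power of the singular values and should not be read as the SVR method defined in the work itself.

```python
import numpy as np

def build_lsi(term_doc, k):
    """Rank-k truncated SVD of a term-by-document matrix (terms in rows)."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep the k largest singular triplets: U_k, singular values, V_k.
    return U[:, :k], s[:k], Vt[:k, :].T

def rank_documents(query, U_k, s_k, V_k, alpha=1.0):
    """Rank documents by cosine similarity to the query in the latent space.

    alpha is a hypothetical exponent applied to the singular values
    (an assumption for illustration). With alpha=1.0 each document's
    coordinates equal the projection of its term vector onto U_k.
    """
    doc_coords = V_k * (s_k ** alpha)   # one row per document
    q_coords = query @ U_k              # project the query onto U_k
    denom = np.linalg.norm(doc_coords, axis=1) * np.linalg.norm(q_coords)
    scores = (doc_coords @ q_coords) / np.maximum(denom, 1e-12)
    return np.argsort(-scores), scores  # best match first

# Toy corpus: 4 terms (rows) by 3 documents (columns).
term_doc = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 0],
], dtype=float)

U_k, s_k, V_k = build_lsi(term_doc, k=2)
query = np.array([1.0, 1.0, 0.0, 0.0])  # uses the same terms as document 0
ranking, scores = rank_documents(query, U_k, s_k, V_k)
```

In this toy corpus the query shares its terms only with document 0, so `ranking[0]` is document 0. Changing `alpha` changes how strongly each latent dimension contributes to the document coordinates, while the cost of scoring a query stays the same.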