An empirical study of required dimensionality for large-scale latent semantic indexing applications
Proceedings of the 17th ACM conference on Information and knowledge management
Latent Semantic Indexing (LSI) is commonly used to match queries to documents in information retrieval applications. Compared with traditional vector space retrieval, LSI has been shown to improve retrieval performance for some collections, but not for others. In this paper, we first develop a model for understanding which values in the reduced dimensional space contain the term relationship (latent semantic) information. We then test this model by developing a modified version of LSI that captures this information, Essential Dimensions of LSI (EDLSI). EDLSI significantly improves retrieval performance on corpora that previously did not benefit from LSI, and it offers improved runtime performance compared with traditional LSI. Traditional LSI requires a dimensionality reduction parameter that must be tuned for each collection. Applying our model, we have also shown that a small, fixed dimensionality reduction parameter (k=10) suffices to capture the term relationship information in a corpus.
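The abstract describes EDLSI as combining a small, fixed-rank LSI component with traditional vector space retrieval. A minimal sketch of that idea, assuming a convex combination of the two similarity scores (the weight `x`, the toy matrix, and the query here are illustrative values, not the paper's experimental setup):

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents);
# a stand-in for a real weighted corpus matrix.
A = np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
], dtype=float)

def edlsi_scores(A, q, k=2, x=0.2):
    """Score documents by a convex mix of vector-space and rank-k LSI cosines.

    k is the (small, fixed) dimensionality reduction parameter; x is a
    hypothetical weight on the LSI component used only for illustration.
    """
    # Vector-space cosine similarity between the query and each document.
    doc_norms = np.linalg.norm(A, axis=0)
    vs = (q @ A) / (np.linalg.norm(q) * doc_norms + 1e-12)

    # Rank-k LSI: truncate the SVD of the term-document matrix.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
    q_k = (q @ Uk) / sk      # fold the query into the k-dimensional space
    docs_k = Vtk.T           # document coordinates in the k-dimensional space
    lsi = (docs_k @ q_k) / (
        np.linalg.norm(q_k) * np.linalg.norm(docs_k, axis=1) + 1e-12)

    # EDLSI: keep mostly the vector-space score, add a small LSI correction.
    return (1 - x) * vs + x * lsi

q = np.array([1, 0, 1, 0, 0], dtype=float)  # query over the 5 toy terms
scores = edlsi_scores(A, q, k=2, x=0.2)
```

Because only the first k singular triplets are needed, the truncated SVD is far cheaper to compute than the large-k decompositions traditional LSI typically requires, which is the source of the runtime advantage the abstract mentions.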