Term-weighting approaches in automatic text retrieval
Information Processing and Management: an International Journal
The significance of the Cranfield tests on index languages
SIGIR '91 Proceedings of the 14th annual international ACM SIGIR conference on Research and development in information retrieval
Class-based n-gram models of natural language
Computational Linguistics
Stochastic Complexity in Statistical Inquiry Theory
Stochastic Complexity in Statistical Inquiry Theory
Modern Information Retrieval
The mathematics of statistical machine translation: parameter estimation
Computational Linguistics - Special issue on using large corpora: II
Unsupervised learning of the morphology of a natural language
Computational Linguistics
COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 1
Automatic identification of non-compositional phrases
ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics
Text Representation: From Vector to Tensor
ICDM '05 Proceedings of the Fifth IEEE International Conference on Data Mining
TREC: Experiment and Evaluation in Information Retrieval (Digital Libraries and Electronic Publishing)
Cross-language information retrieval using PARAFAC2
Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Enhancing multilingual latent semantic analysis with term alignment information
COLING '08 Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1
Tensor Decompositions and Applications
SIAM Review
Finnish, portuguese and russian retrieval with hummingbird SearchServerTM at CLEF 2004
CLEF'04 Proceedings of the 5th conference on Cross-Language Evaluation Forum: multilingual Information Access for Text, Speech and Images
Term weighting schemes for Latent Dirichlet Allocation
HLT '10 Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Critiquing text analysis in social modeling: best practices, limitations, and new frontiers
SBP'13 Proceedings of the 6th international conference on Social Computing, Behavioral-Cultural Modeling and Prediction
Hi-index | 0.00 |
In this article, we demonstrate several novel ways in which insights from information theory (IT) and computational linguistics (CL) can be woven into a vector-space-model (VSM) approach to information retrieval (IR). Our proposals focus, essentially, on three areas: pre-processing (morphological analysis), term weighting, and alternative geometrical models to the widely used term-by-document matrix. The latter include (1) PARAFAC2 decomposition of a term-by-document-by-language tensor, and (2) eigenvalue decomposition of a term-by-term matrix (inspired by Statistical Machine Translation). We evaluate all proposals, comparing them to a ???standard??? approach based on Latent Semantic Analysis, on a multilingual document clustering task. The evidence suggests that proper consideration of IT within IR is indeed called for: in all cases, our best results are achieved using the information-theoretic variations upon the standard approach. Furthermore, we show that different information-theoretic options can be combined for still better results. A key function of language is to encode and convey information, and contributions of IT to the field of CL can be traced back a number of decades. We think that our proposals help bring IR and CL more into line with one another. In our conclusion, we suggest that the fact that our proposals yield empirical improvements is not coincidental given that they increase the theoretical transparency of VSM approaches to IR; on the contrary, they help shed light on why aspects of these approaches work as they do.