An information-theoretic, vector-space-model approach to cross-language information retrieval

Authors:
Peter A. Chew; Brett W. Bader; Stephen Helmreich; Ahmed Abdelali; Stephen J. Verzi
Affiliations:
Moss Adams LLP, Albuquerque, NM 87110-4189, USA, e-mail: peterachew@aol.com; Sandia National Laboratories, Albuquerque, NM 87185-0519, USA, e-mails: bwbader@sandia.gov, sjverzi@sandia.gov; New Mexico State University, New Mexico, 88003-8001, USA, e-mails: helmreich@zianet.com, aabdelal@nmsu.edu
Venue:
Natural Language Engineering
Year:
2011

Abstract

In this article, we demonstrate several novel ways in which insights from information theory (IT) and computational linguistics (CL) can be woven into a vector-space-model (VSM) approach to information retrieval (IR). Our proposals focus, essentially, on three areas: pre-processing (morphological analysis), term weighting, and alternative geometrical models to the widely used term-by-document matrix. The latter include (1) PARAFAC2 decomposition of a term-by-document-by-language tensor, and (2) eigenvalue decomposition of a term-by-term matrix (inspired by Statistical Machine Translation). We evaluate all proposals, comparing them to a "standard" approach based on Latent Semantic Analysis, on a multilingual document clustering task. The evidence suggests that proper consideration of IT within IR is indeed called for: in all cases, our best results are achieved using the information-theoretic variations upon the standard approach. Furthermore, we show that different information-theoretic options can be combined for still better results. A key function of language is to encode and convey information, and contributions of IT to the field of CL can be traced back a number of decades. We think that our proposals help bring IR and CL more into line with one another. In our conclusion, we suggest that the fact that our proposals yield empirical improvements is not coincidental given that they increase the theoretical transparency of VSM approaches to IR; on the contrary, they help shed light on why aspects of these approaches work as they do.
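
To make the baseline in the abstract concrete, the following is a minimal sketch of a Latent Semantic Analysis pipeline with an information-theoretic term weighting applied to the term-by-document matrix first. The abstract does not say which weighting the article uses, so pointwise mutual information (PMI, clipped at zero) is an assumption made here for illustration only. The function names (`pmi_weight`, `lsa_document_vectors`, `cosine_similarity`) and the toy count matrix are hypothetical and are not taken from the article.

```python
import numpy as np


def pmi_weight(counts):
    """Weight a term-by-document count matrix by positive PMI.

    Assumption: PMI stands in for "an information-theoretic weighting";
    the abstract does not name the scheme the authors actually use.
    """
    counts = np.asarray(counts, dtype=float)
    p_td = counts / counts.sum()             # joint P(term, doc)
    p_t = p_td.sum(axis=1, keepdims=True)    # marginal P(term)
    p_d = p_td.sum(axis=0, keepdims=True)    # marginal P(doc)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_td / (p_t * p_d))
    pmi[~np.isfinite(pmi)] = 0.0             # zero counts carry no weight
    return np.maximum(pmi, 0.0)              # keep only positive associations


def lsa_document_vectors(weighted, rank):
    """Project documents into a rank-limited space via truncated SVD."""
    _, s, vt = np.linalg.svd(weighted, full_matrices=False)
    return vt[:rank].T * s[:rank]            # one row per document


def cosine_similarity(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                  # avoid division by zero
    unit = vectors / norms
    return unit @ unit.T


# Hypothetical toy data: rows are terms, columns are documents.
# Documents 0-1 share one vocabulary, documents 2-3 share another.
counts = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 2],
])

weighted = pmi_weight(counts)
doc_vectors = lsa_document_vectors(weighted, rank=2)
sim = cosine_similarity(doc_vectors)
```

In this toy case the two groups of documents end up close to their own members and nearly orthogonal to the other group, which is the behaviour a document clustering evaluation relies on. The tensor-based (PARAFAC2) and term-by-term alternatives mentioned in the abstract are not covered by this sketch.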