Efficient Nearest-Neighbor Search in the Probability Simplex

  • Authors:
  • Kriste Krstovski; David A. Smith; Hanna M. Wallach; Andrew McGregor

  • Affiliations:
  • School of Computer Science, University of Massachusetts, Amherst, MA 01003, U.S.A.; College of Computer and Information Science, Northeastern University, Boston, MA 02115, U.S.A.; School of Computer Science, University of Massachusetts, Amherst, MA 01003, U.S.A.; School of Computer Science, University of Massachusetts, Amherst, MA 01003, U.S.A.

  • Venue:
  • Proceedings of the 2013 Conference on the Theory of Information Retrieval
  • Year:
  • 2013

Abstract

Document similarity tasks arise in many areas of information retrieval and natural language processing. A fundamental question when comparing documents is which representation to use. Topic models, which have served as versatile tools for exploratory data analysis and visualization, represent documents as probability distributions over latent topics. Systems comparing topic distributions thus use measures of probability divergence such as Kullback-Leibler, Jensen-Shannon, or Hellinger. This paper presents novel analysis and applications of the reduction of Hellinger divergence to Euclidean distance computations. This reduction allows us to exploit fast approximate nearest-neighbor (NN) techniques, such as locality-sensitive hashing (LSH) and approximate search in k-d trees, for search in the probability simplex. We demonstrate the effectiveness and efficiency of this approach on two tasks using latent Dirichlet allocation (LDA) document representations: discovering relationships between National Institutes of Health (NIH) grants and prior-art retrieval for patents. Evaluation on these tasks and on synthetic data shows that both Euclidean LSH and approximate k-d tree search perform well when a single nearest neighbor must be found. When a larger set of similar documents is to be retrieved, the k-d tree approach is more effective and efficient.
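The reduction the abstract describes follows from the definition of the Hellinger distance, H(p, q) = (1/√2)·‖√p − √q‖₂: taking the element-wise square root of each topic distribution turns Hellinger nearest-neighbor search into Euclidean nearest-neighbor search, to which k-d trees and Euclidean LSH apply. The sketch below is a minimal illustration of that idea, not the paper's implementation; the synthetic Dirichlet corpus, SciPy's cKDTree, and the eps tolerance are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hellinger distance between distributions p and q on the simplex:
#   H(p, q) = (1 / sqrt(2)) * || sqrt(p) - sqrt(q) ||_2
# The element-wise square-root map therefore reduces Hellinger NN
# search to Euclidean NN search, where k-d trees (and Euclidean
# LSH) can be used directly.

rng = np.random.default_rng(0)

# Toy stand-in for LDA output: 10,000 documents represented as
# topic distributions over 50 topics (Dirichlet samples lie on
# the probability simplex).
docs = rng.dirichlet(np.ones(50), size=10_000)

# Index the square-root-mapped corpus with a k-d tree.
tree = cKDTree(np.sqrt(docs))

query = rng.dirichlet(np.ones(50))

# eps > 0 requests approximate search, trading accuracy for speed.
dists, idx = tree.query(np.sqrt(query), k=5, eps=0.1)

# Euclidean distances in sqrt-space convert back to Hellinger.
print(idx, dists / np.sqrt(2))
```

Because the square-root map is applied once at indexing time and the scaling by 1/√2 is monotone, the neighbors returned under Euclidean distance are exactly the Hellinger neighbors of the original distributions.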