Textual Similarity with a Bag-of-Embedded-Words Model

Authors:
Stéphane Clinchant;Florent Perronnin
Affiliations:
Xerox Research Centre Europe;Xerox Research Centre Europe
Venue:
Proceedings of the 2013 Conference on the Theory of Information Retrieval
Year:
2013

Abstract

While words in documents are generally treated as discrete entities, they can be embedded in a Euclidean space that reflects an a priori notion of similarity between them. In that case, a text document can be viewed as a bag-of-embedded-words (BoEW): a set of real-valued vectors. We propose a novel document representation based on such continuous word embeddings. It consists of non-linearly mapping the word embeddings into a higher-dimensional space and aggregating them into a document-level representation. We report retrieval experiments in which the word embeddings are computed from standard topic models, and show significant improvements over the original topic models.
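
The two steps named in the abstract (a non-linear mapping of each word embedding into a higher-dimensional space, then aggregation into one document vector) can be illustrated with a minimal sketch. The abstract does not specify the mapping, so everything below is an assumption made for illustration: the centers, the `gamma` parameter, the soft-assignment weighting, and the function names `soft_assign`, `embed_word`, `document_vector` and `cosine` are all hypothetical, and this is not presented as the authors' method.

```python
import math

def soft_assign(vec, centers, gamma=1.0):
    """Softmax over negative squared distances from vec to each center (assumed mapping)."""
    logits = [-gamma * sum((v - c) ** 2 for v, c in zip(vec, center))
              for center in centers]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def embed_word(vec, centers, gamma=1.0):
    """Map a d-dim embedding to K*d dims: one weighted residual per center."""
    weights = soft_assign(vec, centers, gamma)
    mapped = []
    for w, center in zip(weights, centers):
        mapped.extend(w * (v - c) for v, c in zip(vec, center))
    return mapped

def document_vector(word_vecs, centers, gamma=1.0):
    """Aggregate the mapped word vectors of a document by averaging."""
    if not word_vecs:
        raise ValueError("document has no word vectors")
    mapped = [embed_word(v, centers, gamma) for v in word_vecs]
    return [sum(col) / len(mapped) for col in zip(*mapped)]

def cosine(a, b):
    """Cosine similarity between two document vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy data: two hypothetical centers in a 2-dim embedding space.
centers = [[0.0, 0.0], [1.0, 1.0]]
doc_a = [[0.1, 0.0], [0.9, 1.0]]   # word embeddings of document A
doc_b = [[0.0, 0.2], [1.0, 0.8]]   # word embeddings of document B
sim = cosine(document_vector(doc_a, centers), document_vector(doc_b, centers))
```

With K centers and d-dimensional embeddings, each document becomes a single K*d-dimensional vector regardless of its length, so two documents can be compared with an ordinary similarity such as cosine.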