Textual Similarity with a Bag-of-Embedded-Words Model

  • Authors: Stéphane Clinchant; Florent Perronnin

  • Affiliations: Xerox Research Centre Europe; Xerox Research Centre Europe

  • Venue: Proceedings of the 2013 Conference on the Theory of Information Retrieval
  • Year: 2013

Abstract

While words in documents are generally treated as discrete entities, they can be embedded in a Euclidean space that reflects an a priori notion of similarity between them. In that case, a text document can be viewed as a bag-of-embedded-words (BoEW): a set of real-valued vectors. We propose a novel document representation based on such continuous word embeddings. It consists of non-linearly mapping the word embeddings into a higher-dimensional space and aggregating them into a document-level representation. We report retrieval experiments in which the word embeddings are computed from standard topic models, and these show significant improvements over the original topic models.
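
The two steps named in the abstract, a non-linear mapping of each word embedding into a higher-dimensional space followed by aggregation over the document, can be illustrated with a minimal sketch. The abstract does not specify the mapping, so everything below is an assumption made for illustration: the mapping is a soft assignment of each word vector to a set of centers, weighted by the residual to each center, and the aggregation is a mean followed by L2 normalization. The function name `embed_document`, the `centers` array, and the `gamma` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np


def embed_document(word_vectors, centers, gamma=1.0):
    """Illustrative BoEW encoding (an assumed mapping, not the paper's).

    word_vectors: array of shape (n_words, dim), one embedding per word.
    centers:      array of shape (n_centers, dim), hypothetical anchor points.
    Returns an L2-normalized vector of length n_centers * dim.
    """
    # Residual of every word to every center: (n_words, n_centers, dim)
    residuals = word_vectors[:, None, :] - centers[None, :, :]

    # Squared Euclidean distances: (n_words, n_centers)
    sq_dist = (residuals ** 2).sum(axis=2)

    # Non-linear step: softmax over centers of the negative scaled distance
    logits = -gamma * sq_dist
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Each word becomes a (n_centers * dim)-dimensional feature
    features = weights[:, :, None] * residuals

    # Aggregation step: average over the words, then flatten
    doc_vector = features.mean(axis=0).ravel()

    # L2-normalize so that a dot product between documents is a cosine
    norm = np.linalg.norm(doc_vector)
    return doc_vector / norm if norm > 0 else doc_vector


# Toy usage with random data standing in for real word embeddings
rng = np.random.default_rng(0)
doc_words = rng.normal(size=(5, 3))   # 5 words, 3-dimensional embeddings
centers = rng.normal(size=(4, 3))     # 4 hypothetical centers
doc_vec = embed_document(doc_words, centers)
print(doc_vec.shape)  # (12,)
```

Because the aggregation is a mean over words, the resulting vector does not depend on word order, which matches the "bag" view of the document. Two documents encoded this way can be compared with a dot product.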