PTM: probabilistic topic mapping model for mining parallel document collections

  • Authors:
  • Duo Zhang;Jimeng Sun;ChengXiang Zhai;Abhijit Bose;Nikos Anerousis

  • Affiliations:
  • University of Illinois at Urbana-Champaign, Urbana, IL, USA;IBM T.J. Watson Research Center, Watson, NY, USA;University of Illinois at Urbana-Champaign, Urbana, IL, USA;IBM T.J. Watson Research Center, Watson, NY, USA;IBM T.J. Watson Research Center, Watson, NY, USA

  • Venue:
  • CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
  • Year:
  • 2010

Abstract

Many applications generate large volumes of parallel document collections. A parallel document collection consists of two sets of documents in which each document in one set corresponds to a document in the other, forming semantic pairs (e.g., pairs of problem and solution descriptions in a help-desk setting). Although much work has been done on text mining, little previous work has attempted to mine this kind of text data. In this paper, we propose a new probabilistic topic model, called the Probabilistic Topic Mapping (PTM) model, that mines parallel document collections to simultaneously discover latent topics in both sets of documents and the mapping from topics in one set to those in the other. We evaluate the PTM model on a real parallel document collection from the IT service domain. We show that PTM can effectively discover meaningful topics and their mappings, and that it is also useful for improving text matching and retrieval when there is a vocabulary gap.