Cluster labeling for multilingual scatter/gather using comparable corpora

  • Authors:
  • Goutham Tholpadi;Mrinal Kanti Das;Chiranjib Bhattacharyya;Shirish Shevade

  • Affiliations:
  • Computer Science and Automation, Indian Institute of Science, Bangalore, India;Computer Science and Automation, Indian Institute of Science, Bangalore, India;Computer Science and Automation, Indian Institute of Science, Bangalore, India;Computer Science and Automation, Indian Institute of Science, Bangalore, India

  • Venue:
  • ECIR'12 Proceedings of the 34th European conference on Advances in Information Retrieval
  • Year:
  • 2012

Abstract

Scatter/Gather systems are becoming increasingly useful for browsing document corpora. The usability of present-day systems is restricted to monolingual corpora, and their methods for clustering and labeling do not easily extend to the multilingual setting, especially in the absence of dictionaries or machine translation. In this paper, we study the cluster labeling problem for multilingual corpora in the absence of machine translation, but using comparable corpora. Using a variational approach, we show that multilingual topic models can effectively handle the cluster labeling problem, which in turn allows us to design a novel Scatter/Gather system, ShoBha. Experimental results on three datasets, namely the Canadian Hansards corpus, the full set of overlapping English, Hindi, and Bengali Wikipedia articles, and a trilingual news corpus containing 41,000 articles, confirm the utility of the proposed system.