Cluster labeling for multilingual scatter/gather using comparable corpora

Authors:
Goutham Tholpadi;Mrinal Kanti Das;Chiranjib Bhattacharyya;Shirish Shevade
Affiliations:
Computer Science and Automation, Indian Institute of Science, Bangalore, India;Computer Science and Automation, Indian Institute of Science, Bangalore, India;Computer Science and Automation, Indian Institute of Science, Bangalore, India;Computer Science and Automation, Indian Institute of Science, Bangalore, India
Venue:
ECIR'12 Proceedings of the 34th European conference on Advances in Information Retrieval
Year:
2012

Abstract

Scatter/Gather systems are becoming increasingly useful for browsing document corpora. The usability of present-day systems is restricted to monolingual corpora, and their methods for clustering and labeling do not easily extend to the multilingual setting, especially in the absence of dictionaries or machine translation. In this paper, we study the cluster labeling problem for multilingual corpora in the absence of machine translation, but using comparable corpora. Using a variational approach, we show that multilingual topic models can effectively handle the cluster labeling problem, which in turn allows us to design a novel Scatter/Gather system, ShoBha. Experimental results on three datasets, namely the Canadian Hansards corpus, the full set of overlapping English, Hindi and Bengali Wikipedia articles, and a trilingual news corpus containing 41,000 articles, confirm the utility of the proposed system.
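
To make the idea of labeling without translation concrete, the following is a minimal sketch of one common way a multilingual topic model can produce cluster labels. It is not taken from the paper and may differ from what ShoBha does. It assumes that topics are shared across languages, that each topic has a separate word distribution per language, and that a cluster is summarized by a weight vector over those topics. The function name `label_cluster`, the table `TOPIC_WORD_PROBS`, and all words and probabilities are hypothetical toy values.

```python
from typing import Dict, List

# Hypothetical toy model: two shared topics, each with one word
# distribution per language. All values are made up for illustration.
TOPIC_WORD_PROBS: Dict[str, List[Dict[str, float]]] = {
    "en": [
        {"election": 0.6, "vote": 0.3, "match": 0.1},  # topic 0
        {"match": 0.7, "team": 0.2, "vote": 0.1},      # topic 1
    ],
    "fr": [
        {"élection": 0.6, "scrutin": 0.4},             # topic 0
        {"match": 0.8, "équipe": 0.2},                 # topic 1
    ],
}


def label_cluster(
    topic_weights: List[float],
    topic_word_probs: Dict[str, List[Dict[str, float]]],
    k: int = 2,
) -> Dict[str, List[str]]:
    """Return the k highest-scoring label words per language.

    A word's score is the sum over topics of
    (cluster weight for the topic) * (word probability under the topic).
    Because the topics are shared, no dictionary or translation step is
    needed to obtain labels in every language.
    """
    labels: Dict[str, List[str]] = {}
    for lang, topics in topic_word_probs.items():
        scores: Dict[str, float] = {}
        for weight, word_probs in zip(topic_weights, topics):
            for word, prob in word_probs.items():
                scores[word] = scores.get(word, 0.0) + weight * prob
        # Sort by descending score; break ties alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        labels[lang] = ranked[:k]
    return labels


if __name__ == "__main__":
    # A cluster that is mostly about topic 0.
    print(label_cluster([0.9, 0.1], TOPIC_WORD_PROBS, k=2))
```

In this sketch the per-language labels come from the same cluster-level topic weights, which is what lets a comparable corpus stand in for a dictionary: the alignment between languages lives in the shared topics, not in word-to-word translations.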