Improving document clustering using automated machine translation

Authors:
Xiang Wang; Buyue Qian; Ian Davidson
Affiliations:
UC Davis, Davis, CA, USA; UC Davis, Davis, CA, USA; UC Davis, Davis, CA, USA
Venue:
Proceedings of the 21st ACM international conference on Information and knowledge management
Year:
2012

Abstract

With the development of statistical machine translation, we have ready-to-use tools that can translate documents from one language into many others. These translations provide different yet correlated views of the same set of documents. This raises an intriguing question: can we use the extra information to achieve a better clustering of the documents? Some recent work on multiview clustering has answered this question positively. In this work, we propose an alternative approach that addresses the problem within the constrained clustering framework. Unlike traditional Must-Link and Cannot-Link constraints, the constraints generated from machine translation are dense yet noisy. We show how to incorporate this type of constraint by presenting two algorithms, one parametric and one non-parametric. Our algorithms are easy to implement, efficient, and consistently improve the clustering of real data, namely the Reuters RCV1/RCV2 Multilingual Dataset. In contrast to existing multiview clustering algorithms, our technique requires neither the compatibility assumption nor the conditional independence assumption, and it does not involve subtle parameter tuning.
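
To make the idea of dense, noisy constraints concrete, the following is a minimal sketch and not either of the two algorithms the paper presents. It assumes that the similarity matrix of a machine-translated view can act as a soft constraint matrix added to the spectral clustering objective of the original view. The function names, the mixing weight `beta`, and the toy term-count matrices are all hypothetical and chosen only for illustration.

```python
import numpy as np

def cosine_affinity(X):
    """Cosine similarity between the rows of X, with a zero diagonal."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, 1e-12)
    A = Xn @ Xn.T
    np.fill_diagonal(A, 0.0)
    return np.clip(A, 0.0, None)

def sym_normalize(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2}."""
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(A.sum(axis=1), 1e-12))
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def soft_constrained_bipartition(X_orig, X_trans, beta=0.5):
    """Split documents into two clusters (illustrative stand-in only).

    X_orig  : term matrix of the documents in their original language.
    X_trans : term matrix of the same documents after machine translation;
              its similarities serve as dense, possibly noisy soft constraints.
    beta    : hypothetical weight on the constraint term.
    """
    A = sym_normalize(cosine_affinity(X_orig))   # graph from the original view
    Q = sym_normalize(cosine_affinity(X_trans))  # soft constraints from translation
    M = (1.0 - beta) * A + beta * Q
    _, vecs = np.linalg.eigh(M)                  # eigenvalues in ascending order
    v = vecs[:, -2]                              # second-largest eigenvector
    return (v > 0).astype(int)                   # its sign pattern gives the split

# Toy data: documents 0-2 share one topic, documents 3-5 another.
X_orig = np.array([[3, 2, 0, 0, 1], [2, 3, 0, 0, 1], [3, 3, 0, 0, 1],
                   [0, 0, 3, 2, 1], [0, 0, 2, 3, 1], [0, 0, 3, 3, 1]], dtype=float)
X_trans = np.array([[2, 1, 0, 0, 1], [1, 2, 0, 0, 1], [2, 2, 0, 0, 2],
                    [0, 0, 2, 1, 1], [0, 0, 1, 2, 1], [0, 0, 2, 2, 2]], dtype=float)
labels = soft_constrained_bipartition(X_orig, X_trans)
print(labels)
```

Which cluster receives label 0 and which receives label 1 depends on the arbitrary sign of the eigenvector, so only the grouping is meaningful. A fixed `beta` is the simplest possible choice here; how to weight or threshold the constraints is precisely what a parametric or non-parametric method would need to decide.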