Simultaneous similarity learning and feature-weight learning for document clustering

Authors:
Pradeep Muthukrishnan;Dragomir Radev;Qiaozhu Mei
Affiliations:
University of Michigan;University of Michigan;University of Michigan
Venue:
TextGraphs-6 Proceedings of TextGraphs-6: Graph-based Methods for Natural Language Processing
Year:
2011

Citing 13
Cited 1

A statistical approach to machine translation

Computational Linguistics
On Clustering Using Random Walks

FST TCS '01 Proceedings of the 21st Conference on Foundations of Software Technology and Theoretical Computer Science
Some Notes on Alternating Optimization

AFSS '02 Proceedings of the 2002 AFSS International Conference on Fuzzy Systems. Calcutta: Advances in Soft Computing
Cluster ensembles: a knowledge reuse framework for combining partitionings

Eighteenth national conference on Artificial intelligence
Information-theoretic co-clustering

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
A program for aligning sentences in bilingual corpora

ACL '91 Proceedings of the 29th annual meeting on Association for Computational Linguistics
Multiple kernel learning, conic duality, and the SMO algorithm

ICML '04 Proceedings of the twenty-first international conference on Machine learning
A phrase-based, joint probability model for statistical machine translation

EMNLP '02 Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10
Spectral clustering and transductive learning with multiple views

Proceedings of the 24th international conference on Machine learning
Weighted Graph Cuts without Eigenvectors A Multilevel Approach

IEEE Transactions on Pattern Analysis and Machine Intelligence
WordNet::Similarity: measuring the relatedness of concepts

HLT-NAACL--Demonstrations '04 Demonstration Papers at HLT-NAACL 2004
The ACL Anthology Network corpus

NLPIR4DL '09 Proceedings of the 2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries
Edge Weight Regularization over Multiple Graphs for Similarity Learning

ICDM '10 Proceedings of the 2010 IEEE International Conference on Data Mining

Towards an ACL anthology corpus with logical document structure: an overview of the ACL 2012 contributed task

ACL '12 Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries

Quantified Score

Hi-index	0.00

Visualization

Abstract

A key problem in document classification and clustering is learning the similarity between documents. Traditional approaches include estimating similarity between feature vectors of documents where the vectors are computed using TF-IDF in the bag-of-words model. However, these approaches do not work well when either similar documents do not use the same vocabulary or the feature vectors are not estimated correctly. In this paper, we represent documents and keywords using multiple layers of connected graphs. We pose the problem of simultaneously learning similarity between documents and keyword weights as an edge-weight regularization problem over the different layers of graphs. Unlike most feature weight learning algorithms, we propose an unsupervised algorithm in the proposed framework to simultaneously optimize similarity and the keyword weights. We extrinsically evaluate the performance of the proposed similarity measure on two different tasks, clustering and classification. The proposed similarity measure outperforms the similarity measure proposed by (Muthukrishnan et al., 2010), a state-of-the-art classification algorithm (Zhou and Burges, 2007) and three different baselines on a variety of standard, large data sets.