A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior

Authors:
Hal Daumé III;Daniel Marcu
Venue:
The Journal of Machine Learning Research
Year:
2005

Abstract

We develop a Bayesian framework for tackling the supervised clustering problem, the generic problem encountered in tasks such as reference matching, coreference resolution, identity uncertainty and record linkage. Our clustering model is based on the Dirichlet process prior, which enables us to define distributions over the countably infinite sets that naturally arise in this problem. We add supervision to our model by positing the existence of a set of unobserved random variables (we call these "reference types") that are generic across all clusters. Inference in our framework, which requires integrating over infinitely many parameters, is solved using Markov chain Monte Carlo techniques. We present algorithms for both conjugate and non-conjugate priors. We present a simple but general parameterization of our model based on a Gaussian assumption. We evaluate this model on one artificial task and three real-world tasks, comparing it against both unsupervised and state-of-the-art supervised algorithms. Our results show that our model is able to outperform other models across a variety of tasks and performance metrics.
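
As a rough illustration of why a Dirichlet process prior suits clustering with an unknown number of clusters, the sketch below draws a partition from the Chinese restaurant process, a standard sequential view of the partitions a Dirichlet process induces. It is not the model or the inference algorithm from the paper: it has no supervision, no reference types, and no Gaussian likelihood. The function name `sample_crp_partition` and the concentration parameter `alpha` are assumptions made for this example.

```python
import random


def sample_crp_partition(n_items, alpha, rng):
    """Draw cluster labels for n_items from a Chinese restaurant process.

    Illustrative only: this samples from the prior over partitions and
    ignores any observed data. alpha must be positive.
    """
    labels = []
    counts = []  # counts[k] = number of items already assigned to cluster k
    for _ in range(n_items):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1  # join an existing cluster
        labels.append(k)
    return labels


rng = random.Random(0)
labels = sample_crp_partition(20, alpha=1.0, rng=rng)
print(labels)
```

The number of clusters is not fixed in advance: it grows with the data, and larger values of `alpha` tend to produce more clusters. A posterior sampler would additionally weight each choice by how well the item fits the candidate cluster.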