Clustering Documents with Active Learning Using Wikipedia

Authors:
Anna Huang; David Milne; Eibe Frank; Ian H. Witten
Venue:
ICDM '08 Proceedings of the 2008 Eighth IEEE International Conference on Data Mining
Year:
2008

Abstract

Wikipedia has been applied as a background knowledge base to various text mining problems, but very few attempts have been made to use it for document clustering. In this paper we propose to exploit the semantic knowledge in Wikipedia for clustering, enabling the automatic grouping of documents with similar themes. Although clustering is intrinsically unsupervised, recent research has shown that incorporating supervision improves clustering performance, even when only limited supervision is provided. The approach presented in this paper applies supervision through active learning. We first use Wikipedia to create a concept-based representation of a text document, in which each concept is associated with a Wikipedia article. We then exploit the semantic relatedness between Wikipedia concepts to find pairwise instance-level constraints for supervised clustering, guiding the clustering in the direction indicated by the constraints. We test our approach on three standard text document datasets. Empirical results show that our basic document representation strategy yields performance comparable to previous attempts, and that adding constraints improves clustering performance further, by up to 20%.
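
The idea of turning concept relatedness into pairwise instance-level constraints can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `RELATEDNESS` table of scores, the best-match averaging in `document_relatedness`, the two thresholds, and the function names are all hypothetical. In particular, the sketch picks constraints by simple thresholding rather than by the active learning strategy the abstract refers to, whose details are not given here.

```python
from itertools import combinations

# Hypothetical relatedness scores in [0, 1] between Wikipedia concepts.
# A real system would compute these from Wikipedia; the values are made up.
RELATEDNESS = {
    frozenset({"Jaguar (car)", "Automobile"}): 0.9,
    frozenset({"Automobile", "Engine"}): 0.8,
    frozenset({"Jaguar (car)", "Engine"}): 0.7,
    frozenset({"Jaguar (animal)", "Big cat"}): 0.9,
}


def concept_relatedness(a, b):
    """Return the (assumed) relatedness of two concepts; unknown pairs score 0."""
    if a == b:
        return 1.0
    return RELATEDNESS.get(frozenset({a, b}), 0.0)


def document_relatedness(doc_a, doc_b):
    """Symmetric best-match average over the concepts of two documents.

    This aggregation is an illustrative choice, not the paper's measure.
    """
    def directed(src, dst):
        return sum(max(concept_relatedness(c, d) for d in dst) for c in src) / len(src)

    return (directed(doc_a, doc_b) + directed(doc_b, doc_a)) / 2


def derive_constraints(docs, must_threshold=0.8, cannot_threshold=0.1):
    """Split document pairs into must-link and cannot-link constraints."""
    must_link, cannot_link = [], []
    for (i, a), (j, b) in combinations(enumerate(docs), 2):
        score = document_relatedness(a, b)
        if score >= must_threshold:
            must_link.append((i, j))
        elif score <= cannot_threshold:
            cannot_link.append((i, j))
    return must_link, cannot_link


# Each document is represented as a set of Wikipedia concepts.
docs = [
    {"Jaguar (car)", "Automobile"},
    {"Automobile", "Engine"},
    {"Jaguar (animal)", "Big cat"},
]
must_link, cannot_link = derive_constraints(docs)
print(must_link, cannot_link)
```

In a fuller pipeline, pairs like these would be passed to a constrained clustering algorithm, which tries to keep must-link pairs together and cannot-link pairs apart.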