Succinct and informative cluster descriptions for document repositories

Authors:
Lijun Chen;Guozhu Dong
Affiliations:
Wright State University, Dayton, OH;Wright State University, Dayton, OH
Venue:
WAIM '06 Proceedings of the 7th international conference on Advances in Web-Age Information Management
Year:
2006

Abstract

Large document repositories need to be organized, summarized, and labeled to be used effectively. Previous clustering studies focused on organization and paid little attention to producing cluster labels. Without informative labels, users must browse many documents to get a sense of what the clusters contain. Human labeling of clusters is not viable when clustering is performed on demand or for very few users. It is therefore desirable to generate informative cluster descriptions (CDs) automatically, both to give users a high-level sense of the clusters and to help repository managers produce the final cluster labels. This paper studies CDs in the form of small term sets for document clusters, and investigates how to measure the quality (fidelity) of CDs and how to construct high-quality CDs. We propose using a CD-based classification to simulate how CDs are interpreted, and using the F-score of that classification to measure CD quality. Since directly searching for good CDs using the F-score is too expensive, we consider a surrogate quality measure, the CDD measure, which combines three factors: coverage, disjointness, and diversity. We give a search strategy for constructing CDs, namely a layer-based replacement method called PagodaCD. Experimental results show that the algorithm is efficient and can produce high-quality CDs. CDs produced by PagodaCD also exhibit a monotone quality behavior.
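
To make the idea of scoring CDs through a CD-based classification more concrete, here is a minimal Python sketch. The abstract does not specify the classification rule or the exact definitions behind the CDD measure, so everything below is an assumption made for illustration: documents are assigned to the cluster whose term set they overlap most (ties and zero overlap are left unassigned), the F-score is macro-averaged over clusters, and `coverage` is one plausible reading of the coverage factor. The function names and toy data are hypothetical; this is not PagodaCD or the paper's CDD formula.

```python
from typing import Dict, List, Optional, Set, Tuple

Doc = Tuple[Set[str], str]  # (terms in the document, true cluster label)


def classify(doc_terms: Set[str], cds: Dict[str, Set[str]]) -> Optional[str]:
    """Assumed rule: pick the cluster whose CD shares the most terms.

    Returns None when no CD term matches or when the best overlap is tied.
    """
    best, best_overlap, tie = None, 0, False
    for label, cd in cds.items():
        overlap = len(doc_terms & cd)
        if overlap > best_overlap:
            best, best_overlap, tie = label, overlap, False
        elif overlap == best_overlap and overlap > 0:
            tie = True
    return None if tie else best


def macro_f_score(docs: List[Doc], cds: Dict[str, Set[str]]) -> float:
    """Macro-averaged F1 of the CD-based classification (assumed averaging)."""
    predictions = [classify(terms, cds) for terms, _ in docs]
    scores = []
    for label in cds:
        tp = fp = fn = 0
        for (_, true_label), pred in zip(docs, predictions):
            if pred == label and true_label == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif true_label == label:
                fn += 1
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)


def coverage(docs: List[Doc], cds: Dict[str, Set[str]]) -> float:
    """Assumed definition: fraction of documents that contain at least one
    term from the CD of their own cluster."""
    hits = sum(1 for terms, label in docs if terms & cds[label])
    return hits / len(docs)


# Hypothetical toy data: two clusters, each described by a two-term CD.
cds = {"db": {"query", "index"}, "ml": {"classifier", "training"}}
docs = [
    ({"query", "index", "join"}, "db"),
    ({"query", "training"}, "db"),       # tied overlap, left unassigned
    ({"classifier", "training"}, "ml"),
    ({"boosting", "ensemble"}, "ml"),    # no CD term matches
]

print(macro_f_score(docs, cds))
print(coverage(docs, cds))
```

Under these assumptions, a CD that leaves many documents unmatched or ambiguous lowers recall and hence the F-score, which is the intuition behind rewarding coverage and disjointness in a cheaper surrogate measure.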