Categorization and Keyword Identification of Unlabeled Documents

Authors:
Ning Kang;Carlotta Domeniconi;Daniel Barbara
Affiliations:
George Mason University;George Mason University;George Mason University
Venue:
ICDM '05 Proceedings of the Fifth IEEE International Conference on Data Mining
Year:
2005

Citing 6
Cited 3

Automatic subspace clustering of high dimensional data for data mining applications

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Clustering and singular value decomposition for approximate indexing in high dimensional spaces

Proceedings of the seventh international conference on Information and knowledge management
Fast algorithms for projected clustering

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Locally adaptive dimensionality reduction for indexing large time series databases

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
A Monte Carlo algorithm for fast projective clustering

Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases

From Anomaly Reports to Cases

ICCBR '07 Proceedings of the 7th international conference on Case-Based Reasoning: Case-Based Reasoning Research and Development
Weighted cluster ensembles: Methods and analysis

ACM Transactions on Knowledge Discovery from Data (TKDD)
Unsupervised feature selection for text data

ECCBR'06 Proceedings of the 8th European conference on Advances in Case-Based Reasoning

Quantified Score

Hi-index	0.01

Visualization

Abstract

In this paper we first propose a global unsupervised feature selection approach for text, based on frequent itemset mining. As a result, each document is represented as a set of words that co-occur frequently in the given corpus of documents. We then introduce a locally adaptive clustering algorithm, designed to estimate (local) word relevance and, simultaneously, to group the documents. We present experimental results to demonstrate the feasibility of our approach. Furthermore, the analysis of the weights credited to terms provides evidence that the identified keywords can guide the process of label assignment to clusters. We take into consideration both spam email filtering and general classification datasets. Our analysis of the distribution of weights in the two cases provides insights on how the spam problem distinguishes from the general classification case.