Efficient algorithms for clustering and classifying high dimensional text and discretized data using interesting patterns

  • Authors:
  • John R. Kender; Hassan H. Malik

  • Affiliations:
  • Columbia University; Columbia University

  • Venue:
  • Doctoral dissertation, Columbia University
  • Year:
  • 2008


Abstract

Recent advances in data mining allow for exploiting patterns as the primary means for clustering and classifying large collections of data. In this thesis, we present three advances in pattern-based clustering technology, an advance in semi-supervised pattern-based classification, and a related advance in pattern frequency counting. In our first contribution, we analyze numerous deficiencies in traditional pattern significance measures such as support and confidence, and propose a web image clustering algorithm that uses an objective interestingness measure to identify significant patterns, yielding measurably better clustering quality. In our second contribution, we introduce the notion of closed interesting itemsets, and show that these itemsets provide significant dimensionality reduction over frequent and closed frequent itemsets. We propose GPHC, a sub-linearly scalable global pattern-based hierarchical clustering algorithm that uses closed interesting itemsets, and show that this algorithm achieves up to 11% better FScores and up to 5 times better entropies compared to state-of-the-art agglomerative, partitioning-based, and pattern-based hierarchical clustering algorithms on 9 common datasets. Our third contribution addresses problems associated with using globally significant patterns for clustering. We propose IDHC, a pattern-based hierarchical clustering algorithm that builds a cluster hierarchy without mining for globally significant patterns. Instead, IDHC allows each instance to "vote" for its representative size-2 patterns in a way that ensures an effective balance between local and global pattern significance; it also produces more descriptive cluster labels and supports a more flexible soft clustering scheme.
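The ideas of closure and interestingness filtering mentioned above can be illustrated with a minimal sketch. The data, the support threshold, and the use of lift as the interestingness measure are all assumptions for illustration; the thesis evaluates its own objective interestingness measures, which are not reproduced here.

```python
from itertools import combinations

# Toy transactions (hypothetical data for illustration only).
transactions = [
    {"a", "b"}, {"a", "b"}, {"a", "b"}, {"c"}, {"c", "d"},
]
n = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / n

# Enumerate frequent itemsets above a support threshold (brute force).
items = sorted(set().union(*transactions))
min_sup = 0.4
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = support(set(combo))
        if s >= min_sup:
            frequent[frozenset(combo)] = s

# Keep only closed itemsets: those with no proper superset of equal
# support. This is the dimensionality reduction step.
closed = {
    iset: s for iset, s in frequent.items()
    if not any(iset < other and abs(frequent[other] - s) < 1e-12
               for other in frequent)
}

# Filter by an objective interestingness measure; lift is a stand-in
# here, not the measure used in the thesis.
def lift(itemset):
    indep = 1.0
    for item in itemset:
        indep *= support({item})
    return support(itemset) / indep

interesting = {iset for iset in closed if len(iset) >= 2 and lift(iset) > 1.0}
```

On this toy data, the singletons {a} and {b} are subsumed by the equally supported pair {a, b}, showing how closure prunes redundant itemsets before the interestingness filter is applied.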
Results of experiments performed on 40 standard datasets show that IDHC almost always outperforms state-of-the-art hierarchical clustering algorithms and achieves up to 15 times better entropies, without requiring any tuning of parameter values, even on highly correlated datasets. In our fourth contribution, we propose CPHC, a semi-supervised classification algorithm that uses a pattern-based cluster hierarchy as a direct means for classification. All training and test instances are first clustered together using our instance-driven pattern-based hierarchical clustering algorithm, and the resulting cluster hierarchy is then used directly to classify test instances, eliminating the need to train a classifier on an enhanced training set. For each test instance, we first use the hierarchical structure to identify nodes that contain the test instance, and then use the labels of co-occurring training instances, weighting them in proportion to their pattern lengths, to obtain class label(s) for the test instance. Results of experiments performed on 19 standard datasets show that CPHC outperforms a number of existing classification algorithms even with sparse training data. Our final contribution deals with the problem of finding a dataset representation that offers a good space-time tradeoff for fast support (i.e., frequency) counting and also automatically identifies transactions that contain the query itemset. We compare FP Trees and Compressed Patricia Tries against several novel variants of vertical bit vectors. We compress vertical bit vectors using WAH encoding and show that simple lexicographic ordering may outperform the Gray code rank-based transaction reordering scheme in terms of RLE compression. These observations lead us to propose HDO, a novel Hamming-distance-based greedy transaction reordering scheme, and aHDO, a linear-time approximation to HDO.
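The label-weighting step of CPHC described above can be sketched as follows. The node structure, patterns, and labels are hypothetical placeholders; only the principle shown, weighting each co-occurring training label by the length of the node's pattern, comes from the abstract.

```python
from collections import defaultdict

# Hypothetical hierarchy nodes that contain a given test instance: each
# node carries the pattern that defines it and the labels of the training
# instances clustered into it (structure assumed for illustration).
nodes = [
    {"pattern": ("data",), "train_labels": ["A", "B"]},
    {"pattern": ("data", "mining"), "train_labels": ["A"]},
    {"pattern": ("data", "mining", "patterns"), "train_labels": ["A", "A"]},
]

# Weight each co-occurring training label by the length of the node's
# pattern: longer (more specific) patterns contribute more.
scores = defaultdict(float)
for node in nodes:
    weight = len(node["pattern"])
    for label in node["train_labels"]:
        scores[label] += weight

predicted = max(scores, key=scores.get)
```

Because the most specific node (pattern length 3) votes twice for label "A", the prediction is "A" despite "B" also co-occurring at the root.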
We present results of experiments performed on 15 common datasets with varying degrees of sparseness, and show that HDO-reordered, WAH encoded bit vectors may take as little as 5% of the uncompressed space, while aHDO achieves similar compression on sparse datasets. With results from over 10^9 database and data mining style frequency query executions, we show that bitmap-based approaches result in up to 10^2 times faster support counting, and that HDO-WAH encoded bitmaps offer the best space-time tradeoff.
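The vertical bit-vector representation behind these results can be sketched in a few lines: one bit vector per item, one bit per transaction, so that support counting is a bitwise AND plus a population count, and the AND result itself identifies the matching transactions. This toy version uses plain Python integers; a WAH-encoded variant would further compress each vector into literal and run-fill words, which is not shown here.

```python
# Toy transactions (hypothetical data for illustration only).
transactions = [
    {"a", "b"}, {"a"}, {"a", "b", "c"}, {"b"}, {"a", "b"},
]
items = sorted(set().union(*transactions))
n = len(transactions)

# One integer per item; bit i is set iff transaction i contains the item.
bitvec = {
    item: sum(1 << i for i, t in enumerate(transactions) if item in t)
    for item in items
}

def support_count(itemset):
    result = (1 << n) - 1          # all-ones mask over the transactions
    for item in itemset:
        result &= bitvec[item]     # intersect the vertical bit vectors
    # Popcount gives the support; the mask pinpoints which transactions
    # contain the full itemset, with no extra scan.
    return bin(result).count("1"), result

count, mask = support_count({"a", "b"})
```

Here {a, b} appears in transactions 0, 2, and 4, so the count is 3 and the surviving mask bits name those transactions directly, which is the "automatically identifies transactions" property discussed above.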