CONTOUR: an efficient algorithm for discovering discriminating subsequences

Authors:
Jianyong Wang;Yuzhou Zhang;Lizhu Zhou;George Karypis;Charu C. Aggarwal
Affiliations:
Tsinghua University, Beijing, China 100084;Tsinghua University, Beijing, China 100084;Tsinghua University, Beijing, China 100084;University of Minnesota, Minneapolis, USA 55455;IBM T.J. Watson Research Center, Hawthorne, USA 10532
Venue:
Data Mining and Knowledge Discovery
Year:
2009

Abstract

In recent years, frequent sequence mining has been applied to tasks such as feature selection for protein sequence classification and mining block correlations in storage systems. In typical applications such as clustering, only a subset of discriminating frequent subsequences is of interest, not the complete set. One way to discover this subset is to apply an existing frequent sequence mining algorithm to find the complete set of frequent subsequences and then identify the interesting subsequences among them. Unfortunately, mining the complete set of frequent subsequences is very time consuming for large sequence databases. In this paper, we propose a new algorithm, CONTOUR, which directly and efficiently mines a subset of high-quality subsequences in order to cluster the input sequences. We focus mainly on designing effective search space pruning methods to accelerate the mining process, and we discuss how to construct an accurate clustering algorithm based on the result of CONTOUR. We conducted an extensive performance study to evaluate the efficiency and scalability of CONTOUR and the accuracy of the frequent subsequence-based clustering algorithm.
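
The two-step baseline that the abstract contrasts with CONTOUR can be illustrated with a minimal Python sketch. This is not the CONTOUR algorithm, and it is not taken from the paper: the function names, the `max_len` cap, the toy database, and the selection predicate are all assumptions made for illustration. The sketch enumerates every subsequence up to a length limit, keeps those that meet a minimum support, and only then filters the result, which is why the baseline becomes expensive on large sequence databases.

```python
from itertools import combinations


def is_subsequence(pattern, sequence):
    """True if the items of pattern occur in sequence in order (gaps allowed)."""
    it = iter(sequence)
    # `item in it` advances the iterator, so matches must appear in order.
    return all(item in it for item in pattern)


def support(pattern, database):
    """Number of sequences in the database that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database)


def frequent_subsequences(database, min_sup, max_len):
    """Step 1 (naive): enumerate all subsequences up to max_len, keep frequent ones.

    The candidate set grows exponentially with sequence length; this is the
    cost that a direct mining method aims to avoid.
    """
    candidates = set()
    for seq in database:
        for k in range(1, max_len + 1):
            for idx in combinations(range(len(seq)), k):
                candidates.add(tuple(seq[i] for i in idx))
    result = {}
    for pattern in candidates:
        sup = support(pattern, database)
        if sup >= min_sup:
            result[pattern] = sup
    return result


def select(frequent, keep):
    """Step 2: post-filter the complete set with a caller-supplied predicate.

    The predicate is a placeholder; the paper's own quality criterion is not
    stated in the abstract and is not reproduced here.
    """
    return {p: s for p, s in frequent.items() if keep(p, s)}


# Hypothetical toy database of three sequences over single-character items.
db = ["abc", "abd", "xyc"]
frequent = frequent_subsequences(db, min_sup=2, max_len=2)
# Placeholder criterion: keep only patterns with at least two items.
chosen = select(frequent, keep=lambda pattern, sup: len(pattern) >= 2)
print(frequent)
print(chosen)
```

On this toy input the frequent set contains the single items `a`, `b`, `c` and the pair `(a, b)`, each with support 2, and the placeholder filter keeps only the pair. The point of the sketch is that step 1 does all of its work before step 2 discards most of the result.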