Mining long sequential patterns in a noisy environment

Authors:
Jiong Yang;Wei Wang;Philip S. Yu;Jiawei Han
Affiliations:
IBM;IBM;IBM;UIUC
Venue:
Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Year:
2002


Abstract

Pattern discovery in long sequences is important in many applications, including computational biology, consumer behavior analysis, and system performance analysis. In a noisy environment, an observed sequence may not accurately reflect the underlying behavior. For example, in a protein sequence, the amino acid N is likely to mutate to D with little impact on the biological function of the protein. It would be desirable to relate an occurrence of D in the observation to a possible mutation from N in an appropriate manner. Unfortunately, the support measure (i.e., the number of occurrences) of a pattern does not serve this purpose. In this paper, we introduce the concept of a compatibility matrix as a means of providing a probabilistic connection from the observation to the underlying true value. A new metric, match, is also proposed to capture the "real support" of a pattern, that is, the support that would be expected in a noise-free environment. In addition, in the context we address, a pattern can be very long, so the standard pruning technique developed for the market basket problem may not work efficiently. We therefore devise a novel algorithm that combines statistical sampling with a new technique, border collapsing, to discover long patterns in a minimal number of scans of the sequence database with sufficiently high confidence. Empirical results demonstrate the robustness of the match model with respect to noise and the efficiency of the probabilistic algorithm.
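To make the contrast between support and match concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not give formulas, so the sketch rests on assumptions: the compatibility matrix is read as P(true symbol | observed symbol); the match of a pattern in a window is the product of those probabilities; the match in a sequence is the best window; and the match in a database is the average over sequences. Support is taken here as the fraction of sequences containing the pattern exactly. The matrix values and all function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical compatibility matrix (illustrative values only):
# COMPAT[(true, observed)] = assumed P(true symbol | observed symbol).
# Each observed symbol's probabilities sum to 1.
COMPAT: Dict[Tuple[str, str], float] = {
    ("N", "N"): 0.9, ("D", "N"): 0.1,   # observed N
    ("N", "D"): 0.3, ("D", "D"): 0.7,   # observed D (may be a mutated N)
    ("K", "K"): 1.0,                    # observed K
}

def segment_match(pattern: str, segment: str,
                  compat: Dict[Tuple[str, str], float]) -> float:
    """Assumed definition: product of P(pattern symbol | observed symbol)."""
    score = 1.0
    for true_sym, obs_sym in zip(pattern, segment):
        score *= compat.get((true_sym, obs_sym), 0.0)
    return score

def sequence_match(pattern: str, sequence: str,
                   compat: Dict[Tuple[str, str], float]) -> float:
    """Assumed definition: best match over all contiguous windows."""
    k = len(pattern)
    if k == 0 or k > len(sequence):
        return 0.0
    return max(segment_match(pattern, sequence[i:i + k], compat)
               for i in range(len(sequence) - k + 1))

def database_match(pattern: str, database: List[str],
                   compat: Dict[Tuple[str, str], float]) -> float:
    """Assumed definition: average per-sequence match."""
    if not database:
        return 0.0
    return sum(sequence_match(pattern, s, compat) for s in database) / len(database)

def support(pattern: str, database: List[str]) -> float:
    """Fraction of sequences containing the pattern exactly (for comparison)."""
    if not database:
        return 0.0
    return sum(pattern in s for s in database) / len(database)

if __name__ == "__main__":
    db = ["KNK", "KDK", "KKK"]
    print("support:", support("NK", db))
    print("match:  ", database_match("NK", db, COMPAT))
```

With this toy data, the pattern NK occurs exactly in only one of three sequences, so its support is 1/3. Its match is about 0.4, because the observed D in KDK contributes partial credit as a possible mutation of N. That is the behavior the abstract asks of a noise-tolerant measure.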