Scalable sequential pattern mining for biological sequences

Authors:
Ke Wang;Yabo Xu;Jeffrey Xu Yu
Affiliations:
Simon Fraser University;Simon Fraser University and Chinese University of Hong Kong;Chinese University of Hong Kong
Venue:
Proceedings of the thirteenth ACM international conference on Information and knowledge management
Year:
2004


Abstract

Biosequences typically have a small alphabet, a long length, and patterns containing gaps (i.e., "don't care" positions) of arbitrary size. Mining frequent patterns in such sequences faces a different type of explosion from the one seen in transaction sequences, which are primarily motivated by market-basket analysis. In this paper, we study how this explosion affects classic sequential pattern mining, and we present a scalable two-phase algorithm to deal with it. The Segment Phase first searches for short patterns containing no gaps, called segments; this phase is efficient. The Pattern Phase then searches for long patterns containing multiple segments separated by variable-length gaps; this phase is time-consuming. The purpose of the two phases is to exploit the information obtained in the first phase to speed up pattern growth and matching and to prune the search space in the second phase. We evaluate this approach on synthetic and real-life data sets.
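
The two-phase idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm; it is a simplified illustration under stated assumptions. The function names (`frequent_segments`, `grow_patterns`), the parameters (`min_support`, `max_len`, `max_segments`, `min_gap`), and the choice to count support as the number of sequences containing a pattern are all hypothetical. The sketch also assumes every gap spans at least one position and has no upper bound, so tracking the earliest end of each match per sequence is enough to decide whether a pattern can be extended.

```python
from bisect import bisect_left
from collections import defaultdict


def frequent_segments(sequences, min_support, max_len):
    """Segment phase (sketch): find gapless substrings of length 1..max_len
    that occur in at least min_support sequences.

    Returns {segment: {sequence_id: sorted list of start positions}}.
    """
    frequent = {}
    for k in range(1, max_len + 1):
        occ = defaultdict(lambda: defaultdict(list))
        for sid, seq in enumerate(sequences):
            for i in range(len(seq) - k + 1):
                occ[seq[i:i + k]][sid].append(i)  # appended in increasing order
        level = {seg: dict(pos) for seg, pos in occ.items()
                 if len(pos) >= min_support}
        if not level:
            break  # no frequent segment of length k, so none of length k + 1
        frequent.update(level)
    return frequent


def grow_patterns(segments, min_support, max_segments, min_gap=1):
    """Pattern phase (sketch): chain frequent segments separated by gaps.

    A pattern is a tuple of segments. The stored segment positions are reused
    for matching, so the raw sequences are never rescanned.
    Returns {pattern: support} for patterns with at least two segments.
    """
    results = {}

    def extend(pattern, ends):
        # ends maps sequence_id -> earliest end index (exclusive) of a match
        for seg, occ in segments.items():
            new_ends = {}
            for sid, end in ends.items():
                starts = occ.get(sid)
                if not starts:
                    continue
                i = bisect_left(starts, end + min_gap)
                if i < len(starts):
                    new_ends[sid] = starts[i] + len(seg)
            if len(new_ends) >= min_support:
                new_pattern = pattern + (seg,)
                results[new_pattern] = len(new_ends)
                if len(new_pattern) < max_segments:
                    extend(new_pattern, new_ends)

    for seg, occ in segments.items():
        first_ends = {sid: starts[0] + len(seg) for sid, starts in occ.items()}
        extend((seg,), first_ends)
    return results


# Toy input, invented for illustration only.
toy_sequences = ["ACGTTGCA", "ACGAGCA", "TTACGCGCAT"]
toy_segments = frequent_segments(toy_sequences, min_support=3, max_len=3)
toy_patterns = grow_patterns(toy_segments, min_support=3, max_segments=2)
```

On the toy input, `toy_patterns[("ACG", "GCA")]` gives the number of sequences in which `ACG` is followed, after a gap, by `GCA`. Extending only with segments that are already known to be frequent is one simple way the first phase can prune the second.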