Learning Approximate Sequential Patterns for Classification

Authors:
Zeeshan Syed;Piotr Indyk;John Guttag
Venue:
The Journal of Machine Learning Research
Year:
2009

Abstract

In this paper, we present an automated approach to discover patterns that can distinguish between sequences belonging to different labeled groups. Our method searches for approximately conserved motifs that occur with varying statistical properties in positive and negative training examples. We propose a two-step process to discover such patterns. Using locality sensitive hashing (LSH), we first estimate the frequency of all subsequences and their approximate matches within a given Hamming radius in labeled examples. The discriminative ability of each pattern is then assessed from the estimated frequencies by concordance and rank sum testing. The use of LSH to identify approximate matches for each candidate pattern helps reduce the runtime of our method. Space requirements are reduced by decomposing the search problem into an iterative method that uses a single LSH table in memory. We propose two further optimizations to the search for discriminative patterns. Clustering with redundancy based on a 2-approximate solution of the k-center problem decreases the number of overlapping approximate groups while providing exhaustive coverage of the search space. Sequential statistical methods allow the search process to use data from only as many training examples as are needed to assess significance. We evaluated our algorithm on data sets from different applications to discover sequential patterns for classification. On nucleotide sequences from the Drosophila genome compared with random background sequences, our method was able to discover approximate binding sites that were preserved upstream of genes. We observed a similar result in experiments on ChIP-on-chip data. For cardiovascular data from patients admitted with acute coronary syndromes, our pattern discovery approach identified approximately conserved sequences of morphology variations that were predictive of future death in a test population. 
Our data showed that the use of LSH, clustering, and sequential statistics improved the running time of the search algorithm by an order of magnitude without any noticeable effect on accuracy. These results suggest that our methods may allow for an unsupervised approach to efficiently learn interesting dissimilarities between positive and negative examples that may have a functional role.
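
The two-step process described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes fixed-length windows over string sequences, uses bit-sampling LSH (hashing each window by a random subset of its positions) to collect candidate approximate matches, verifies candidates against the Hamming radius, and then scores a pattern with a plain Mann-Whitney U statistic. All function names and parameters (`build_tables`, `n_tables`, `n_positions`, and so on) are hypothetical, and the sketch keeps several hash tables in memory at once rather than iterating over a single table as the abstract describes.

```python
import random
from collections import defaultdict


def kmers(seq, w):
    """All overlapping length-w windows of seq."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def hamming(a, b):
    """Number of positions at which equal-length strings a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def build_tables(patterns, w, n_tables, n_positions, rng):
    """Bit-sampling LSH: each table keys a pattern by a random subset of positions."""
    tables = []
    for _ in range(n_tables):
        positions = sorted(rng.sample(range(w), n_positions))
        buckets = defaultdict(set)
        for p in patterns:
            buckets[tuple(p[i] for i in positions)].add(p)
        tables.append((positions, buckets))
    return tables


def approx_matches(query, tables, radius):
    """Patterns sharing a bucket with query in any table, verified by Hamming distance.

    LSH can miss true neighbors, so the result is an estimate, not an exact set.
    """
    candidates = set()
    for positions, buckets in tables:
        key = tuple(query[i] for i in positions)
        candidates |= buckets.get(key, set())
    return {c for c in candidates if hamming(query, c) <= radius}


def pattern_counts(query, sequences, w, tables, radius):
    """Per-sequence count of windows that approximately match query."""
    matches = approx_matches(query, tables, radius)
    return [sum(k in matches for k in kmers(s, w)) for s in sequences]


def rank_sum_u(pos, neg):
    """Mann-Whitney U statistic for pos versus neg; ties contribute one half."""
    return sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)


if __name__ == "__main__":
    # Toy data, made up purely for illustration.
    rng = random.Random(0)
    positives = ["ACGTTGCAAC", "TTACGTAGCA"]
    negatives = ["GGGGCCCCGG", "CCCCGGGGCC"]
    w = 4
    vocab = {k for s in positives + negatives for k in kmers(s, w)}
    tables = build_tables(vocab, w, n_tables=8, n_positions=3, rng=rng)
    pos = pattern_counts("ACGT", positives, w, tables, radius=1)
    neg = pattern_counts("ACGT", negatives, w, tables, radius=1)
    u = rank_sum_u(pos, neg)
    # U divided by the number of positive-negative pairs is a concordance-style score.
    print(pos, neg, u / (len(pos) * len(neg)))
```

Dividing U by `len(pos) * len(neg)` gives the fraction of positive-negative pairs in which the positive example has the higher count, which is one common way to express concordance. A fuller treatment would also convert U to a significance level and apply the clustering and sequential-testing optimizations mentioned in the abstract, none of which this sketch attempts.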