On effective classification of strings with wavelets

Authors:
Charu C. Aggarwal
Affiliations:
IBM T. J. Watson Research Center, Yorktown Heights, NY
Venue:
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2002

Abstract

In recent years, technological advances in mapping genes have made it increasingly easy to store and use a wide variety of biological data. Such data usually take the form of very long strings, for which it is difficult to determine the most relevant features for a classification task. For example, a typical DNA string may be millions of characters long, and a database may contain thousands of such strings. In many cases, the classification behavior of the data is hidden in the compositional behavior of certain segments of the string, which cannot easily be determined a priori. A further complication is that in some cases the classification behavior is reflected in the global behavior of the string, whereas in others it is reflected in local patterns. Given the enormous variation in the behavior of strings across data sets, it is useful to develop a classification approach that is sensitive to both their global and local behavior. To this end, we exploit the multi-resolution property of wavelet decomposition to create a scheme that can mine classification characteristics at different levels of granularity. The resulting scheme turns out to be very effective in practice on a wide range of problems.
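
To illustrate the multi-resolution property mentioned in the abstract, the following is a minimal sketch of an unnormalized Haar decomposition applied to a short string. The abstract does not say how strings are turned into numeric sequences, so the per-symbol indicator encoding is an assumption made here for illustration only. The helper names `indicator` and `haar_levels` are also hypothetical and are not taken from the paper.

```python
def indicator(string, symbol):
    """Assumed encoding: 1 where the character equals `symbol`, else 0."""
    return [1 if ch == symbol else 0 for ch in string]


def haar_levels(values):
    """Unnormalized Haar decomposition of a power-of-two-length sequence.

    Returns (overall_average, details), where details[0] holds the single
    coarsest (most global) coefficient and details[-1] the finest (most local).
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("sequence length must be a power of two")
    current = [float(v) for v in values]
    details = []
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        averages = [(current[i] + current[i + 1]) / 2 for i in pairs]
        diffs = [(current[i] - current[i + 1]) / 2 for i in pairs]
        details.append(diffs)   # finer levels are produced first
        current = averages      # recurse on the smoothed sequence
    details.reverse()           # coarsest level first
    return current[0], details


overall, details = haar_levels(indicator("AAAACGTA", "A"))
print(overall)   # fraction of positions holding 'A'
for level, coeffs in enumerate(details):
    print(level, coeffs)
```

In this sketch the coarsest coefficient contrasts the composition of the two halves of the string, while the finest coefficients respond only to differences between adjacent positions. A classifier could, under these assumptions, draw features from whichever levels turn out to be discriminative.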