Compression-based data mining of sequential data

Authors:
Eamonn Keogh;Stefano Lonardi;Chotirat Ann Ratanamahatana;Li Wei;Sang-Hee Lee;John Handley
Affiliations:
Department of Computer Science and Engineering, University of California, Riverside, USA 92521;Department of Computer Science and Engineering, University of California, Riverside, USA 92521;Department of Computer Engineering, Chulalongkorn University, Bangkok, Thailand;Department of Computer Science and Engineering, University of California, Riverside, USA 92521;Department of Anthropology, University of California, Riverside, USA 92521;Xerox Innovation Group, Xerox Corporation, New York, USA 14580-9701
Venue:
Data Mining and Knowledge Discovery
Year:
2007



Abstract

The vast majority of data mining algorithms require the user to set many input parameters. The dangers of working with parameter-laden algorithms are twofold. First, incorrect settings may cause an algorithm to fail to find the true patterns. Second, and perhaps more insidiously, the algorithm may report spurious patterns that do not really exist, or greatly overestimate the significance of the patterns it reports. This is especially likely when the user fails to understand the role of parameters in the data mining process. Data mining algorithms should therefore have as few parameters as possible. A parameter-light algorithm limits our ability to impose our prejudices, expectations, and presumptions on the problem at hand, and lets the data itself speak to us. In this work, we show that recent results in bioinformatics, learning, and computational theory hold great promise for a parameter-light data mining paradigm. The results are strongly connected to Kolmogorov complexity theory. As a practical matter, however, they can be implemented using any off-the-shelf compression algorithm with the addition of just a dozen lines of code. We show, through empirical tests on time series, DNA, text, XML, and video datasets, that this approach is competitive with or superior to many state-of-the-art approaches in anomaly/interestingness detection, classification, and clustering. As further evidence of the advantages of our method, we demonstrate its effectiveness in solving a real-world classification problem: recommending printing services and products.
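
To make the "dozen lines of code" claim concrete, here is a minimal Python sketch of a compression-based dissimilarity. It assumes the measure is the compressed size of two concatenated sequences divided by the sum of their individually compressed sizes, and it assumes zlib as the off-the-shelf compressor. The abstract states neither detail, and the function names (compressed_size, cdm, nearest_neighbor_label) are illustrative only, not the authors' code.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def cdm(x: bytes, y: bytes) -> float:
    """Compression-based dissimilarity (assumed form).

    If x and y share structure, compressing them together costs little
    more than compressing one alone, so the ratio approaches 0.5.
    If they are unrelated, the ratio approaches 1.
    """
    return compressed_size(x + y) / (compressed_size(x) + compressed_size(y))


def nearest_neighbor_label(query: bytes, labeled: list[tuple[bytes, str]]) -> str:
    """One-nearest-neighbor classification using cdm as the distance."""
    return min(labeled, key=lambda pair: cdm(query, pair[0]))[1]
```

The only tunable choice in this sketch is the compressor itself, which is the sense in which such a method can be called parameter-light. Real-valued data such as time series would first have to be converted to a discrete byte representation. That step is not shown here.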