Fast Mining of Massive Tabular Data via Approximate Distance Computations

Authors:
Affiliations:
Venue: ICDE '02 Proceedings of the 18th International Conference on Data Engineering
Year: 2002

Citing 0
Cited 10

Comparing Data Streams Using Hamming Norms (How to Zero In)
IEEE Transactions on Knowledge and Data Engineering

One-Pass Wavelet Decompositions of Data Streams
IEEE Transactions on Knowledge and Data Engineering

Locality-sensitive hashing scheme based on p-stable distributions
SCG '04 Proceedings of the twentieth annual symposium on Computational geometry

XML stream processing using tree-edit distance embeddings
ACM Transactions on Database Systems (TODS) - Special Issue: SIGMOD/PODS 2003

Stable distributions, pseudorandom generators, embeddings, and data stream computation
Journal of the ACM (JACM)

Comparing data streams using Hamming norms (how to zero in)
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases

Spectral clustering in telephone call graphs
Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis

Spectral Clustering in Social Networks
Advances in Web Mining and Web Usage Analysis

On the exact space complexity of sketching and streaming small norms
SODA '10 Proceedings of the twenty-first annual ACM-SIAM symposium on Discrete Algorithms

Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches
Foundations and Trends in Databases

Abstract

Tabular data abound in many data stores: traditional relational databases store tables, and new applications also generate massive tabular datasets. For example, consider the geographic distribution of cell phone traffic at different base stations across the country, or the evolution of traffic at Internet routers over time.

Detecting similarity patterns in such datasets (e.g., which geographic regions have similar cell phone usage distributions, or which IP subnet traffic distributions over time intervals are similar) is of great importance. Identifying such patterns poses conceptual challenges (what is a suitable similarity distance function for two "regions"?) as well as technical challenges (how can similarity computations be performed efficiently as massive tables accumulate over time?), both of which we address.

We present methods for determining similar regions in massive tabular data. Our methods compute the "distance" between any two subregions of a table: they are approximate, but highly accurate as we prove mathematically, and they are fast, running in time nearly linear in the table size. Our methods are general, since these distance computations can be applied to any mining or similarity algorithm that uses Lp norms. A novelty of our distance computation procedures is that they work for any Lp norm: not only the traditional p=2 or p=1, but for all p
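
To make the notion of an Lp "distance" between two subregions concrete, here is a minimal Python sketch. It is not the authors' procedure, which the abstract does not spell out. The exact distance follows the standard Lp definition. The approximate variant swaps in a well-known technique, Gaussian random projection, which applies only to p=2. All function names, the list-of-rows table layout, and the parameters `k` and `seed` are assumptions made for illustration.

```python
import math
import random


def lp_distance(x, y, p):
    """Exact Lp distance (sum_i |x_i - y_i|^p)^(1/p) for equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def flatten_region(table, top, left, height, width):
    """Flatten a rectangular subregion of a list-of-rows table into a vector."""
    return [table[r][c]
            for r in range(top, top + height)
            for c in range(left, left + width)]


def make_projection(dim, k, seed=0):
    """Hypothetical helper: k random directions with i.i.d. N(0, 1) entries."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]


def sketch(vec, projection):
    """Compress a vector to k numbers: one dot product per random direction."""
    return [sum(w * v for w, v in zip(row, vec)) for row in projection]


def approx_l2_distance(sketch_x, sketch_y):
    """Estimate the L2 distance from two sketches built with the same projection.

    Each coordinate of the sketch difference is Gaussian with variance
    ||x - y||_2^2, so the mean of the squared coordinates estimates the
    squared distance.
    """
    k = len(sketch_x)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(sketch_x, sketch_y)) / k)


if __name__ == "__main__":
    # Toy 8x8 table; compare the top-left and bottom-right 4x4 subregions.
    table = [[(r * 7 + c * 3) % 11 for c in range(8)] for r in range(8)]
    a = flatten_region(table, 0, 0, 4, 4)
    b = flatten_region(table, 4, 4, 4, 4)

    projection = make_projection(dim=len(a), k=400, seed=1)
    print("exact L2 :", lp_distance(a, b, 2))
    print("approx L2:", approx_l2_distance(sketch(a, projection), sketch(b, projection)))
    print("exact L1 :", lp_distance(a, b, 1))
```

In this toy the sketch is longer than the region itself, so it saves nothing. The idea only pays off when regions are far larger than `k` and sketches can be precomputed and reused across many comparisons. The estimate's accuracy improves as `k` grows.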