Streaming data reduction using low-memory factored representations

Authors:
David Littau; Daniel Boley
Affiliations:
Department of Computer Science and Engineering, University of Minnesota Twin Cities, 200 Union Street, SE, Minneapolis, MN 55455, United States (both authors)
Venue:
Information Sciences: an International Journal
Year:
2006

Abstract

Many special-purpose algorithms exist for extracting information from streaming data. Such algorithms must operate under constraints on total memory and on the average processing time per data item. These constraints are usually satisfied by deciding in advance what kind of information one wishes to extract, and then extracting only the data relevant to that goal. Here, we propose a general data representation that can be computed with modest memory and limited processing power per data item, yet permits the application of an arbitrary data mining algorithm chosen and/or adjusted after the data collection process has begun. The new representation allows a significantly larger number of data items to be analyzed at once than would be possible using the original representation of the data. The method depends on the rapid computation of a factored form of the original data set. The method is illustrated with two real datasets, one with dense and one with sparse attribute values.
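
To make the idea of a factored form concrete, the following is a minimal sketch and not the authors' method: the abstract does not say which factorization is used, so a truncated SVD stands in for it here. The stream is assumed to arrive in fixed-size blocks, each a d x n matrix with one data item per column, and only the two factors of each block are kept. The names `factor_block`, `factor_stream`, and `stored_floats`, the block sizes, and the rank are all hypothetical choices made for illustration.

```python
import numpy as np


def factor_block(block, rank):
    """Return (C, Z) with block ~= C @ Z.

    Assumption: a truncated SVD is used as a stand-in factorization;
    the source abstract does not specify the actual one.
    """
    U, s, Vt = np.linalg.svd(block, full_matrices=False)
    k = min(rank, len(s))
    C = U[:, :k] * s[:k]   # d x k: scaled left singular vectors
    Z = Vt[:k, :]          # k x n: coefficients for each data item
    return C, Z


def factor_stream(blocks, rank):
    """Factor each arriving block independently and keep only the factors."""
    return [factor_block(b, rank) for b in blocks]


def stored_floats(factors):
    """Count the numbers held in memory for the factored representation."""
    return sum(C.size + Z.size for C, Z in factors)


# Synthetic stream: 4 blocks of 200 items with 50 attributes each,
# constructed to have exact rank 3 so that the factorization is lossless.
rng = np.random.default_rng(0)
d, n, true_rank = 50, 200, 3
blocks = [
    rng.standard_normal((d, true_rank)) @ rng.standard_normal((true_rank, n))
    for _ in range(4)
]

factors = factor_stream(blocks, rank=3)
raw_floats = sum(b.size for b in blocks)

# Worst relative reconstruction error over all blocks.
rel_err = max(
    np.linalg.norm(b - C @ Z) / np.linalg.norm(b)
    for b, (C, Z) in zip(blocks, factors)
)
```

In this synthetic setup the factors hold 3,000 numbers in place of the 40,000 in the raw blocks, which is the kind of saving that lets more data items be analyzed at once. Real data is not exactly low rank, so in practice the rank would trade memory against approximation error.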