Mining discriminative items in multiple data streams

Authors:
Zhenhua Lin;Bin Jiang;Jian Pei;Daxin Jiang
Affiliations:
Simon Fraser University, Burnaby, Canada;Simon Fraser University, Burnaby, Canada;Simon Fraser University, Burnaby, Canada;Microsoft Research Asia, Beijing, China
Venue:
World Wide Web
Year:
2010

Abstract

How can we maintain a dynamic profile capturing a user's reading interest against the common interest? What are the queries that have been asked 1,000 times more frequently to a search engine from users in Asia than in North America? What are the keywords (or tags) that are 1,000 times more frequent in the blog stream on computer games than in the blog stream on Hollywood movies? To answer such interesting questions, we need to find discriminative items in multiple data streams. Each data source, such as Web search queries in a region and blog postings on a topic, can be modeled as a data stream due to the fast-growing volume of the source. Motivated by the extensive applications, in this paper, we study the problem of mining discriminative items in multiple data streams. We show that, to exactly find all discriminative items in stream $S_1$ against stream $S_2$ by one scan, the space lower bound is $\Omega(|\Sigma| \log \frac{n_1}{|\Sigma|})$, where $\Sigma$ is the alphabet of items and $n_1$ is the current size of $S_1$. To tackle the space challenge, we develop three heuristic algorithms that can achieve high precision and recall using sub-linear space and sub-linear processing time per item with respect to $|\Sigma|$. The complexity of each algorithm is independent of the sizes of the two streams. An extensive empirical study using both real data sets and synthetic data sets verifies our design.
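
To make the problem statement concrete, the following is a minimal sketch of an exact baseline, not one of the paper's three heuristic algorithms. It assumes a definition the abstract does not spell out: an item counts as discriminative when its relative frequency in the first stream is at least `theta` times its relative frequency in the second. The function name `discriminative_items` and the parameter `theta` are hypothetical names chosen for illustration.

```python
from collections import Counter


def discriminative_items(stream1, stream2, theta):
    """Exact baseline under an ASSUMED definition of "discriminative".

    An item x is reported when f1(x) / n1 >= theta * f2(x) / n2, where
    f1, f2 are its counts and n1, n2 are the stream sizes. This keeps one
    counter per distinct item, so space grows linearly with the alphabet.
    """
    counts1 = Counter(stream1)
    counts2 = Counter(stream2)
    n1 = sum(counts1.values())
    n2 = sum(counts2.values())

    result = []
    for item, f1 in counts1.items():
        f2 = counts2.get(item, 0)
        # Cross-multiplied form of f1/n1 >= theta * f2/n2, which avoids
        # division. Items absent from stream2 (f2 == 0) always qualify.
        if f1 * n2 >= theta * f2 * n1:
            result.append(item)
    return result


if __name__ == "__main__":
    s1 = ["a"] * 8 + ["b"] * 2
    s2 = ["a"] * 1 + ["b"] * 9
    print(discriminative_items(s1, s2, theta=4))
```

Because this baseline stores a count for every distinct item in both streams, its memory use is linear in $|\Sigma|$. That is the cost the abstract's lower bound says exact one-scan methods cannot avoid, and it is what motivates approximate algorithms that use sub-linear space.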