Optimal distance bounds for fast search on compressed time-series query logs

Authors:
Michail Vlachos;Suleyman S. Kozat;Philip S. Yu
Affiliations:
IBM Zürich Research Lab, Switzerland;Koç University, Istanbul, Turkey;University of Illinois at Chicago, Chicago, IL
Venue:
ACM Transactions on the Web (TWEB)
Year:
2010

Abstract

Consider a database of time series in which each data point records the total number of users who issued a specific query to an internet search engine. Storing and analyzing such logs benefits a search company in several ways. First, from a data organization perspective, query Weblogs capture important trends and statistics, so they can help enhance and optimize the search experience (keyword recommendation, discovery of news events). Second, Weblog data can serve as an important polling mechanism for the microeconomic aspects of a search engine, because it facilitates and promotes the engine's advertising facet (understanding what users request and when they request it). Given the sheer volume of time-series Weblogs, manipulating the logs in compressed form is a pressing necessity for fast data processing and compact storage. Here, we explicate how to compute lower and upper distance bounds on the time-series logs when working directly on their compressed form. Optimal distance estimation means tighter bounds, which lead to better candidate selection and elimination and ultimately to faster search. Our derivation of the optimal distance bounds rests on a careful analysis of the problem using optimization principles. The experimental evaluation suggests a clear performance advantage of the proposed method over previous compression/search techniques: it improves distance estimation by 10 to 30%, which in turn improves search performance by 25 to 80%.
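
To make the idea of distance bounds on compressed data concrete, the sketch below shows a simple baseline, not the optimal bounds derived in the paper. It assumes (this is not stated in the abstract) that a series is compressed by keeping its k largest-magnitude coefficients under an orthonormal discrete Fourier transform, together with the total energy of the discarded coefficients, and that the query is available uncompressed. The function names `orthonormal_dft`, `compress`, and `distance_bounds` are hypothetical. The bounds follow from Parseval's identity and the triangle inequality applied to the discarded part.

```python
import cmath
import math


def orthonormal_dft(x):
    """Naive O(n^2) DFT scaled by 1/sqrt(n), so Parseval's identity holds:
    Euclidean distances are the same in the time and frequency domains."""
    n = len(x)
    scale = 1.0 / math.sqrt(n)
    return [
        scale * sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]


def compress(x, k):
    """Assumed compressed form: the k largest-magnitude coefficients
    (index -> value) plus the total energy of the discarded ones."""
    coeffs = orthonormal_dft(x)
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = {i: coeffs[i] for i in order[:k]}
    omitted_energy = sum(abs(coeffs[i]) ** 2 for i in order[k:])
    return kept, omitted_energy


def distance_bounds(compressed, query):
    """Lower and upper bounds on the Euclidean distance between the original
    series and an uncompressed query, using only the compressed form."""
    kept, omitted_energy = compressed
    q = orthonormal_dft(query)
    # Exact contribution of the positions where both coefficients are known.
    exact = sum(abs(kept[i] - q[i]) ** 2 for i in kept)
    # Norms of the remaining parts: computed for the query, stored for the series.
    q_rest = math.sqrt(sum(abs(q[i]) ** 2 for i in range(len(q)) if i not in kept))
    x_rest = math.sqrt(omitted_energy)
    # Triangle inequality on the discarded part:
    # |q_rest - x_rest| <= ||discarded difference|| <= q_rest + x_rest
    lower = math.sqrt(exact + (q_rest - x_rest) ** 2)
    upper = math.sqrt(exact + (q_rest + x_rest) ** 2)
    return lower, upper


# Made-up example data, for illustration only.
series = [3.0, 5.0, 9.0, 4.0, 2.0, 1.0, 6.0, 8.0]
query = [2.0, 6.0, 7.0, 5.0, 1.0, 2.0, 5.0, 9.0]

lower, upper = distance_bounds(compress(series, 3), query)
true_dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(series, query)))
print(lower, true_dist, upper)  # the true distance lies between the two bounds
```

In a search setting, a candidate whose lower bound exceeds the best distance found so far can be discarded without decompressing it, which is why tighter bounds translate into faster search.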