Efficient and tunable similar set retrieval

Authors:
Aristides Gionis;Dimitrios Gunopulos;Nick Koudas
Affiliations:
Stanford University;University of California, Riverside;AT&T Laboratories
Venue:
SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Year:
2001

Citing 18
Cited 22

Access methods for text

ACM Computing Surveys (CSUR) - Annals of discrete mathematics, 24
Evaluation of signature files as set access facilities in OODBs

SIGMOD '93 Proceedings of the 1993 ACM SIGMOD international conference on Management of data
Fast subsequence matching in time-series databases

SIGMOD '94 Proceedings of the 1994 ACM SIGMOD international conference on Management of data
Size-estimation framework with applications to transitive closure and reachability

Journal of Computer and System Sciences
Min-wise independent permutations (extended abstract)

STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Approximate nearest neighbors: towards removing the curse of dimensionality

STOC '98 Proceedings of the thirtieth annual ACM symposium on Theory of computing
Multidimensional access methods

ACM Computing Surveys (CSUR)
Syntactic clustering of the Web

Selected papers from the sixth international conference on World Wide Web
Selectivity estimation for Boolean queries

PODS '00 Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Dimensionality reduction techniques for proximity problems

SODA '00 Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms
Information Retrieval

Information Retrieval
Object Relational DBMSs: The Next Great Wave

Object Relational DBMSs: The Next Great Wave
Efficient Similarity Search In Sequence Databases

FODO '93 Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms
Counting Twig Matches in a Tree

Proceedings of the 17th International Conference on Data Engineering
Similarity Search in High Dimensions via Hashing

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Generalized Search Trees for Database Systems

VLDB '95 Proceedings of the 21th International Conference on Very Large Data Bases
Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases

VLDB '95 Proceedings of the 21th International Conference on Very Large Data Bases
Finding Interesting Associations without Support Pruning

ICDE '00 Proceedings of the 16th International Conference on Data Engineering

Similarity estimation techniques from rounding algorithms

STOC '02 Proceedings of the thirty-fourth annual ACM symposium on Theory of computing
Efficient similarity search for market basket data

The VLDB Journal — The International Journal on Very Large Data Bases
Efficient and effective web change detection

Data & Knowledge Engineering
THESUS: Organizing Web document collections based on link semantics

The VLDB Journal — The International Journal on Very Large Data Bases
Localized signature table: fast similarity search on transaction data

Proceedings of the thirteenth ACM international conference on Information and knowledge management
Multimedia Correlation Analysis in Unstructured Peer-to-Peer Networks

WOWMOM '06 Proceedings of the 2006 International Symposium on World of Wireless, Mobile and Multimedia Networks
A combination of trie-trees and inverted files for the indexing of set-valued attributes

CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Detecting near-duplicates for web crawling

Proceedings of the 16th international conference on World Wide Web
Mining taxonomies of process models

Data & Knowledge Engineering
Efficient Similarity Search for Tree-Structured Data

SSDBM '08 Proceedings of the 20th international conference on Scientific and Statistical Database Management
Fast approximate duplicate detection for 2D-NMR spectra

DILS'07 Proceedings of the 4th international conference on Data integration in the life sciences
b-Bit minwise hashing

Proceedings of the 19th international conference on World wide web
TACO: tunable approximate computation of outliers in wireless sensor networks

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Random hyperplane projection using derived dimensions

Proceedings of the Ninth ACM International Workshop on Data Engineering for Wireless and Mobile Access
Efficient answering of set containment queries for skewed item distributions

Proceedings of the 14th International Conference on Extending Database Technology
Theory and applications of b-bit minwise hashing

Communications of the ACM
Hierarchical semantic-based index for ad hoc image retrieval

Journal of Mobile Multimedia
Similarity search in transaction databases with a two-level bounding mechanism

DASFAA'06 Proceedings of the 11th international conference on Database Systems for Advanced Applications
Distributed similarity estimation using derived dimensions

The VLDB Journal — The International Journal on Very Large Data Bases
Similarity search in sensor networks using semantic-based caching

Journal of Network and Computer Applications
Efficient processing of probabilistic set-containment queries on uncertain set-valued data

Information Sciences: an International Journal
In-network approximate computation of outliers with quality guarantees

Information Systems

Quantified Score

Hi-index	0.02

Abstract

Set value attributes are a concise and natural way to model complex data sets. Modern Object Relational systems support set value attributes and allow various query capabilities on them. In this paper we initiate a formal study of indexing techniques for set value attributes based on similarity, for suitably defined notions of similarity between sets. Such techniques are necessary in modern applications such as recommendations through collaborative filtering and automated advertising. Our techniques are probabilistic and approximate in nature. As a design principle, we create structures that make use of well-known and widely used data structuring techniques, as a means to ease integration with existing infrastructure.

We show how the problem of indexing a collection of sets based on similarity can be reduced to the problem of indexing binary vectors in Hamming space, where the vectors are encoded in a way that preserves similarity. This reduces the problem to one of similarity query processing in Hamming space. We then introduce and analyze two data structure primitives that we use in cooperation to perform similarity query processing in a Hamming space. We show how the resulting indexing technique can be optimized for properties of interest by formulating constraint optimization problems based on the space one is willing to devote to indexing. Finally, we present experimental results from a prototype implementation of our techniques using real-life datasets, exploring the accuracy and efficiency of our overall approach as well as the quality of our solutions to problems related to the optimization of the indexing scheme.
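
The abstract does not spell out the similarity-preserving encoding, so the following is only a minimal sketch of the general idea, assuming Jaccard similarity and generic min-wise hashing (the technique named in the cited "Min-wise independent permutations" entry), not the authors' actual construction. Each set is summarized by a fixed-length signature, and the fraction of positions where two signatures disagree (a Hamming-style distance) approximates one minus their similarity. The function names, the choice of BLAKE2b as the hash, and the signature length `k` are illustrative assumptions; turning the signature entries into bits is left out.

```python
import hashlib


def salted_hash(salt, item):
    """Deterministic 64-bit hash of an item under a given salt (illustrative choice)."""
    data = f"{salt}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def signature(items, k=256):
    """Min-hash signature: for each of k salts, keep the smallest hash over the set."""
    return [min(salted_hash(salt, item) for item in items) for salt in range(k)]


def hamming_distance(sig_a, sig_b):
    """Number of positions where two equal-length signatures disagree."""
    return sum(1 for u, v in zip(sig_a, sig_b) if u != v)


def estimated_similarity(sig_a, sig_b):
    """Fraction of agreeing positions, an estimate of the Jaccard similarity."""
    return 1.0 - hamming_distance(sig_a, sig_b) / len(sig_a)


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


# Two hypothetical sets sharing 60 of their 100 distinct items.
basket_a = set(range(0, 80))
basket_b = set(range(20, 100))

sig_a = signature(basket_a)
sig_b = signature(basket_b)

exact = jaccard(basket_a, basket_b)
approx = estimated_similarity(sig_a, sig_b)
print(f"exact={exact:.2f} estimated={approx:.2f}")
```

A longer signature tightens the estimate at the cost of more space per set, which is one simple instance of the space-versus-accuracy trade-off the abstract refers to.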