Set value attributes are a concise and natural way to model complex data sets. Modern object-relational systems support set value attributes and allow various query capabilities on them. In this paper we initiate a formal study of indexing techniques for set value attributes based on similarity, for suitably defined notions of similarity between sets. Such techniques are necessary in modern applications such as recommendations through collaborative filtering and automated advertising. Our techniques are probabilistic and approximate in nature. As a design principle, we create structures that make use of well-known and widely used data structuring techniques, as a means to ease integration with existing infrastructure.

We show how the problem of indexing a collection of sets based on similarity can be reduced to the problem of indexing binary vectors in Hamming space, where the vectors are encoded in a way that preserves similarity; this reduces the problem to one of similarity query processing in Hamming space. Then we introduce and analyze two data structure primitives that we use together to perform similarity query processing in a Hamming space. We show how the resulting indexing technique can be optimized for properties of interest by formulating constraint optimization problems based on the space one is willing to devote to indexing. Finally, we present experimental results from a prototype implementation of our techniques using real-life datasets, exploring the accuracy and efficiency of our overall approach as well as the quality of our solutions to problems related to the optimization of the indexing scheme.
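To make the reduction to Hamming space concrete, the following is a minimal sketch of one possible similarity-preserving encoding. It is not taken from the paper: it assumes Jaccard similarity as the notion of set similarity and uses a 1-bit min-hash encoding, where each bit is the lowest bit of a seeded min-hash. The names `minhash_bits`, `hamming_distance`, and `estimate_jaccard` are hypothetical, and the seeded BLAKE2b hash stands in for a family of random permutations.

```python
import hashlib


def _hash(seed: int, item: str) -> int:
    """Deterministic 64-bit hash of an item under a given seed (assumed hash family)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_bits(items: set, num_bits: int = 256) -> list:
    """Encode a non-empty set as a binary vector.

    Bit i is the lowest bit of the minimum hash value of the set under seed i.
    """
    if not items:
        raise ValueError("cannot encode an empty set")
    return [min(_hash(seed, x) for x in items) & 1 for seed in range(num_bits)]


def hamming_distance(a: list, b: list) -> int:
    """Number of positions at which two equal-length bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity |A & B| / |A | B|."""
    return len(a & b) / len(a | b)


def estimate_jaccard(bits_a: list, bits_b: list) -> float:
    """Estimate Jaccard similarity from the fraction of agreeing bits.

    Under the idealized assumption that bits agree with probability (1 + J) / 2,
    J is approximately 2 * agreement - 1 (clamped at 0).
    """
    agreement = 1 - hamming_distance(bits_a, bits_b) / len(bits_a)
    return max(0.0, 2 * agreement - 1)


if __name__ == "__main__":
    a = {"a", "b", "c", "d", "e", "f", "g", "h"}
    b = {"a", "b", "c", "d", "e", "f", "g", "x"}   # shares 7 of 9 distinct items with a
    c = {"p", "q", "r", "s", "t", "u", "v", "w"}   # disjoint from a
    va, vb, vc = minhash_bits(a), minhash_bits(b), minhash_bits(c)
    print("exact J(a, b):", jaccard(a, b))
    print("estimated J(a, b):", estimate_jaccard(va, vb))
    print("Hamming(a, b):", hamming_distance(va, vb))
    print("Hamming(a, c):", hamming_distance(va, vc))
```

In this sketch, similar sets tend to map to vectors that are close in Hamming distance, so a similarity query over sets becomes a near-neighbor query over bit vectors. The estimate is probabilistic and approximate; using more bits narrows its variance at the cost of space.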