Distributed similarity estimation using derived dimensions

Authors:
Konstantinos Georgoulas;Yannis Kotidis
Affiliations:
Athens University of Economics and Business, Athens, Greece;Athens University of Economics and Business, Athens, Greece
Venue:
The VLDB Journal — The International Journal on Very Large Data Bases
Year:
2012

Abstract

Computing the similarity between data objects is a fundamental operation in many distributed applications, such as those on the World Wide Web, in peer-to-peer networks, and in sensor networks. We present a framework based on Random Hyperplane Projection (RHP) that permits continuous computation of similarity estimates, using either the cosine similarity or the correlation coefficient as the similarity metric, between data descriptions that are streamed from remote sites. These estimates are computed at a monitoring node without transmitting the actual data values. The original RHP framework is data agnostic and works for arbitrary data sets; however, data in most applications is not uniform. We first describe the shortcomings of the RHP scheme, in particular its inability to exploit evident skew in the underlying data distribution. We then propose a novel framework that automatically detects correlations and computes an RHP embedding in the Hamming cube tailored to the provided data set, using the idea of derived dimensions, which we introduce in this work. We further discuss extensions of our framework that cope with changes in the data distribution: in such cases, our technique automatically reverts to the basic RHP model for data items that cannot be described accurately through the computed embedding. Our experimental evaluation on several real and synthetic data sets demonstrates that the proposed scheme outperforms the existing RHP algorithm and previously proposed alternative techniques, providing significantly more accurate similarity computations using the same number of bits.
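
For readers unfamiliar with the baseline, the following is a minimal sketch of the basic, data-agnostic RHP scheme that the abstract refers to. It is not the derived-dimensions framework proposed in the paper. It assumes that every site shares the same random seed, so that all sites use identical hyperplanes, and it relies on the standard property that two vectors at angle θ fall on opposite sides of a random hyperplane with probability θ/π. All function names and parameters (`make_hyperplanes`, `rhp_signature`, `estimate_cosine`, `num_bits`, `seed`) are hypothetical and chosen only for illustration.

```python
import math
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian normal vectors (assumed shared by all sites)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def rhp_signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return [
        1 if sum(r * x for r, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    ]


def estimate_cosine(sig_a, sig_b):
    """Estimate cosine similarity from the Hamming distance of two signatures."""
    hamming = sum(a != b for a, b in zip(sig_a, sig_b))
    theta = math.pi * hamming / len(sig_a)  # estimated angle between the vectors
    return math.cos(theta)


def exact_cosine(u, v):
    """Exact cosine similarity, used here only to check the estimate."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# A remote site would transmit only the bit signature, never the raw values.
planes = make_hyperplanes(num_bits=4096, dim=4, seed=42)
u = [1.0, 0.0, 0.0, 0.0]
v = [1.0, 1.0, 0.0, 0.0]
estimate = estimate_cosine(rhp_signature(u, planes), rhp_signature(v, planes))
print(estimate, exact_cosine(u, v))
```

In this sketch the monitoring node needs only the bit signatures, and accuracy improves as `num_bits` grows. To estimate the correlation coefficient rather than the cosine similarity, one option is to subtract each vector's mean from its components before computing the signature, since the correlation coefficient equals the cosine similarity of the mean-centered vectors.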