Computing the similarity between data objects is a fundamental operation for many distributed applications, such as those on the World Wide Web, in Peer-to-Peer networks, or even in Sensor Networks. Locality Sensitive Hashing (LSH) has recently been proposed to reduce the number of bits that must be transmitted between sites in order to evaluate different similarity functions between the data objects. In our work we investigate a particular form of LSH, termed Random Hyperplane Projection (RHP). RHP is a data-agnostic model that works for arbitrary data sets. However, data in most applications is not uniform. We first describe the shortcomings of the RHP scheme, in particular its inability to exploit evident skew in the underlying data distribution, and then propose a novel framework that automatically detects correlations and computes an RHP embedding in the Hamming cube tailored to the provided data set. We further discuss extensions of our framework that cope with changes in the data distribution or with outliers. In such cases our technique automatically reverts to the basic RHP model for data items that cannot be described accurately through the computed embedding. Our experimental evaluation using several real datasets demonstrates that our proposed scheme outperforms the existing RHP algorithm, providing up to three times more accurate similarity computations using the same number of bits.
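For readers unfamiliar with the basic RHP model mentioned above, the following is a minimal sketch of the standard, data-agnostic scheme as it is commonly described, not the tailored framework proposed in this work. Each bit of a signature records on which side of a random hyperplane a vector falls, and the fraction of differing bits between two signatures estimates the angle between the original vectors. All function names, the Gaussian sampling of hyperplanes, and the fixed seed are assumptions made for illustration.

```python
import math
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplane normals with Gaussian components.

    The seed is an illustrative assumption: sites that want comparable
    signatures must use the same hyperplanes.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def rhp_signature(vector, hyperplanes):
    """Map a vector to a bit list: 1 if it lies on the non-negative side."""
    return [
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vector)) >= 0 else 0
        for h in hyperplanes
    ]


def hamming_distance(sig_a, sig_b):
    """Count the positions where two equal-length signatures differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


def estimate_cosine(sig_a, sig_b):
    """Estimate cosine similarity from two signatures.

    Two vectors at angle theta disagree on a random hyperplane with
    probability theta / pi, so theta is approximated by
    pi * (fraction of differing bits).
    """
    theta = math.pi * hamming_distance(sig_a, sig_b) / len(sig_a)
    return math.cos(theta)


def exact_cosine(u, v):
    """Exact cosine similarity, used here only to check the estimate."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    planes = make_hyperplanes(num_bits=2048, dim=8, seed=42)
    u = [1.0, 2.0, 0.5, 0.0, 3.0, 1.5, 0.2, 0.9]
    v = [0.8, 2.2, 0.1, 0.3, 2.5, 1.0, 0.0, 1.1]
    sig_u = rhp_signature(u, planes)
    sig_v = rhp_signature(v, planes)
    print("estimated:", estimate_cosine(sig_u, sig_v))
    print("exact:    ", exact_cosine(u, v))
```

In this sketch only the bit signatures would need to be exchanged between sites, which is the source of the bandwidth savings. Because the hyperplanes are drawn without looking at the data, the sketch also illustrates why plain RHP cannot exploit skew in the data distribution.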