All-pairs similarity search can be implemented in two stages. The first stage partitions the data and groups potentially similar vectors. The second stage runs a set of tasks, each of which compares one partition of vectors with other candidate partitions. Because the data is sparse, accessing feature vectors in memory for runtime comparison in the second stage incurs significant overhead due to the memory hierarchy. This paper proposes a cache-conscious data layout and traversal optimization that reduces execution time through size-controlled data splitting and vector coalescing. It also provides an analysis to guide the choice of parameter settings. Our evaluation with several application datasets verifies the performance gains obtained by the optimization and shows that the proposed scheme is up to 2.74x as fast as the cache-oblivious baseline.
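The two-stage structure and the idea of size-controlled splitting can be illustrated with a minimal sketch. It is not the paper's implementation: the fixed-size partitioning, the dictionary-based sparse vectors, the dot-product similarity, and the names `all_pairs`, `compare_task`, `partition_size`, and `split_size` are all assumptions made for illustration, and vector coalescing is not shown. In Python the loop order only conveys the traversal pattern; any cache benefit would have to be measured in a lower-level implementation.

```python
from itertools import combinations

def dot(u, v):
    """Dot product of two sparse vectors stored as {feature_id: weight}."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v[f] for f, w in u.items() if f in v)

def split(ids, size):
    """Cut a list of vector ids into consecutive chunks of at most `size`."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

def compare_task(vectors, own_ids, other_ids, split_size, threshold):
    """Stage 2 task: compare one partition against another partition.

    The task's own partition is processed in small splits (hypothetical
    parameter `split_size`) so that one split is reused against every
    incoming vector before the next split is touched.
    """
    results = []
    for block in split(own_ids, split_size):
        for j in other_ids:          # stream the other partition's vectors
            vj = vectors[j]
            for i in block:          # reuse the small block repeatedly
                s = dot(vectors[i], vj)
                if s >= threshold:
                    results.append((min(i, j), max(i, j), s))
    return results

def all_pairs(vectors, partition_size, split_size, threshold):
    """Stage 1 (assumed: naive fixed-size partitioning) plus stage 2 tasks."""
    partitions = split(sorted(vectors), partition_size)
    found = []
    for p, part in enumerate(partitions):
        # pairs inside the same partition
        for i, j in combinations(part, 2):
            s = dot(vectors[i], vectors[j])
            if s >= threshold:
                found.append((i, j, s))
        # pairs between this partition and each later partition
        for other in partitions[p + 1:]:
            found.extend(compare_task(vectors, part, other, split_size, threshold))
    return sorted(found)

vectors = {
    0: {1: 1.0},
    1: {1: 1.0},
    2: {2: 1.0},
    3: {1: 0.6, 2: 0.8},
}
pairs = all_pairs(vectors, partition_size=2, split_size=1, threshold=0.7)
```

Changing `split_size` changes only the traversal order, not the set of pairs reported, which is why it can be tuned independently of correctness.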