Scaling up top-K cosine similarity search

Authors:
Shiwei Zhu;Junjie Wu;Hui Xiong;Guoping Xia
Affiliations:
Information Systems Department, School of Economics and Management, Beihang University, Beijing 100191, China;Information Systems Department, School of Economics and Management, Beihang University, Beijing 100191, China;Management Science and Information Systems Department, Rutgers Business School - Newark and New Brunswick, Rutgers University, Newark, NJ 07102, USA;Information Systems Department, School of Economics and Management, Beihang University, Beijing 100191, China
Venue:
Data & Knowledge Engineering
Year:
2011



Abstract

Recent years have witnessed increased interest in computing cosine similarity in many application domains. Most previous studies require the user to specify a minimum similarity threshold before the cosine similarity computation can be performed. In practice, however, it is usually difficult for users to provide an appropriate threshold. In this paper, we instead propose to search for the top-K strongly correlated pairs of objects as measured by cosine similarity. Specifically, we first identify the monotone property of an upper bound of the cosine measure and exploit a diagonal traversal strategy to develop the TOP-DATA algorithm. We then observe that a diagonal traversal strategy usually incurs higher I/O costs. We therefore develop a max-first traversal strategy and propose the TOP-MATA algorithm. A theoretical analysis shows that TOP-MATA avoids computation for false-positive item pairs and can significantly reduce I/O costs. Finally, experimental results demonstrate the computational efficiency of both TOP-DATA and TOP-MATA, and show that TOP-MATA is particularly scalable for large data sets with a large number of items.
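
The general idea of threshold-free top-K search with an upper bound can be illustrated with a small sketch. This is not the TOP-DATA or TOP-MATA algorithm, whose traversal details the abstract does not give. It assumes binary transaction data, where cos(A, B) = supp(AB) / sqrt(supp(A) * supp(B)), and it uses the elementary bound cos(A, B) <= sqrt(supp(B) / supp(A)) for supp(A) >= supp(B), which follows from supp(AB) <= supp(B). The function name top_k_cosine_pairs and the simple row-by-row scan are illustrative assumptions.

```python
import heapq
from math import sqrt


def top_k_cosine_pairs(transactions, k):
    """Return up to k (cosine, item_a, item_b) tuples, best first.

    Illustrative sketch only: assumes binary (set-valued) transactions
    and a plain row-by-row scan, not the paper's traversal strategies.
    """
    if k <= 0:
        return []

    # Map each item to the set of transaction ids that contain it.
    tids = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tids.setdefault(item, set()).add(tid)

    # Sort items by support, largest first, so supp[i] >= supp[j] for i < j.
    items = sorted(tids, key=lambda it: (-len(tids[it]), it))
    supp = [len(tids[it]) for it in items]

    heap = []  # min-heap; heap[0] holds the current k-th best pair
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            # supp(AB) <= supp(B) <= supp(A) gives this upper bound.
            upper = sqrt(supp[j] / supp[i])
            if len(heap) == k and upper <= heap[0][0]:
                # Within a row the bound only shrinks as j grows, so the
                # rest of the row cannot beat the current k-th best.
                break
            joint = len(tids[items[i]] & tids[items[j]])
            cos = joint / sqrt(supp[i] * supp[j])
            if len(heap) < k:
                heapq.heappush(heap, (cos, items[i], items[j]))
            elif cos > heap[0][0]:
                heapq.heapreplace(heap, (cos, items[i], items[j]))
    return sorted(heap, reverse=True)
```

For example, top_k_cosine_pairs([{"a", "b"}, {"a", "b"}, {"a", "b", "c"}, {"c"}, {"a"}], 1) finds the pair ("a", "b") and then skips every remaining pair, because each remaining upper bound is already no larger than the best cosine found. The k-th best similarity seen so far plays the role that a user-supplied threshold plays in threshold-based methods.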