Top-k typicality queries and efficient query answering methods on large databases

Authors:
Ming Hua;Jian Pei;Ada W. Fu;Xuemin Lin;Ho-Fung Leung
Affiliations:
Simon Fraser University, Burnaby, Canada;Simon Fraser University, Burnaby, Canada;The Chinese University of Hong Kong, Shatin, Hong Kong, China;The University of New South Wales, Sydney, Australia and NICTA, Sydney, Australia;The Chinese University of Hong Kong, Shatin, Hong Kong, China
Venue:
The VLDB Journal — The International Journal on Very Large Data Bases
Year:
2009

Abstract

Finding typical instances is an effective approach to understanding and analyzing large data sets. In this paper, we apply the idea of typicality analysis from psychology and cognitive science to database query answering, and study the novel problem of answering top-k typicality queries. We model typicality in large data sets systematically and formulate three types of top-k typicality queries. To answer questions like "Who are the top-k most typical NBA players?", we develop the measure of simple typicality. To answer questions like "Who are the top-k most typical guards that distinguish guards from other players?", we propose the notion of discriminative typicality. Moreover, to answer questions like "Who are the best k typical guards that together represent the different types of guards?", we use the notion of representative typicality. Computing the exact answer to a top-k typicality query requires quadratic time, which is often too costly for online query answering on large databases. We develop a series of approximation methods for various situations: (1) the randomized tournament algorithm has linear complexity, though it does not provide a theoretical guarantee on the quality of the answers; (2) the direct local typicality approximation using VP-trees provides an approximation quality guarantee; and (3) a local typicality tree data structure can be exploited to index a large set of objects, so that typicality queries can be answered efficiently, with quality guarantees, by a tournament method based on the local typicality tree. An extensive performance study using two real data sets and a series of synthetic data sets shows that top-k typicality queries are meaningful and that our methods are practical.
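
To make the contrast between the quadratic exact computation and the linear randomized tournament concrete, the following is a minimal sketch rather than the authors' implementation. It assumes one-dimensional numeric objects and scores simple typicality with a Gaussian kernel density estimate of fixed bandwidth `h`. The abstract does not specify the measure, so that choice is an assumption, and the names `simple_typicality`, `exact_top_k`, `tournament_winner`, and `group_size` are hypothetical.

```python
import math
import random


def simple_typicality(o, group, h=1.0):
    """Score o against group with a Gaussian kernel density estimate.

    Assumption: the abstract does not define the measure; a fixed-bandwidth
    Gaussian kernel over absolute differences is used here for illustration.
    """
    n = len(group)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-((o - x) ** 2) / (2.0 * h * h)) for x in group)


def exact_top_k(data, k, h=1.0):
    """Exact answer: each object is scored against all n objects, O(n^2)."""
    return sorted(data, key=lambda o: simple_typicality(o, data, h), reverse=True)[:k]


def tournament_winner(data, group_size=4, h=1.0, rng=random):
    """One randomized tournament (hypothetical sketch).

    Objects are shuffled into small groups, the most typical object within
    each group advances, and rounds repeat until one object remains. Each
    round costs O(n * group_size), and n shrinks geometrically, so the total
    cost is linear in n for a constant group size. The winner is a heuristic
    answer with no quality guarantee.
    """
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    players = list(data)
    while len(players) > 1:
        rng.shuffle(players)
        winners = []
        for i in range(0, len(players), group_size):
            group = players[i:i + group_size]
            winners.append(max(group, key=lambda o: simple_typicality(o, group, h)))
        players = winners
    return players[0]


if __name__ == "__main__":
    values = [0.0, 9.0, 10.0, 11.0, 20.0]
    print(exact_top_k(values, 1))
    print(tournament_winner(values, group_size=2, rng=random.Random(0)))
```

Because a single tournament can eliminate a globally typical object early if it lands in an unlucky group, one plausible refinement is to run several independent tournaments and keep the winner that scores best against the full data set. This is a design choice in the sketch, not something the abstract states.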