Supporting ranking and clustering as generalized order-by and group-by

Authors:
Chengkai Li;Min Wang;Lipyeow Lim;Haixun Wang;Kevin Chen-Chuan Chang
Affiliations:
University of Illinois at Urbana-Champaign, Urbana, IL;IBM T.J. Watson Research Center, Hawthorne, NY;IBM T.J. Watson Research Center, Hawthorne, NY;IBM T.J. Watson Research Center, Hawthorne, NY;University of Illinois at Urbana-Champaign, Urbana, IL
Venue:
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Year:
2007

References: 26
Cited by: 15

Multi-table joins through bitmapped join indices

ACM SIGMOD Record
BIRCH: an efficient data clustering method for very large databases

SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Improved query performance with variant indexes

SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
On saying “Enough already!” in SQL

SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
An efficient bitmap encoding scheme for selection queries

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Fast and effective text mining using linear-time document clustering

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
CACTUS—clustering categorical data using summaries

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Data clustering: a review

ACM Computing Surveys (CSUR)
Mining the stock market (extended abstract): which measure is best?

Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Data mining: concepts and techniques

Data mining: concepts and techniques
Scalability for clustering algorithms revisited

ACM SIGKDD Explorations Newsletter
Optimal aggregation algorithms for middleware

PODS '01 Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
PREFER: a system for the efficient execution of multi-parametric ranked queries

SIGMOD '01 Proceedings of the 2001 ACM SIGMOD international conference on Management of data
Encoded Bitmap Indexing for Data Warehouses

ICDE '98 Proceedings of the Fourteenth International Conference on Data Engineering
Model 204 Architecture and Performance

Proceedings of the 2nd International Workshop on High Performance Transaction Systems
WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases

VLDB '98 Proceedings of the 24th International Conference on Very Large Data Bases
Evaluating Top-k Selection Queries

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
Probabilistic Optimization of Top N Queries

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
STING: A Statistical Information Grid Approach to Spatial Data Mining

VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
Automatic categorization of query results

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
RankSQL: query algebra and optimization for relational top-k queries

Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Optimizing bitmap indices with efficient compression

ACM Transactions on Database Systems (TODS)
A Generalized K-Means Algorithm with Semi-Supervised Weight Coefficients

ICPR '06 Proceedings of the 18th International Conference on Pattern Recognition - Volume 03
Supporting top-K join queries in relational databases

VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
Bitmap indexes for large scientific data sets: a case study

IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Weighted k-means for density-biased clustering

DaWaK'05 Proceedings of the 7th international conference on Data Warehousing and Knowledge Discovery

Cluster By: a new sql extension for spatial data aggregation

Proceedings of the 15th annual ACM international symposium on Advances in geographic information systems
Probabilistic ranked queries in uncertain databases

EDBT '08 Proceedings of the 11th international conference on Extending database technology: Advances in database technology
Efficient computation of personal aggregate queries on blogs

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Exploiting similarity-aware grouping in decision support systems

Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
Rank-aware clustering of structured datasets

Proceedings of the 18th ACM conference on Information and knowledge management
Cluster based rank query over multidimensional data streams

Proceedings of the 18th ACM conference on Information and knowledge management
Using trees to depict a forest

Proceedings of the VLDB Endowment
Grouping Results of Queries to Ontological Knowledge Bases by Conceptual Clustering

ICCCI '09 Proceedings of the 1st International Conference on Computational Collective Intelligence. Semantic Web, Social Networks and Multiagent Systems
Query Results Clustering by Extending SPARQL with CLUSTER BY

OTM '09 Proceedings of the Confederated International Workshops and Posters on On the Move to Meaningful Internet Systems: ADI, CAMS, EI2N, ISDE, IWSSA, MONET, OnToContent, ODIS, ORM, OTM Academy, SWWS, SEMELS, Beyond SAWSDL, and COMBEK 2009
Ranking weak-linked documents on the web

FSKD'09 Proceedings of the 6th international conference on Fuzzy systems and knowledge discovery - Volume 1
Querying streaming point clusters as regions

Proceedings of the ACM SIGSPATIAL International Workshop on GeoStreaming
On a fuzzy group-by and its use for fuzzy association rule mining

ADBIS'10 Proceedings of the 14th east European conference on Advances in databases and information systems
Making interval-based clustering rank-aware

Proceedings of the 14th International Conference on Extending Database Technology
Skimmer: rapid scrolling of relational query results

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Similarity queries: their conceptual evaluation, transformations, and processing

The VLDB Journal — The International Journal on Very Large Data Bases

Abstract

The Boolean semantics of SQL queries cannot adequately capture the "fuzzy" preferences and "soft" criteria required in non-traditional data retrieval applications. One way to solve this problem is to add a flavor of "information retrieval" to database queries by allowing fuzzy query conditions and by flexibly supporting grouping and ranking of the query results within the DBMS engine. While ranking is already supported natively by all major commercial DBMSs, support for flexible grouping is still very limited (i.e., group-by). In this paper, we propose to generalize group-by to enable flexible grouping (specifically, clustering) of the query results. Unlike clustering in data mining applications, our focus is on supporting efficient clustering of Boolean results generated at query time. Moreover, we propose to integrate ranking and clustering with Boolean conditions, forming a new type of query, ClusterRank, that allows structured data retrieval. Such an integration is nontrivial in terms of both semantics and query processing. We investigate various semantics for this type of query. A straightforward way to process such queries is to simply glue together the techniques developed for ranking-only and clustering-only queries. This approach is costly because existing techniques treat both ranking and clustering as blocking post-processing tasks on the Boolean query results. We propose a summary-based evaluation method that uses bitmap indexes to seamlessly integrate Boolean conditions, clustering, and ranking. An experimental study shows that our approach significantly outperforms the straightforward one while maintaining high clustering quality.
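
To make the "straightforward approach" mentioned in the abstract concrete, the following is a minimal Python sketch of a filter-then-cluster-then-rank pipeline. It is not the paper's summary-based method and does not use bitmap indexes. The function names (`kmeans`, `cluster_rank`), the sample table `HOUSES`, its attributes, and the choice of a small deterministic k-means are all assumptions made for illustration. Each stage consumes the full output of the previous one, which is the blocking behavior the abstract describes as costly.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, float]


def _dist2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Sequence[float]], k: int, iters: int = 20) -> List[int]:
    """Tiny deterministic k-means: the first k points seed the centers."""
    if not points:
        return []
    k = min(k, len(points))
    centers = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        assign = [min(range(k), key=lambda c: _dist2(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def cluster_rank(rows: List[Row],
                 predicate: Callable[[Row], bool],
                 cluster_attrs: Sequence[str],
                 score: Callable[[Row], float],
                 k_clusters: int,
                 top_n: int) -> Dict[int, List[Row]]:
    """Hypothetical baseline: Boolean filter, then clustering, then ranking."""
    # 1. Boolean conditions (blocking: all qualifying rows are materialized).
    selected = [r for r in rows if predicate(r)]
    # 2. Clustering over the chosen attributes of the Boolean result.
    points = [[r[a] for a in cluster_attrs] for r in selected]
    labels = kmeans(points, k_clusters)
    # 3. Ranking inside each cluster, keeping the top_n rows by score.
    groups: Dict[int, List[Row]] = {}
    for row, label in zip(selected, labels):
        groups.setdefault(label, []).append(row)
    return {label: sorted(members, key=score, reverse=True)[:top_n]
            for label, members in groups.items()}


# Illustrative data only: houses with a location (x, y), a price, and a size.
HOUSES: List[Row] = [
    {"id": 1, "x": 0.0, "y": 0.0, "price": 300, "size": 90},
    {"id": 2, "x": 10.0, "y": 10.0, "price": 400, "size": 120},
    {"id": 3, "x": 0.5, "y": 0.2, "price": 450, "size": 110},
    {"id": 4, "x": 9.5, "y": 10.2, "price": 350, "size": 80},
    {"id": 5, "x": 0.1, "y": 0.4, "price": 900, "size": 200},
    {"id": 6, "x": 10.1, "y": 9.8, "price": 480, "size": 100},
]

result = cluster_rank(
    HOUSES,
    predicate=lambda r: r["price"] <= 500,   # Boolean condition
    cluster_attrs=("x", "y"),                # generalized group-by
    score=lambda r: r["size"],               # generalized order-by
    k_clusters=2,
    top_n=2,
)
```

In this sketch the clustering attributes play the role of a generalized group-by and the scoring function that of a generalized order-by; an integrated evaluation method such as the one the abstract proposes would aim to avoid materializing and post-processing the full Boolean result at each stage.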