Making interval-based clustering rank-aware

Authors:
Julia Stoyanovich;Sihem Amer-Yahia;Tova Milo
Affiliations:
University of Pennsylvania, Philadelphia, PA;Yahoo! Research, New York, NY;Tel Aviv University, Tel Aviv, Israel
Venue:
Proceedings of the 14th International Conference on Extending Database Technology
Year:
2011

Abstract

In online applications such as online dating, users often query and rank large collections of structured items. Top results tend to be homogeneous, which hinders data exploration. For example, a dating website user who is looking for a partner between 20 and 40 years old, and who sorts the matches by income in descending order, will see a large number of matches in their late 30s who hold an MBA degree and work in the financial industry before seeing any matches from other age groups and walks of life. An alternative to presenting results in a ranked list is to find clusters in the result space, each identified by a combination of attributes that correlates with rank. Such clusters may describe matches between 35 and 40 with an MBA, matches between 25 and 30 who work in the software industry, and so on, which lets users explore the ranked results. We refer to the problem of finding such clusters as rank-aware interval-based clustering and argue that it is not addressed by standard clustering algorithms. We formally define the problem and, to solve it, propose a novel measure of locality, together with a family of clustering quality measures appropriate for this application scenario. These ingredients may be used by a variety of clustering algorithms, and we present BARAC, a particular subspace-clustering algorithm that enables rank-aware interval-based clustering in domains with heterogeneous attributes. We validate the effectiveness of our approach with a large-scale user study, and perform an extensive experimental evaluation of efficiency, demonstrating that our methods are practical at large scale. Our evaluation is performed on large datasets from Yahoo! Personals, a leading online dating site, and on restaurant data from Yahoo! Local.
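To make the abstract's dating example concrete, the following Python sketch groups a ranked result list by attribute intervals and orders the groups by how highly their members rank. It is only a toy illustration under stated assumptions and is not BARAC, nor the locality or quality measures proposed in the paper: the data, the fixed 5-year age buckets, the mean-rank score, and the names `ITEMS`, `age_interval`, and `rank_aware_groups` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: (name, age, industry, income).
# These rows are invented for illustration; they are not from the paper's datasets.
ITEMS = [
    ("a", 38, "finance", 200),
    ("b", 37, "finance", 190),
    ("c", 39, "finance", 180),
    ("d", 27, "software", 150),
    ("e", 28, "software", 140),
    ("f", 23, "education", 60),
]


def age_interval(age, width=5):
    """Map an age to a half-open interval [lo, lo + width).

    Fixed-width buckets are an assumption of this sketch.
    """
    lo = (age // width) * width
    return (lo, lo + width)


def rank_aware_groups(items, width=5):
    """Group items by (age interval, industry), ordered by mean rank position.

    Items are ranked by income in descending order (position 1 is the top).
    Mean rank position is a stand-in score chosen for this sketch only.
    """
    ranked = sorted(items, key=lambda item: item[3], reverse=True)

    positions = defaultdict(list)
    for position, (_name, age, industry, _income) in enumerate(ranked, start=1):
        positions[(age_interval(age, width), industry)].append(position)

    summary = [
        (key, sum(pos) / len(pos), len(pos)) for key, pos in positions.items()
    ]
    # Groups whose members sit near the top of the ranking come first.
    summary.sort(key=lambda row: row[1])
    return summary


if __name__ == "__main__":
    for (interval, industry), mean_rank, size in rank_aware_groups(ITEMS):
        print(f"age {interval}, {industry}: mean rank {mean_rank:.1f}, size {size}")
```

On this toy input the top of the ranked list consists entirely of the finance group aged 35 to 40, which is the homogeneity the abstract describes. Presenting one labeled group per line instead surfaces the software and education groups as well.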