Designing semantics-preserving cluster representatives for scientific input conditions

Authors:
Aparna S. Varde;Elke A. Rundensteiner;Carolina Ruiz;David C. Brown;Mohammed Maniruzzaman;Richard D. Sisson
Affiliations:
Worcester Polytechnic Institute, Worcester, MA;Worcester Polytechnic Institute, Worcester, MA;Worcester Polytechnic Institute, Worcester, MA;Worcester Polytechnic Institute, Worcester, MA;Worcester Polytechnic Institute, Worcester, MA;Worcester Polytechnic Institute, Worcester, MA
Venue:
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Year:
2006

Citing (7):
Data mining: concepts and techniques
Seeing the whole in parts: text summarization for web browsing on handheld devices. Proceedings of the 10th international conference on World Wide Web
A new approach to unsupervised text summarization. Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Induction of Decision Trees. Machine Learning
Clustering Association Rules. ICDE '97 Proceedings of the Thirteenth International Conference on Data Engineering
Adaptive duplicate detection using learnable string similarity measures. Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
An objective evaluation criterion for clustering. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining

Cited by (2):
AutoDomainMine: a graphical data mining system for process optimization. Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Semantically-grounded construction of centroids for datasets with textual attributes. Knowledge-Based Systems

Abstract

In scientific domains, knowledge is often discovered from experiments by grouping or clustering them based on the similarity of their output. The causes of similarity are then analyzed based on the input conditions characterizing a given type of output, i.e., a given cluster. This analysis helps in applications such as decision support in industry, where cluster representatives serve as at-a-glance depictions of the clusters. Randomly selecting one set of conditions in a cluster as its representative is not sufficient, since distinct combinations of inputs can lead to the same cluster. In this paper, an approach called DesCond is proposed to design semantics-preserving cluster representatives for scientific input conditions. We define a notion of distance for conditions that captures semantics based on the types of their attributes and their relative importance. Using this distance, methods of building candidate cluster representatives with different levels of detail are proposed. Candidates are compared using the DesCond Encoding, also proposed in this paper, which assesses their complexity and information loss given user interests. The candidate with the lowest encoding for each cluster is returned as its designed representative. DesCond is evaluated with real data from Materials Science. Evaluation with domain expert interviews and formal user surveys shows that designed representatives consistently outperform randomly selected ones and that different candidates suit different users.
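
The pipeline described in the abstract (a typed, weighted distance over conditions, candidate representatives at different levels of detail, and an encoding that trades complexity against information loss) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the DesCond method itself: the abstract does not give the distance formula or the encoding, so the attribute names, weights, ranges, the medoid-based candidate, and the weighted-sum `encoding` function are all hypothetical.

```python
# Hypothetical attribute schema: name -> (type, relative importance).
# Attribute names, types, and weights are illustrative assumptions only.
SCHEMA = {
    "quenchant": ("categorical", 0.5),
    "agitation": ("ordinal", 0.2),
    "temperature": ("numeric", 0.3),
}
ORDINAL_LEVELS = {"agitation": ["absent", "low", "high"]}  # assumed ordering
NUMERIC_RANGES = {"temperature": (0.0, 200.0)}             # assumed range


def attribute_distance(name, a, b):
    """Distance in [0, 1] between two values, chosen by attribute type."""
    kind, _ = SCHEMA[name]
    if kind == "categorical":
        return 0.0 if a == b else 1.0
    if kind == "ordinal":
        levels = ORDINAL_LEVELS[name]
        return abs(levels.index(a) - levels.index(b)) / (len(levels) - 1)
    lo, hi = NUMERIC_RANGES[name]
    return abs(a - b) / (hi - lo)


def condition_distance(x, y):
    """Importance-weighted sum of per-attribute distances (an assumed form)."""
    return sum(w * attribute_distance(n, x[n], y[n]) for n, (_, w) in SCHEMA.items())


def medoid(cluster):
    """Condition with the smallest total distance to all others in the cluster."""
    return min(cluster, key=lambda c: sum(condition_distance(c, o) for o in cluster))


def information_loss(representative, cluster):
    """Mean distance from each condition to its nearest representative member."""
    nearest = (min(condition_distance(r, c) for r in representative) for c in cluster)
    return sum(nearest) / len(cluster)


def encoding(representative, cluster, w_complexity=0.5, w_loss=0.5):
    """Assumed stand-in for an encoding: weighted complexity plus information loss.
    The two weights play the role of user interests."""
    complexity = len(representative) / len(cluster)
    return w_complexity * complexity + w_loss * information_loss(representative, cluster)


def design_representative(cluster, w_complexity=0.5, w_loss=0.5):
    """Build two candidates of different detail and return the lowest-encoding one."""
    candidates = {
        "single": [medoid(cluster)],  # least detail: one condition
        "all": list(cluster),         # most detail: every condition
    }
    return min(
        candidates.items(),
        key=lambda kv: encoding(kv[1], cluster, w_complexity, w_loss),
    )


# Made-up example cluster of input conditions (not data from the paper).
cluster = [
    {"quenchant": "oil", "agitation": "low", "temperature": 60.0},
    {"quenchant": "oil", "agitation": "high", "temperature": 70.0},
    {"quenchant": "water", "agitation": "low", "temperature": 20.0},
]
name, representative = design_representative(cluster)
print(name, representative)
```

In this sketch, shifting the weights toward information loss (for example `design_representative(cluster, 0.1, 0.9)`) makes the more detailed candidate win, which mirrors the abstract's observation that different candidates suit different users.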