Relationship-based clustering and cluster ensembles for high-dimensional data mining

Authors:
Alexander Strehl;Joydeep Ghosh
Affiliations:
-;-
Venue:
Relationship-based clustering and cluster ensembles for high-dimensional data mining
Year:
2002



Abstract

This dissertation takes a relationship-based approach to cluster analysis of high-dimensional data (1000 or more dimensions) that side-steps the ‘curse of dimensionality’ by working in a suitable similarity space instead of the original feature space. We propose two frameworks that leverage graph algorithms to achieve relationship-based clustering and visualization, respectively. In the visualization framework, the output of the clustering algorithm is used to reorder the data points so that the resulting permuted similarity matrix can be readily visualized in two dimensions, with clusters showing up as bands. Results on retail transaction, document (bag-of-words), and web-log data show that our approach can yield superior results while also taking additional balance constraints into account.

The choice of similarity is a critical step in relationship-based clustering, and this motivates our systematic comparative study of the impact of similarity measures on the quality of document clusters. The key findings of our experimental study are: (i) cosine, correlation, and extended Jaccard similarities perform comparably; (ii) Euclidean distances do not work well; (iii) graph partitioning tends to be superior to k-means and SOMs, especially when balanced clusters are desired; and (iv) performance curves generally do not cross. We also propose a cluster quality evaluation measure based on normalized mutual information and find an analytical relation between similarity measures.

It is widely recognized that combining multiple classification or regression models typically provides superior results compared to using a single, well-tuned model. However, there are no well-known approaches to combining multiple clusterings. The idea of combining cluster labelings without accessing the original features leads to a general knowledge reuse framework that we call cluster ensembles. We propose a formal definition of the cluster ensemble as an optimization problem.
Taking a relationship-based approach, we propose three effective and efficient combining algorithms, based on a hypergraph model, that solve this problem heuristically. Results on synthetic as well as real data sets show that cluster ensembles can (i) improve quality and robustness, (ii) enable distributed clustering, and (iii) speed up processing significantly with little loss in quality.
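
Two ideas from the abstract can be illustrated with a short sketch: comparing cluster labelings through normalized mutual information, and combining labelings without touching the original features. The code below is only an illustration under stated assumptions, not the dissertation's implementation. It assumes mutual information is normalized by the geometric mean of the two entropies, and it uses a pairwise co-association matrix as one simple relationship-based summary of an ensemble. The function names (`entropy`, `mutual_information`, `nmi`, `co_association`) and the toy labelings are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a cluster labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two labelings of the same objects."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """Mutual information normalized by the geometric mean of the entropies.

    Assumption: this normalization is chosen for illustration.
    """
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 or h_b == 0.0:
        # Assumption: a single-cluster labeling shares no information.
        return 0.0
    return mutual_information(a, b) / math.sqrt(h_a * h_b)


def co_association(labelings):
    """Fraction of labelings that put objects i and j in the same cluster.

    Only the label vectors are used, never the original features.
    """
    n = len(labelings[0])
    r = len(labelings)
    return [
        [sum(lab[i] == lab[j] for lab in labelings) / r for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy ensemble: three labelings of six objects.
labelings = [
    [0, 0, 1, 1, 2, 2],
    [1, 1, 0, 0, 2, 2],  # same grouping as the first, different label names
    [0, 0, 0, 1, 1, 1],  # a coarser, partly disagreeing grouping
]
similarity = co_association(labelings)
print(nmi(labelings[0], labelings[1]))
print(nmi(labelings[0], labelings[2]))
print(similarity[2][3])
```

The first two labelings differ only in label names, so their score is close to 1, while the third agrees only partially and scores lower. The co-association matrix is itself a similarity matrix over the objects, so any similarity-based clustering method could in principle be applied to it to obtain a combined labeling.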