Enhancing Clustering Quality through Landmark-Based Dimensionality Reduction

Authors:
Panagis Magdalinos;Christos Doulkeridis;Michalis Vazirgiannis
Affiliations:
Athens University of Economics and Business;Norwegian University of Science and Technology;Athens University of Economics and Business
Venue:
ACM Transactions on Knowledge Discovery from Data (TKDD)
Year:
2011

Abstract

Scaling up data mining algorithms for data of both high dimensionality and high cardinality has recently been recognized as one of the most challenging problems in data mining research. The reason is that typical data mining tasks, such as clustering, cannot produce high-quality results when applied to high-dimensional and/or large (in terms of cardinality) datasets. Data preprocessing, and dimensionality reduction in particular, is a promising tool for dealing with this problem. However, most existing dimensionality reduction algorithms share the same disadvantages as data mining algorithms when applied to large datasets of high dimensionality. In this article, we propose a fast and efficient dimensionality reduction algorithm (FEDRA), which is particularly scalable and therefore suitable for challenging datasets. FEDRA follows the landmark-based paradigm for embedding data objects in a low-dimensional projection space. By means of a theoretical analysis, we prove that FEDRA is efficient, and we demonstrate the quality of its results through experiments on datasets of higher cardinality and dimensionality than those employed in the evaluation of competing algorithms. The results show that FEDRA retains or improves clustering quality while projecting to less than 10% of the initial dimensionality. Moreover, our algorithm produces embeddings that enable faster convergence of clustering algorithms. FEDRA therefore emerges as a powerful and generic tool for data preprocessing, which can be integrated into other data mining algorithms to enhance their performance.
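
The abstract does not spell out FEDRA's projection step, so the following is only a minimal sketch of the general landmark-based idea, not the algorithm from the article. It assumes a simple distance-to-landmark embedding in which each object is represented by its Euclidean distances to k landmarks. The function name `landmark_embed`, the uniform random choice of landmarks, and the toy data are all hypothetical choices made for illustration.

```python
import math
import random


def landmark_embed(points, k, seed=0):
    """Map each point to its vector of Euclidean distances to k landmarks.

    Generic landmark (distance-based) embedding for illustration only;
    this is NOT the FEDRA projection described in the article.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    # Assumption: landmarks are sampled uniformly at random from the data.
    landmarks = rng.sample(points, k)
    # Each output coordinate is the distance to one landmark.
    return [[math.dist(p, lm) for lm in landmarks] for p in points]


# Toy data: 6 points in 50 dimensions, reduced to k = 4 coordinates
# (8% of the initial dimensionality).
data_rng = random.Random(42)
data = [[data_rng.random() for _ in range(50)] for _ in range(6)]
embedded = landmark_embed(data, k=4)
print(len(embedded), len(embedded[0]))
```

In a sketch like this, each embedded coordinate changes by at most the original distance between two points (by the triangle inequality), which is one reason landmark-style embeddings are often used as a cheap preprocessing step before clustering.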