The Depth Problem: Identifying the Most Representative Units in a Data Group

Authors:
Itziar Irigoien; Francesc Mestres; Concepcion Arenas
Affiliations:
University of the Basque Country, Donostia; University of Barcelona, Barcelona; University of Barcelona, Barcelona
Venue:
IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Year:
2013

Abstract

This paper addresses the problem of identifying the units in a group or cluster that have the greatest degree of centrality and therefore best characterize that group. The problem arises frequently in the classification of data such as tumor types, gene expression profiles, or general biomedical data, and it is particularly important in the common situation where many units do not properly belong to any cluster. In gene expression data classification, moreover, reliably identifying the most central units in a cluster makes it possible to recognize the most important samples in a particular pathological process. We propose a new depth function that identifies these central units. Because the approach relies only on a measure of distance or dissimilarity between pairs of units, it can be applied to any kind of multivariate data (continuous, binary, or multiattribute). This makes it well suited to biomedical applications, which often involve noncontinuous data from clinical, pathological, or biological sources. We validate the approach on artificial examples and apply it to empirical data, and the results indicate that it performs well.
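
To make the idea of a dissimilarity-based centrality ranking concrete, the following is a minimal Python sketch. It is not the depth function proposed in the paper, which the abstract does not define. As a stand-in, it scores each unit by its mean squared dissimilarity to the other units in the group, so that lower scores mean more central units. The function names `hamming_dissimilarity` and `centrality_ranking`, the choice of a Hamming-type dissimilarity for binary data, and the toy matrix `X` are all assumptions made for illustration.

```python
import numpy as np


def hamming_dissimilarity(X):
    """Pairwise fraction of mismatching attributes between the rows of X."""
    X = np.asarray(X)
    # Compare every row with every other row, attribute by attribute.
    return (X[:, None, :] != X[None, :, :]).mean(axis=2)


def centrality_ranking(D):
    """Order units from most to least central, given a dissimilarity matrix D.

    Stand-in score (an assumption, not the paper's depth function):
    mean squared dissimilarity to the other units; lower is more central.
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    score = (D ** 2).sum(axis=1) / (n - 1)
    order = np.argsort(score, kind="stable")
    return order, score


# Hypothetical binary data: four similar units and one atypical unit (row 4).
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
])

D = hamming_dissimilarity(X)
order, score = centrality_ranking(D)
print("units ordered from most to least central:", order)
print("scores:", np.round(score, 4))
```

Because the ranking only needs the matrix `D`, the same `centrality_ranking` call would work with any other dissimilarity (for example one suited to continuous or mixed data), which mirrors the abstract's point that a distance-based approach is not tied to a particular data type.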