Mapping high-dimensional data onto a relative distance plane: an exact method for visualizing and characterizing high-dimensional patterns

  • Authors:
  • R. L. Somorjai; B. Dolenko; A. Demko; M. Mandelzweig; A. E. Nikulin; R. Baumgartner; N. J. Pizzi

  • Affiliations:
  • Institute for Biodiagnostics, National Research Council Canada, 435 Ellice Avenue, Winnipeg MB, Canada R3B 1Y6

  • Venue:
  • Journal of Biomedical Informatics
  • Year:
  • 2004

Visualization

Abstract

We introduce a distance (similarity)-based mapping for the visualization of high-dimensional patterns and their relative relationships. The mapping preserves exactly the original distances between points with respect to any two reference patterns in a special two-dimensional coordinate system, the relative distance plane (RDP). As only a single calculation of a distance matrix is required, this method is computationally efficient, an essential requirement for any exploratory data analysis. The data visualization afforded by this representation permits a rapid assessment of class pattern distributions. In particular, we can determine with a simple statistical test whether both training and validation sets of a 2-class, high-dimensional dataset derive from the same class distributions. We can explore any dataset in detail by identifying the subset of reference pairs whose members belong to different classes, cycling through this subset, and, for each pair, mapping the remaining patterns. These multiple viewpoints facilitate the identification and confirmation of outliers. We demonstrate the effectiveness of this method on several complex biomedical datasets. Because of its efficiency, effectiveness, and versatility, one may use the RDP representation as an initial data-mining exploration that precedes classification. Once final enhancements to the RDP mapping software are completed, we plan to make it freely available to researchers.
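As an illustration of the idea in the abstract, the following is a minimal sketch, not the authors' software. It assumes Euclidean distance and one particular coordinate convention (first reference at the origin, second reference on the x-axis, all points in the upper half-plane); the abstract does not specify either choice. The function names `distance_matrix`, `rdp_map`, and `mixed_class_pairs` are hypothetical. The distance matrix is computed once and reused for every reference pair.

```python
import numpy as np


def distance_matrix(X):
    """Pairwise Euclidean distances between the rows of X (assumed metric)."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    return np.linalg.norm(diff, axis=2)


def rdp_map(dist, i, j):
    """Map every pattern to 2-D using reference patterns i and j.

    Assumed convention: reference i sits at (0, 0), reference j at (D, 0),
    where D = dist[i, j]. Each pattern's distances to the two references
    are preserved; distances between non-reference patterns are not.
    """
    d1 = dist[:, i]  # distances to the first reference
    d2 = dist[:, j]  # distances to the second reference
    D = dist[i, j]
    if D == 0:
        raise ValueError("reference patterns must be distinct")
    # Triangle with side lengths d1, d2, D: solve for the apex coordinates.
    x = (d1**2 - d2**2 + D**2) / (2.0 * D)
    # Clip guards against tiny negative values caused by rounding.
    y = np.sqrt(np.clip(d1**2 - x**2, 0.0, None))
    return np.column_stack([x, y])


def mixed_class_pairs(labels):
    """Index pairs (i, j), i < j, whose members belong to different classes."""
    labels = np.asarray(labels)
    n = len(labels)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if labels[i] != labels[j]]
```

A possible usage pattern, under the same assumptions: compute `dist = distance_matrix(X)` once, then loop over `mixed_class_pairs(labels)` and call `rdp_map(dist, i, j)` for each pair to obtain one two-dimensional view per reference pair.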