Component Selection to Optimize Distance Function Learning in Complex Scientific Data Sets

Authors:
Aparna Varde; Stephen Bique; Elke Rundensteiner; David Brown; Jianyu Liang; Richard Sisson; Ehsan Sheybani; Brian Sayre
Affiliations:
Department of Math and Computer Science, Virginia State University, Petersburg;
Naval Research Laboratory, Washington;
Department of Computer Science, Worcester Polytechnic Institute, Worcester;
Department of Computer Science, Worcester Polytechnic Institute, Worcester and Department of Mechanical Engineering, Worcester Polytechnic Institute, Worcester;
Department of Mechanical Engineering, Worcester Polytechnic Institute, Worcester;
Department of Mechanical Engineering, Worcester Polytechnic Institute, Worcester and Center for Heat Treating Excellence, Metal Processing Institute, Worcester;
Department of Engineering and Technology, Virginia State University, Petersburg;
Department of Biology, Virginia State University, Petersburg
Venue:
DEXA '08: Proceedings of the 19th International Conference on Database and Expert Systems Applications
Year:
2008

Abstract

Analyzing complex scientific data such as graphs and images often requires comparing features: regions on graphs, visual aspects of images, and related metadata. Some features are more important than others. Similarity is typically defined as a distance between data objects, which can be expressed in terms of distances between their features. We refer to the distance based on each feature as a component. Weights of components, which represent the relative importance of features, can be learned using distance function learning algorithms. However, it is seldom known which components optimize learning under criteria such as accuracy, efficiency and simplicity. This is the problem we address. We propose and theoretically compare four component selection approaches: Maximal Path Traversal, Minimal Path Traversal, Maximal Path Traversal with Pruning, and Minimal Path Traversal with Pruning. Experimental evaluation is conducted using real data from Materials Science, Nanotechnology and Bioinformatics. A trademarked software tool is developed as a highlight of this work.
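
To make the notion of a component concrete, the following is a minimal sketch, assuming the overall distance is a weighted sum of per-feature component distances. The abstract does not state this form, and the feature names (`peak`, `curve`), the component functions, and the weight values are hypothetical, chosen only for illustration. The sketch does not implement any of the four selection approaches named above; it only shows what selecting a subset of components means for the resulting distance.

```python
from typing import Callable, Dict

# A component maps two data objects to a non-negative distance on one feature.
Component = Callable[[dict, dict], float]


def peak_distance(a: dict, b: dict) -> float:
    """Hypothetical component: absolute difference of a scalar feature."""
    return abs(a["peak"] - b["peak"])


def curve_distance(a: dict, b: dict) -> float:
    """Hypothetical component: Euclidean distance between sampled curve points."""
    return sum((x - y) ** 2 for x, y in zip(a["curve"], b["curve"])) ** 0.5


def weighted_distance(
    a: dict,
    b: dict,
    components: Dict[str, Component],
    weights: Dict[str, float],
) -> float:
    """Assumed form: sum of weight * component distance over the selected components."""
    return sum(weights[name] * dist(a, b) for name, dist in components.items())


# Illustrative objects and weights (not taken from the paper).
g1 = {"peak": 10.0, "curve": [0.0, 3.0]}
g2 = {"peak": 7.0, "curve": [4.0, 0.0]}
weights = {"peak": 2.0, "curve": 0.5}

all_components = {"peak": peak_distance, "curve": curve_distance}
selected = {"peak": peak_distance}  # a candidate subset of components

d_all = weighted_distance(g1, g2, all_components, weights)
d_selected = weighted_distance(g1, g2, selected, weights)
```

Under this assumption, component selection amounts to choosing which entries of `all_components` are passed to the learner, with the weights of the retained components then learned from data rather than fixed by hand as they are here.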