Towards systematic design of distance functions for data mining applications

Authors: Charu C. Aggarwal
Affiliations: IBM T. J. Watson Research Center, Hawthorne, NY
Venue: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Year: 2003


Abstract

Distance function computation is a key subtask in many data mining algorithms and applications. The most effective form of a distance function can only be expressed in the context of a particular data domain, and finding that form is often a challenging and non-trivial task. For example, in the text domain, distance function design has been considered such an important and complex issue that it has been the focus of intensive research over three decades. The final design of distance functions in this domain was reached only through detailed empirical testing and consensus on the quality of results produced by the different variations. With the increasing ability to collect data in an automated way, the number of new kinds of data continues to grow rapidly, which makes it increasingly difficult to undertake such efforts for every new data type. The most important aspect of distance function design is that, since a human is the end user of any application, the design must satisfy the user's requirements with regard to effectiveness. This creates the need for a systematic framework for designing distance functions that are sensitive to the particular characteristics of the data domain. In this paper, we discuss such a framework. The goal is to create distance functions in an automated way while minimizing the work required from the user. We show that this framework creates distance functions that are significantly more effective than popularly used functions such as the Euclidean metric.
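
To make the comparison with the Euclidean metric concrete, here is a minimal sketch of how a domain-sensitive distance function can change which point counts as the nearest neighbor. It is not the framework proposed in the paper, which the abstract does not describe. The function `weighted_euclidean`, the weight values, and the sample points are all hypothetical and chosen purely for illustration.

```python
import math


def euclidean(x, y):
    """Standard Euclidean distance: every dimension counts equally."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def weighted_euclidean(x, y, weights):
    """Euclidean distance with a non-negative weight per dimension.

    Hypothetical illustration only: the weights stand in for whatever
    domain knowledge or user feedback says about each dimension.
    """
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))


query = [0.0, 0.0]
a = [2.0, 0.0]  # differs from the query only in dimension 0
b = [0.0, 3.0]  # differs from the query only in dimension 1

# Assumed weights: dimension 0 is treated as four times as important.
weights = [4.0, 1.0]

nearest_plain = min([a, b], key=lambda p: euclidean(query, p))
nearest_weighted = min([a, b], key=lambda p: weighted_euclidean(query, p, weights))

print("Euclidean nearest:", nearest_plain)
print("Weighted nearest: ", nearest_weighted)
```

Under the plain Euclidean metric, `a` is closer to the query than `b`. With the assumed weights, the ranking reverses, which is the kind of domain sensitivity the abstract argues a fixed metric cannot capture.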