A Mixed Similarity Measure in Near-Linear Computational Complexity for Distance-Based Methods

Authors:
Ngoc Binh Nguyen;Tu Bao Ho
Affiliations:
-;-
Venue:
PKDD '00 Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery
Year:
2000

Citing 1
Cited 1

C4.5: programs for machine learning

Discovering missing links in large-scale linked data

ACIIDS'13 Proceedings of the 5th Asian conference on Intelligent Information and Database Systems - Volume Part II

Abstract

Many knowledge discovery and data mining methods, such as nearest-neighbor classification and clustering, are distance-based, so similarity measures between objects play an essential role. Real-world databases are often heterogeneous, with mixed numeric and symbolic attributes, yet most available similarity measures apply only to symbolic data or only to numeric data. In such cases, data mining methods often have to transform numeric data into symbolic data by discretization. Mixed similarity measures (MSMs) that need no discretization of numeric values are a desirable alternative for objects with mixed symbolic and numeric data. However, the time and space complexities of computing available MSMs are often so high that MSMs are not applicable to large datasets. For Goodall's MSM, which is inspired by biological taxonomy, computation methods have been proposed, but their time and space complexities have so far been at least O(n^2 log n^2) and O(n^2), respectively. In this work, we propose a new and efficient method for computing this MSM with O(n log n) time and O(n) space complexity. We demonstrate experimentally the applicability of the new method to large datasets and suggest meta-knowledge on the use of this MSM. In practice, the experimental results show that only the near-linear time and space MSM is applicable to mining large heterogeneous datasets.
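
The complexity gap described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not spell out. It assumes a simplified Goodall-style quantity for a single numeric attribute: the fraction of value pairs whose difference is no larger than that of the observed pair, so that a smaller fraction means a more similar pair. The function name `closer_pair_fraction` and the sample data are hypothetical. The point is only that sorting the n values once replaces enumerating and sorting all n^2 pairs.

```python
def closer_pair_fraction(values, d):
    """Fraction of unordered pairs (i < j) with |values[i] - values[j]| <= d.

    Hypothetical helper for illustration; not taken from the paper.
    Sorting costs O(n log n) time and O(n) space; the sweep below is O(n).
    """
    xs = sorted(values)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two values")
    count = 0
    left = 0
    for right in range(n):
        # Advance the left edge until xs[left..right] spans at most d.
        while xs[right] - xs[left] > d:
            left += 1
        # Every index in [left, right) forms a close-enough pair with right.
        count += right - left
    return count / (n * (n - 1) / 2)


def closer_pair_fraction_naive(values, d):
    """Same quantity by enumerating all pairs: O(n^2) time."""
    n = len(values)
    count = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if abs(values[i] - values[j]) <= d
    )
    return count / (n * (n - 1) / 2)


sample = [7, 1, 4, 2]
# Under the stated assumption, a similarity score could be 1 - fraction.
similarity = 1 - closer_pair_fraction(sample, abs(1 - 2))
```

The two functions return the same value on any input; only the first avoids materializing the quadratic set of pairs, which is the kind of saving the abstract attributes to the near-linear method.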