The computing time for K nearest neighbor (KNN) search is linear in the size of the dataset if the dataset is not indexed, which is unacceptable for on-line applications with time constraints when the dataset is large. However, if the dataset contains categorical attributes, a conventional multidimensional index cannot be built on it. One way to index such a dataset is to convert the categorical attributes into numeric ones: the categories are ordered and then mapped to numeric values. In this paper, we propose a new heuristic ordering algorithm and compare it with two previously proposed algorithms that borrow their idea from minimum spanning trees. The new algorithm builds a binary tree divisively by recursively partitioning the set of categories; an in-order traversal of the tree then yields an ordering of the categories. After the mapping and indexing, we can efficiently retrieve a small portion of the dataset and perform KNN search on that portion at the cost of a small loss in accuracy. Experiments show that the divisive ordering algorithm outperforms the other two.
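The divisive ordering idea above can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it assumes a hypothetical pairwise distance between categories and a simple greedy splitting rule (seed each half with the most distant pair, then assign every other category to its nearer seed); the paper's actual partitioning criterion may differ.

```python
def divisive_order(categories, dist):
    """Order `categories` by recursive binary partitioning.

    `dist(a, b)` is an assumed symmetric distance between two categories.
    The recursion builds an implicit binary tree; concatenating the left
    and right results corresponds to an in-order traversal of that tree.
    """
    if len(categories) <= 1:
        return list(categories)
    # Seed the two halves with the most distant pair of categories.
    a, b = max(
        ((x, y) for i, x in enumerate(categories) for y in categories[i + 1:]),
        key=lambda p: dist(*p),
    )
    left, right = [a], [b]
    for c in categories:
        if c in (a, b):
            continue
        # Assign each remaining category to the nearer seed.
        (left if dist(c, a) <= dist(c, b) else right).append(c)
    return divisive_order(left, dist) + divisive_order(right, dist)

def order_to_codes(ordering):
    """Map each category to its position in the ordering (numeric code)."""
    return {c: i for i, c in enumerate(ordering)}

# Toy usage with a hand-made distance on color names (illustrative only).
colors = ["red", "crimson", "navy", "blue", "teal"]
d = {("red", "crimson"): 1, ("red", "navy"): 9, ("red", "blue"): 8,
     ("red", "teal"): 7, ("crimson", "navy"): 9, ("crimson", "blue"): 8,
     ("crimson", "teal"): 7, ("navy", "blue"): 1, ("navy", "teal"): 3,
     ("blue", "teal"): 2}
dist = lambda x, y: 0 if x == y else d.get((x, y), d.get((y, x)))

ordering = divisive_order(colors, dist)
codes = order_to_codes(ordering)
# Similar categories receive adjacent numeric codes, so a standard
# one-dimensional index over the codes can then support approximate KNN.
```

The key property is that similar categories end up close together in the ordering, so after mapping them to their positions, a standard numeric index can prune most of the dataset before the exact KNN scan.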