The computing time for K nearest neighbor (KNN) search is linear in the size of the dataset if the dataset is not indexed, which is unacceptable for on-line applications with time constraints when the dataset is large. However, if the dataset contains categorical attributes, a conventional multidimensional index cannot be built on it. One way to index such a dataset is to convert the categorical attributes into numeric ones: the categories are ordered and then mapped to numeric values. In this paper, we propose a new heuristic ordering algorithm and compare it with two previously proposed algorithms that borrow their idea from minimum spanning trees. The new algorithm builds a binary tree divisively by recursively partitioning the set of categories; an in-order traversal of the tree then yields an ordering of the categories. After the mapping and indexing, we can efficiently retrieve a small portion of the dataset and perform KNN search on that portion at the cost of a small loss in accuracy. Experiments show that the divisive ordering algorithm outperforms the other two.
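The divisive ordering idea above can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it assumes a hypothetical pairwise distance between categories and a simple greedy splitting rule (seed each half with the most distant pair, then assign every other category to its nearer seed); the paper's actual partitioning criterion may differ.

```python
def divisive_order(categories, dist):
    """Order `categories` by recursive binary partitioning.

    `dist(a, b)` is an assumed symmetric distance between two categories.
    The recursion builds an implicit binary tree; concatenating the left
    and right results corresponds to an in-order traversal of that tree.
    """
    if len(categories) <= 1:
        return list(categories)
    # Seed the two halves with the most distant pair of categories.
    a, b = max(
        ((x, y) for i, x in enumerate(categories) for y in categories[i + 1:]),
        key=lambda p: dist(*p),
    )
    left, right = [a], [b]
    for c in categories:
        if c in (a, b):
            continue
        # Assign each remaining category to the nearer seed.
        (left if dist(c, a) <= dist(c, b) else right).append(c)
    return divisive_order(left, dist) + divisive_order(right, dist)

def order_to_codes(ordering):
    """Map each category to its position in the ordering (numeric code)."""
    return {c: i for i, c in enumerate(ordering)}

# Toy usage with a hand-made distance on color names (illustrative only).
colors = ["red", "crimson", "navy", "blue", "teal"]
d = {("red", "crimson"): 1, ("red", "navy"): 9, ("red", "blue"): 8,
     ("red", "teal"): 7, ("crimson", "navy"): 9, ("crimson", "blue"): 8,
     ("crimson", "teal"): 7, ("navy", "blue"): 1, ("navy", "teal"): 3,
     ("blue", "teal"): 2}
dist = lambda x, y: 0 if x == y else d.get((x, y), d.get((y, x)))

ordering = divisive_order(colors, dist)
codes = order_to_codes(ordering)
# Similar categories receive adjacent numeric codes, so a standard
# one-dimensional index over the codes can then support approximate KNN.
```

The key property is that similar categories end up close together in the ordering, so after mapping them to their positions, a standard numeric index can prune most of the dataset before the exact KNN scan.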