Nearly homogeneous multi-partitioning with a deterministic generator

Authors:
Michaël Aupetit
Affiliations:
CEA, DAM, DIF, F-91297 Arpajon, France
Venue:
Neurocomputing
Year:
2009

Abstract

The need for homogeneous partitions, in which all parts follow the same distribution, is ubiquitous in machine learning and in other scientific fields, especially when only a few partitions can be generated. In that case, validation sets need to be distributed in the same way as training sets to obtain good estimates of the models' complexities. Likewise, when standard data analysis tools cannot handle very large data sets, the analysis can be performed on a smaller subset, provided that it is homogeneous enough with the larger set to yield relevant results. However, pseudo-random generators may produce partitions whose parts have very different distributions, because the geometry of the data is ignored. In this work, we propose an algorithm that deterministically generates partitions whose parts are empirically more homogeneous on average than parts arising from pseudo-random partitions. The data to be partitioned are seriated using a nearest neighbor rule, and each datum is assigned to a part of the partition according to its rank in this seriation. We demonstrate the efficiency of this algorithm on toy and real data sets. Since the algorithm is deterministic, it also provides a way to make reproducible the machine learning experiments that are usually based on pseudo-random partitions.
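
The abstract only outlines the idea of seriating the data with a nearest neighbor rule and assigning parts by rank, so the following Python sketch is one plausible reading of it rather than the paper's actual algorithm. It assumes a greedy nearest-neighbor chain with Euclidean distance, a fixed starting index, and round-robin assignment of ranks to parts. The function names `nearest_neighbor_seriation` and `deterministic_partition` and the `start` parameter are hypothetical and chosen for illustration.

```python
import numpy as np


def nearest_neighbor_seriation(X, start=0):
    """Order the rows of X by a greedy nearest-neighbor chain.

    Assumption (not from the paper): the chain starts at index `start`
    and always moves to the closest not-yet-visited point (Euclidean).
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    visited = np.zeros(n, dtype=bool)
    order = np.empty(n, dtype=int)
    current = start
    for rank in range(n):
        order[rank] = current
        visited[current] = True
        if rank == n - 1:
            break
        # Distances from the current point to every point in the data set.
        dist = np.linalg.norm(X - X[current], axis=1)
        # Exclude points that are already part of the chain.
        dist[visited] = np.inf
        # argmin returns the first minimum, so ties break deterministically.
        current = int(np.argmin(dist))
    return order


def deterministic_partition(X, n_parts, start=0):
    """Assign each row of X to one of n_parts parts from its seriation rank.

    Assumption (not from the paper): ranks are dealt out round-robin,
    so consecutive points in the seriation land in different parts.
    """
    order = nearest_neighbor_seriation(X, start=start)
    labels = np.empty(len(order), dtype=int)
    labels[order] = np.arange(len(order)) % n_parts
    return labels


if __name__ == "__main__":
    # Two well-separated groups on a line; each part should sample both.
    X = np.array([[0.0], [10.0], [1.0], [11.0], [2.0], [12.0]])
    print(deterministic_partition(X, n_parts=2))
```

Because no random number generator is involved, repeated calls on the same data return the same labels, which is the reproducibility property the abstract highlights. This sketch costs O(n²) distance evaluations and makes no claim about the homogeneity results reported in the paper.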