A mixed integer linear model for clustering with variable selection

Authors:
Stefano Benati;Sergio García
Venue:
Computers and Operations Research
Year:
2014

Citing 7
Cited 0

Cluster analysis and mathematical programming
Mathematical Programming: Series A and B - Special issue: papers from ISMP97, the 16th International Symposium on Mathematical Programming, Lausanne EPFL

Heuristic Methods for Large Centroid Clustering Problems
Journal of Heuristics

Simultaneous Feature Selection and Clustering Using Mixture Models
IEEE Transactions on Pattern Analysis and Machine Intelligence

A scatter search approach for the minimum sum-of-squares clustering problem
Computers and Operations Research

Categorical data fuzzy clustering: An analysis of local search heuristics
Computers and Operations Research

The Academic Journal Ranking Problem: A Fuzzy-Clustering Approach
Journal of Classification

Solving Large p-Median Problems with a Radius Formulation
INFORMS Journal on Computing

Abstract

This paper introduces an extension of the p-median problem in which the distance between two units is calculated as the sum of the distances on the q most important variables out of a set of m. This model has applications in cluster analysis (for example, in sociological surveys), where analysts have a large list of variables available for inclusion, but only a subset of them (the true variables) is appropriate for uncovering the cluster structure. Researchers must therefore carefully separate the true variables from the others before computing data partitions. Here we show that this problem can be formulated as a mixed integer non-linear optimization model in which clustering and variable selection are done simultaneously. We then provide two different linearizations and compare their performance with the default method of clustering with all the variables (which is a p-median model) on a set of artificially generated binary data, showing that the model based on a radius formulation performs best.
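
To make the problem statement concrete, the following is a minimal brute-force sketch of the objective described in the abstract. It is not the mixed integer model or either of the linearizations from the paper. It assumes that the "q most important variables" are the subset that minimizes the total distance from each unit to its nearest median, and that the per-variable distance is an absolute difference. The function names and the toy binary data are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def restricted_distance(a, b, variables):
    """Sum of absolute differences over the selected variables only (assumed metric)."""
    return sum(abs(a[k] - b[k]) for k in variables)


def brute_force_q_variable_p_median(data, p, q):
    """Enumerate every choice of q variables and p medians.

    Returns (cost, variables, medians) with the smallest total distance
    from each unit to its nearest median. Exponential in size, so this is
    only usable on tiny instances.
    """
    n = len(data)
    m = len(data[0])
    best = None
    for variables in combinations(range(m), q):
        # Pairwise distances restricted to the chosen variables.
        dist = [
            [restricted_distance(data[i], data[j], variables) for j in range(n)]
            for i in range(n)
        ]
        for medians in combinations(range(n), p):
            # Each unit is assigned to its closest median.
            cost = sum(min(dist[i][j] for j in medians) for i in range(n))
            if best is None or cost < best[0]:
                best = (cost, variables, medians)
    return best


# Hypothetical toy data: variables 0 and 1 separate two clusters,
# variables 2 and 3 are noise.
data = [
    (0, 0, 0, 1),
    (0, 0, 1, 0),
    (0, 0, 1, 1),
    (1, 1, 0, 0),
    (1, 1, 0, 1),
    (1, 1, 1, 0),
]

cost, variables, medians = brute_force_q_variable_p_median(data, p=2, q=2)
print(cost, variables, medians)
```

On this toy instance the search selects variables 0 and 1, on which the two groups of units coincide exactly with their medians. Clustering on all four variables would instead let the noise variables contribute to every distance, which is the situation the abstract's comparison against the plain p-median model addresses.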