Editorial: Large scale instance selection by means of federal instance selection

  • Authors: Aida de Haro-García, Nicolás García-Pedrajas, Juan Antonio Romero del Castillo

  • Venue: Data & Knowledge Engineering
  • Year: 2012

Abstract

Instance selection is becoming increasingly relevant because of the huge amount of data that is constantly being produced. Although current algorithms work well for fairly large datasets, they run into scaling problems when the number of instances reaches hundreds of thousands or millions. Most widely used instance selection algorithms have complexity of at least O(n^2), where n is the number of instances, so for very large problems scalability becomes an issue and most of the algorithms are not applicable. This paper presents a methodology for scaling up instance selection algorithms by means of a parallel procedure that performs instance selection on small subsets of the original dataset. The results obtained on these subsets are combined using a voting scheme. The method achieves very good performance in terms of testing error and storage reduction, while significantly decreasing execution time. The parallel algorithm also removes the constraint imposed by memory size, because the whole dataset never needs to be stored in memory. The usefulness of the method is shown by an extensive comparison using 35 medium and large datasets from the UCI Machine Learning Repository. Additionally, the method is applied to eight very large datasets with very good results and fast execution time.
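
The abstract describes the idea only at a high level: run an instance selection algorithm on small subsets and combine the outcomes by voting. The Python sketch below illustrates that general pattern and should not be read as the authors' implementation. The base selector (a Wilson-style 1-NN editing rule), the use of several random partitioning rounds, and the names and defaults `subset_size`, `rounds`, and `threshold` are all assumptions made for illustration.

```python
import numpy as np


def enn_keep_mask(X, y):
    """Stand-in base selector (assumed, not from the paper): keep an instance
    if its nearest neighbour inside the subset has the same label."""
    # Pairwise squared distances inside the subset: O(m^2) for subset size m.
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)  # an instance is not its own neighbour
    nearest = d.argmin(axis=1)
    return y[nearest] == y


def select_by_voting(X, y, subset_size=50, rounds=5, threshold=0.5, seed=0):
    """Return a boolean mask of instances kept after voting.

    Each round splits the data into disjoint random subsets, runs the base
    selector on each subset, and gives one "keep" vote to every survivor.
    The subsets are independent, so they could be processed in parallel.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    votes = np.zeros(n, dtype=int)
    for _ in range(rounds):
        perm = rng.permutation(n)
        for start in range(0, n, subset_size):
            idx = perm[start:start + subset_size]
            if len(idx) < 2:
                votes[idx] += 1  # too small to judge, so vote to keep
                continue
            keep = enn_keep_mask(X[idx], y[idx])
            votes[idx[keep]] += 1
    # Keep an instance if it survived in at least `threshold` of the rounds.
    return votes >= threshold * rounds
```

With subsets of fixed size m, each round costs roughly (n/m) · O(m^2) = O(n·m) instead of O(n^2), and only one subset has to be in memory at a time, which is the kind of saving the abstract refers to.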