Feature selection in enterprise analytics: a demonstration using an R-based data analytics system

  • Authors:
  • Pradap Konda;Arun Kumar;Christopher Ré;Vaishnavi Sashikanth

  • Affiliations:
  • Department of Computer Sciences, University of Wisconsin-Madison;Department of Computer Sciences, University of Wisconsin-Madison;Stanford University;Advanced Analytics, Oracle

  • Venue:
  • Proceedings of the VLDB Endowment
  • Year:
  • 2013

Abstract

Enterprise applications are analyzing ever larger amounts of data using advanced analytics techniques. Recent systems from Oracle, IBM, and SAP integrate R with a data processing system to support richer advanced analytics on large data. A key step in advanced analytics applications is feature selection, which is often an iterative process that involves statistical algorithms and data manipulations. From our conversations with data scientists and analysts in enterprise settings, we observe three key aspects of feature selection. First, feature selection is performed by many types of users, not just data scientists. Second, high performance is critical when running feature selection processes on large data. Third, the provenance of the results and steps in feature selection processes needs to be tracked for transparency and auditability. Based on our discussions with data scientists and the literature on feature selection practice, we organize a set of operations for feature selection into the Columbus framework. We prototype Columbus as a library usable in the Oracle R Enterprise environment. In this demonstration, we use Columbus to showcase how one system can support various types of users of feature selection. We then show how we optimize performance and manage the provenance of feature selection processes.