Finding Essential Attributes from Binary Data

  • Authors and affiliations:
  • Endre Boros, RUTCOR, Rutgers University, 640 Bartholomew Road, Piscataway, NJ 08854-8003, USA. E-mail: boros@rutcor.rutgers.edu
  • Takashi Horiyama, Graduate School of Information Science, Nara Institute of Science and Technology, Nara 630-0101, Japan. E-mail: horiyama@is.aist-nara.ac.jp
  • Toshihide Ibaraki, Department of Applied Mathematics and Physics, Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan. E-mail: ibaraki@i.kyoto-u.ac.jp
  • Kazuhisa Makino, Department of Systems and Human Science, Graduate School of Engineering Science, Osaka University, Toyonaka, Osaka 560-8531, Japan. E-mail: makino@sys.es.osaka-u.ac.jp
  • Mutsunori Yagiura, Department of Applied Mathematics and Physics, Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan. E-mail: yagiura@i.kyoto-u.ac.jp

  • Venue:
  • Annals of Mathematics and Artificial Intelligence
  • Year:
  • 2003

Abstract

We consider data sets that consist of n-dimensional binary vectors representing positive and negative examples for some (possibly unknown) phenomenon. A subset S of the attributes (or variables) of such a data set is called a support set if the positive and negative examples can be distinguished by using only the attributes in S. In this paper we study the problem of finding small support sets, a task that arises frequently in fields such as knowledge discovery, data mining, learning theory, and logical analysis of data. We study the distribution of support sets in randomly generated data, and discuss why finding small support sets is important. We propose several measures of separation (real-valued set functions over the subsets of attributes), formulate optimization models for finding the smallest subsets maximizing these measures, and devise efficient heuristic algorithms to solve these (typically NP-hard) optimization problems. We prove that several of the proposed heuristics have a guaranteed constant approximation ratio, and we report on computational experience comparing these heuristics with some others from the literature on both randomly generated and real-world data sets.
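
To make the definition of a support set concrete, the following is a minimal sketch in Python. It is not taken from the paper: the function names `is_support_set` and `greedy_support_set`, the toy data, and the choice of a generic greedy set-cover style heuristic are all assumptions made for illustration, and the heuristic is not claimed to be one of the measures or algorithms the authors propose. The first function checks the definition directly; the second repeatedly picks the attribute that separates the most still-unseparated positive/negative pairs.

```python
def is_support_set(positives, negatives, subset):
    """Return True if no positive and negative example agree on every attribute in `subset`."""
    def project(vector):
        return tuple(vector[i] for i in subset)

    positive_projections = {project(v) for v in positives}
    return all(project(v) not in positive_projections for v in negatives)


def greedy_support_set(positives, negatives):
    """Illustrative greedy heuristic (an assumption, not the paper's method).

    Repeatedly chooses the attribute that distinguishes the largest number of
    still-unseparated (positive, negative) pairs. Returns a sorted list of
    attribute indices, or None if some positive example equals some negative one.
    """
    n = len(positives[0])
    unseparated = {
        (i, j) for i in range(len(positives)) for j in range(len(negatives))
    }
    chosen = []
    while unseparated:
        best, best_cover = None, set()
        for a in range(n):
            if a in chosen:
                continue
            cover = {
                (i, j) for (i, j) in unseparated
                if positives[i][a] != negatives[j][a]
            }
            if len(cover) > len(best_cover):
                best, best_cover = a, cover
        if best is None:
            return None  # no attribute separates the remaining pairs
        chosen.append(best)
        unseparated -= best_cover
    return sorted(chosen)


# Hypothetical toy data: 4 binary attributes, 2 positive and 2 negative examples.
positives = [(1, 0, 1, 0), (1, 1, 1, 0)]
negatives = [(0, 0, 1, 0), (0, 1, 0, 1)]

print(is_support_set(positives, negatives, [0]))
print(is_support_set(positives, negatives, [1]))
print(greedy_support_set(positives, negatives))
```

In this toy data, attribute 0 alone distinguishes every positive example from every negative one, whereas attribute 1 alone does not. A greedy choice of this kind gives no guarantee of finding a smallest support set in general.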