Finding Essential Attributes from Binary Data

Authors:
Endre Boros;Takashi Horiyama;Toshihide Ibaraki;Kazuhisa Makino;Mutsunori Yagiura
Affiliations:
RUTCOR, Rutgers University, 640 Bartholomew Road, Piscataway, NJ 08854-8003, USA E-mail: boros@rutcor.rutgers.edu;Graduate School of Information Science, Nara Institute of Science and Technology, Nara 630-0101, Japan E-mail: horiyama@is.aist-nara.ac.jp;Department of Applied Mathematics and Physics, Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan E-mail: ibaraki@i.kyoto-u.ac.jp;Department of Systems and Human Science, Graduate School of Engineering Science, Osaka University, Toyonaka, Osaka 560-8531, Japan E-mail: makino@sys.es.osaka-u.ac.jp;Department of Applied Mathematics and Physics, Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan E-mail: yagiura@i.kyoto-u.ac.jp
Venue:
Annals of Mathematics and Artificial Intelligence
Year:
2003



Abstract

We consider data sets that consist of n-dimensional binary vectors representing positive and negative examples for some (possibly unknown) phenomenon. A subset S of the attributes (or variables) of such a data set is called a support set if the positive and negative examples can be distinguished by using only the attributes in S. In this paper we study the problem of finding small support sets, a task that arises frequently in fields such as knowledge discovery, data mining, learning theory, and logical analysis of data. We study the distribution of support sets in randomly generated data and discuss why finding small support sets is important. We propose several measures of separation (real-valued set functions over the subsets of attributes), formulate optimization models for finding the smallest subsets maximizing these measures, and devise efficient heuristic algorithms to solve these (typically NP-hard) optimization problems. We prove that several of the proposed heuristics have a guaranteed constant approximation ratio, and we report on computational experience comparing these heuristics with others from the literature on both randomly generated and real-world data sets.
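To make the definition of a support set concrete, the following is a minimal Python sketch. It is an illustration only and not the authors' implementation: the function names `is_support_set` and `greedy_support_set` are hypothetical, and the greedy rule shown (repeatedly pick the attribute that separates the most still-unseparated positive/negative pairs) is a generic set-cover-style heuristic, assumed here for illustration rather than taken from the separation measures proposed in the paper.

```python
from itertools import product


def is_support_set(positives, negatives, subset):
    """Return True if no positive and negative example agree on every
    attribute in `subset`, i.e. their projections onto `subset` are disjoint."""
    def project(vector):
        return tuple(vector[j] for j in subset)

    positive_projections = {project(v) for v in positives}
    return all(project(v) not in positive_projections for v in negatives)


def greedy_support_set(positives, negatives):
    """Generic set-cover-style greedy heuristic (an assumption for this sketch,
    not necessarily one of the paper's heuristics): repeatedly add the attribute
    that separates the most positive/negative pairs not yet separated."""
    n = len(positives[0])
    # Every (positive index, negative index) pair starts out unseparated.
    unseparated = set(product(range(len(positives)), range(len(negatives))))
    chosen = []

    while unseparated:
        def gain(j):
            # Number of unseparated pairs that differ in attribute j.
            return sum(1 for a, b in unseparated
                       if positives[a][j] != negatives[b][j])

        candidates = (j for j in range(n) if j not in chosen)
        best = max(candidates, key=gain, default=None)
        if best is None or gain(best) == 0:
            # Some positive and negative example are identical vectors.
            raise ValueError("no support set exists for this data")

        chosen.append(best)
        # Keep only the pairs that still agree on the newly chosen attribute.
        unseparated = {(a, b) for a, b in unseparated
                       if positives[a][best] == negatives[b][best]}

    return sorted(chosen)


# Toy data: attribute 0 alone distinguishes positives from negatives.
positives = [(1, 0, 1, 0), (1, 1, 0, 0)]
negatives = [(0, 0, 1, 0), (0, 1, 0, 1)]
print(is_support_set(positives, negatives, [0]))  # True
print(is_support_set(positives, negatives, [1]))  # False
print(greedy_support_set(positives, negatives))   # [0]
```

The greedy routine always returns a support set when one exists, but it offers no guarantee of returning a smallest one; that is consistent with the abstract's remark that the underlying optimization problems are typically NP-hard.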