Data abstractions for decision tree induction

Authors:
Yoshimitsu Kudoh;Makoto Haraguchi;Yoshiaki Okubo
Affiliations:
Division of Electronics and Information Engineering, Hokkaido University, N 13 W 8, Sapporo 060-8628, Japan;Division of Electronics and Information Engineering, Hokkaido University, N 13 W 8, Sapporo 060-8628, Japan;Division of Electronics and Information Engineering, Hokkaido University, N 13 W 8, Sapporo 060-8628, Japan
Venue:
Theoretical Computer Science
Year:
2003

Abstract

When descriptions of data values in a database are too concrete or too detailed, the computational complexity of discovering useful knowledge from the database generally increases. Furthermore, the discovered knowledge tends to become complicated. Data abstraction seems useful for resolving this kind of problem: after abstraction we obtain a smaller and more general database, from which we can quickly extract more abstract knowledge that is expected to be easier to understand. In general, however, several abstractions are possible, so we have to carefully select the one according to which the original database is generalized. An inadequate selection would reduce the accuracy of the extracted knowledge.

From this point of view, we propose in this paper a method of selecting an appropriate abstraction from the possible ones, assuming that our task is to construct a decision tree from a relational database. Suppose that, for each attribute in a relational database, we have a class of possible abstractions for the attribute values. As an appropriate abstraction for each attribute, we prefer one such that, even after the abstraction, the distribution of target classes necessary to perform our classification task is preserved within an acceptable error range given by the user.

With the selected abstractions, the original database can be transformed into a small generalized database written in abstract values. We can therefore expect to construct, from the generalized database, a decision tree much smaller than one constructed from the original database. Furthermore, such a size reduction can be justified under some theoretical assumptions. The appropriateness of an abstraction is precisely defined in terms of standard information theory. Therefore, we call our abstraction framework Information Theoretical Abstraction.

We show some experimental results obtained by a system, ITA, that implements our abstraction method. These results verify that our method is very effective in reducing the size of the constructed decision tree without substantially increasing classification errors.
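
The selection idea described in the abstract can be illustrated with a small sketch. The abstract does not state the exact criterion used by ITA, so the following is only an assumption-laden illustration: it measures how much an abstraction increases the conditional entropy of the target class given the attribute, and accepts an abstraction only if that increase stays within a user-given bound. The function names (`information_loss`, `select_abstraction`), the tie-breaking rule (prefer fewer abstract values), and the toy data are all hypothetical and not taken from the paper.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def conditional_entropy(values, labels):
    """H(class | attribute): expected class entropy after grouping by value."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return sum(len(g) / total * entropy(g) for g in groups.values())


def information_loss(values, labels, abstraction):
    """Increase in H(class | attribute) when each concrete value is
    replaced by its abstract value (assumed loss measure, not the paper's)."""
    abstracted = [abstraction[v] for v in values]
    return conditional_entropy(abstracted, labels) - conditional_entropy(values, labels)


def select_abstraction(values, labels, candidates, max_loss):
    """Among candidates whose loss is at most max_loss, return the name of
    the one with the fewest abstract values; return None if none qualifies."""
    acceptable = []
    for name, mapping in candidates.items():
        loss = information_loss(values, labels, mapping)
        if loss <= max_loss:
            acceptable.append((len(set(mapping.values())), loss, name))
    return min(acceptable)[2] if acceptable else None


# Hypothetical toy data: one attribute column and the target class column.
values = ["sparrow", "eagle", "salmon", "trout", "sparrow", "salmon"]
labels = ["flies", "flies", "swims", "swims", "flies", "swims"]

# Two hypothetical candidate abstractions for the same attribute.
candidates = {
    "by_class": {"sparrow": "bird", "eagle": "bird", "salmon": "fish", "trout": "fish"},
    "by_size": {"sparrow": "small", "eagle": "large", "salmon": "large", "trout": "small"},
}

chosen = select_abstraction(values, labels, candidates, max_loss=0.1)
print(chosen)
```

In this toy setting, grouping by biological class keeps each abstract value pure with respect to the target, whereas grouping by size mixes the classes and therefore exceeds the allowed loss. Once an abstraction is selected for each attribute, the column would be rewritten with the abstract values before a decision tree learner such as C4.5 is applied to the generalized table.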