Efficient SQL-querying method for data mining in large data bases

Authors:
Nguyen Hung Son
Affiliations:
Institute of Mathematics, Warsaw University, Warsaw, Poland
Venue:
IJCAI'99 Proceedings of the 16th international joint conference on Artificial intelligence - Volume 2
Year:
1999

Abstract

Data mining can be understood as the process of extracting knowledge hidden in very large data sets. Data mining techniques (e.g. discretization or decision trees) are often based on searching for an optimal partition of the data with respect to some optimization criterion. In this paper, we investigate the problem of finding an optimal binary partition of a continuous attribute domain for large data sets stored in relational data bases (RDB). The critical factor for the time complexity of algorithms solving this problem is the number of simple SQL queries like SELECT COUNT FROM ... WHERE attribute BETWEEN ... (each related to some interval of attribute values) necessary to construct such partitions. We assume that the answer time for such queries does not depend on the interval length. Using the straightforward approach to optimal partition selection (with respect to a given measure), the number of necessary queries is of order O(N), where N is the number of preassumed partitions of the search space. We show some properties of the considered optimization measures that allow the size of the search space to be reduced. Moreover, we prove that using only O(log N) simple queries, one can construct a partition very close to optimal.
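
To make the query-counting argument concrete, the straightforward O(N) baseline mentioned in the abstract might be sketched as follows. This is an illustration only, not the paper's algorithm. The table `t`, its columns `a` (continuous attribute) and `d` (decision class), the helper names, and the use of `GROUP BY` to obtain per-class counts in one query are all assumptions. The measure used here, the number of object pairs from different classes separated by a cut, is just one plausible choice of optimization criterion.

```python
import sqlite3


def class_counts(conn, lo, hi):
    """One simple counting query: class histogram of rows with lo <= a <= hi.

    Table name `t` and columns `a`, `d` are hypothetical.
    """
    rows = conn.execute(
        "SELECT d, COUNT(*) FROM t WHERE a BETWEEN ? AND ? GROUP BY d",
        (lo, hi),
    ).fetchall()
    return dict(rows)


def discerned_pairs(left, right):
    """Number of (left, right) object pairs that belong to different classes."""
    n_left = sum(left.values())
    n_right = sum(right.values())
    same_class = sum(n * right.get(d, 0) for d, n in left.items())
    return n_left * n_right - same_class


def best_cut_naive(conn, cuts, lo, hi):
    """Evaluate every candidate cut; issues len(cuts) + 1 queries in total."""
    total = class_counts(conn, lo, hi)
    queries = 1
    best = None
    for c in cuts:
        left = class_counts(conn, lo, c)
        queries += 1
        # Right-hand counts follow from the totals, so no extra query is needed.
        right = {d: n - left.get(d, 0) for d, n in total.items()}
        score = discerned_pairs(left, right)
        if best is None or score > best[1]:
            best = (c, score)
    return best, queries


def demo():
    """Toy data: values 1..8, class 0 for a <= 4 and class 1 otherwise."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a REAL, d INTEGER)")
    conn.executemany(
        "INSERT INTO t VALUES (?, ?)",
        [(float(v), 0 if v <= 4 else 1) for v in range(1, 9)],
    )
    cuts = [v + 0.5 for v in range(1, 8)]  # 7 candidate cuts
    return best_cut_naive(conn, cuts, 1.0, 8.0)
```

On the toy table, `demo()` selects the cut 4.5, which separates 16 pairs, after 8 queries (7 candidate cuts plus one query for the totals). The query count grows linearly with the number of candidate cuts, which is the cost the abstract's O(log N) result aims to reduce.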