Mining extremely small data sets with application to software reuse

Authors:
Yuan Jiang;Ming Li;Zhi-Hua Zhou
Affiliations:
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China;National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China;National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China
Venue:
Software—Practice & Experience
Year:
2009

Abstract

A serious problem for machine learning and data mining techniques in software engineering is the lack of sufficient data. For example, the largest data set currently available on software reuse contains only 24 examples. In this paper, a recently proposed machine learning algorithm is modified for mining extremely small data sets. The algorithm works in a twice-learning style: first, a random forest is trained on the original data set; then, virtual examples are generated from the random forest and used to train a single decision tree. In contrast to the numerous discrepancies between empirical data and expert opinions reported by previous research, our mining practice shows that the empirical data are actually consistent with expert opinions. Copyright © 2008 John Wiley & Sons, Ltd.
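
The twice-learning idea can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation. The abstract does not say how virtual examples are drawn, so the sketch assumes uniform sampling within the observed range of each numeric feature, and it assumes the final tree is trained on the original examples together with the forest-labelled virtual ones. All function names (`build_tree`, `build_forest`, `twice_learning`, and so on) and the toy data are hypothetical; the toy data are made up and are not the software reuse data set.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def build_tree(X, y, rng, n_feats=None, depth=0, max_depth=5):
    """Grow a binary tree on numeric features.

    Internal nodes are (feature, threshold, left, right) tuples;
    leaves are bare class labels (assumed not to be tuples).
    """
    if depth >= max_depth or len(set(y)) == 1:
        return majority(y)
    feats = list(range(len(X[0])))
    if n_feats is not None:
        feats = rng.sample(feats, n_feats)  # random feature subset per split
    best = None
    for f in feats:
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between adjacent distinct values
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:  # the sampled features cannot split these rows
        return majority(y)
    _, f, t = best
    left = [(r, lab) for r, lab in zip(X, y) if r[f] <= t]
    right = [(r, lab) for r, lab in zip(X, y) if r[f] > t]
    return (
        f,
        t,
        build_tree([r for r, _ in left], [lab for _, lab in left],
                   rng, n_feats, depth + 1, max_depth),
        build_tree([r for r, _ in right], [lab for _, lab in right],
                   rng, n_feats, depth + 1, max_depth),
    )


def predict_tree(tree, row):
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def build_forest(X, y, rng, n_trees=25):
    """Bagged trees, each split restricted to a random feature subset."""
    n_feats = max(1, int(len(X[0]) ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]  # bootstrap sample
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx],
                                 rng, n_feats))
    return forest


def predict_forest(forest, row):
    return majority([predict_tree(t, row) for t in forest])


def twice_learning(X, y, n_virtual=200, seed=0):
    """First learn a forest, then learn one tree from forest-labelled data."""
    rng = random.Random(seed)
    forest = build_forest(X, y, rng)
    lo = [min(col) for col in zip(*X)]
    hi = [max(col) for col in zip(*X)]
    # Assumption: virtual examples are uniform within observed feature ranges.
    virtual = [[rng.uniform(a, b) for a, b in zip(lo, hi)]
               for _ in range(n_virtual)]
    virtual_y = [predict_forest(forest, v) for v in virtual]
    # Assumption: the single tree sees original plus virtual examples.
    return build_tree(X + virtual, y + virtual_y, rng)


# Made-up toy data: eight examples, two numeric features, two classes.
X = [[0.1, 0.9], [0.2, 0.3], [0.3, 0.6], [0.4, 0.1],
     [0.6, 0.8], [0.7, 0.2], [0.8, 0.5], [0.9, 0.4]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
tree = twice_learning(X, y)
print(predict_tree(tree, [0.05, 0.5]))
```

The point of the second stage is that the final model is a single tree, which is easier to read than a forest, while the virtual examples let that tree borrow the forest's smoother decision surface instead of relying on a handful of original examples alone.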