Class noise detection using frequent itemsets

Authors:
Jason Van Hulse;Taghi M. Khoshgoftaar
Affiliations:
Florida Atlantic University, Boca Raton, FL 33431, USA;Florida Atlantic University, Boca Raton, FL 33431, USA (Corresponding author. Tel.: +1 561 297 3994; Fax: +1 561 297 2800; E-mail: taghi@cse.fau.edu)
Venue:
Intelligent Data Analysis
Year:
2006

Abstract

A substantial number of noisy instances in a dataset may adversely affect the hypothesis learnt from that data. Removing noisy instances before constructing a classifier has been shown to improve the classification ability of a learner on new data. This paper introduces a novel technique that uses frequent itemsets to identify observations with class noise. Each instance in the dataset is assigned a NoiseFactor, which indicates the relative likelihood that the instance contains class noise. A frequent itemset is a set of instances that share common attribute values and that contains at least as many instances as a user-defined minimum support threshold. Consequently, the set of frequent itemsets contains information about the structure of, and the dependence between, the attributes. Each frequent itemset is assigned a class based on the proportion of its instances that belong to each class. Instances contained in itemsets that have a large proportion of instances from the other class are identified as noisy. The proposed technique is evaluated in numerous case studies using real-world software measurement datasets with either inherent or injected noise, and is compared with two well-known techniques for identifying class noise: the Classification Filter and the Ensemble Filter. The results demonstrate that the new algorithm is very effective at identifying instances with class noise.
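The abstract does not give the exact NoiseFactor formula, so the following Python sketch is only an illustration of the general idea, not the authors' algorithm. It assumes categorical attributes, enumerates candidate itemsets by brute force (not Apriori), and scores each instance by the average proportion of other-class instances across the frequent itemsets that cover it. The function names, the `max_size` cap, and the averaging rule are all assumptions made for this example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def frequent_itemsets(rows, min_support, max_size=2):
    """Map each frequent itemset to the indices of the rows it covers.

    An itemset is a frozenset of (attribute_index, value) pairs.
    Brute-force enumeration; max_size is a hypothetical cap for this sketch.
    """
    cover = defaultdict(list)
    for i, row in enumerate(rows):
        items = list(enumerate(row))
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                cover[frozenset(combo)].append(i)
    # Keep only itemsets that meet the minimum support threshold.
    return {s: idx for s, idx in cover.items() if len(idx) >= min_support}


def noise_factors(rows, labels, min_support, max_size=2):
    """Assumed scoring rule: mean fraction of covering-itemset members
    whose class differs from the instance's own class."""
    totals = [0.0] * len(rows)
    counts = [0] * len(rows)
    for idx in frequent_itemsets(rows, min_support, max_size).values():
        class_counts = Counter(labels[i] for i in idx)
        for i in idx:
            totals[i] += 1.0 - class_counts[labels[i]] / len(idx)
            counts[i] += 1
    # Instances covered by no frequent itemset get a score of 0.0.
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


# Toy data: the third instance shares attribute values with two "nfp"
# instances but is labelled "fp", so it should receive the highest score.
rows = [("low", "small"), ("low", "small"), ("low", "small"),
        ("high", "large"), ("high", "large"), ("high", "large")]
labels = ["nfp", "nfp", "fp", "fp", "fp", "fp"]
scores = noise_factors(rows, labels, min_support=2)
```

Ranking instances by this score and inspecting the top of the list mirrors the "relative likelihood" reading of the NoiseFactor; a threshold for actually removing instances would have to be chosen separately.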