Identifying noisy features with the Pairwise Attribute Noise Detection Algorithm

Authors:
Taghi M. Khoshgoftaar; Jason Van Hulse
Affiliations:
Florida Atlantic University, Boca Raton, FL, USA; Florida Atlantic University, Boca Raton, FL, USA
Venue:
Intelligent Data Analysis
Year:
2005

Abstract

A critical issue in data mining and knowledge discovery is the problem of data quality. Quantifying the presence of noise in a dataset is often used as an indicator of data quality. Existing work has mostly focused on detecting class noise, or mislabeling errors, and prior work on noise handling has concentrated on detecting observations that contain noise in either the attributes or the class labels; very limited attention has been given to finding noisy attributes or features. Methodologies that provide insight into the quality of an attribute can give valuable knowledge to a domain expert when data analysis is being performed. We present a novel methodology for detecting noisy attributes. The procedure utilizes our recently proposed Pairwise Attribute Noise Detection Algorithm (PANDA) for detecting instances with attribute noise. From a data analyst's point of view, our approach provides a viable solution to the question: "Given a dataset, which attribute(s) contain the most noise?" The proposed methodology is investigated with multiple case studies of a real-world software measurement dataset. The empirical study is conducted by injecting simulated noise into one or more attributes of a dataset that has no class noise. Based on a domain expert's inspection of the obtained results, the effectiveness of our technique for detecting noisy attributes is demonstrated.
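
The abstract does not say how the simulated noise was injected. As a rough illustration only, the sketch below shows one possible way to corrupt a single attribute of a clean dataset. The function name `inject_attribute_noise`, the uniform replacement scheme, the `fraction` parameter, and the toy data are all assumptions made for this example and are not taken from the paper.

```python
import random


def inject_attribute_noise(rows, attr_index, fraction, seed=0):
    """Return (noisy_rows, corrupted_indices).

    Hypothetical scheme: a given fraction of instances has the value of
    one attribute replaced by a value drawn uniformly from that
    attribute's observed range. All other attributes are left untouched.
    """
    rng = random.Random(seed)
    noisy = [list(r) for r in rows]          # copy so the input stays clean
    column = [r[attr_index] for r in rows]
    lo, hi = min(column), max(column)
    n_corrupt = round(fraction * len(rows))
    corrupted = sorted(rng.sample(range(len(rows)), n_corrupt))
    for i in corrupted:
        noisy[i][attr_index] = rng.uniform(lo, hi)
    return noisy, corrupted


# Toy stand-in for a clean dataset with two numeric attributes.
clean = [[float(i), float(2 * i)] for i in range(20)]
noisy, corrupted = inject_attribute_noise(clean, attr_index=1, fraction=0.25, seed=42)
print(len(corrupted), "instances corrupted in attribute 1")
```

Because the corrupted attribute and instances are known in advance, the output of any attribute-noise detector can then be compared against `corrupted` and the chosen `attr_index`.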