A critical issue in data mining and knowledge discovery is data quality, and the amount of noise in a dataset is often used as an indicator of that quality. Existing work has focused mostly on detecting class noise (mislabeling errors); very limited attention has been given to finding noisy attributes, or features. Prior work on noise handling has concentrated on detecting observations that contain noise in either the attributes or the class labels. Methodologies that give insight into the quality of an attribute can provide valuable knowledge to a domain expert during data analysis. We present a novel methodology for detecting noisy attributes. The procedure uses our recently proposed Pairwise Attribute Noise Detection Algorithm (PANDA) for detecting instances with attribute noise. From a data analyst's point of view, our approach provides a viable answer to the question: "Given a dataset, which attribute(s) contain the most noise?" The proposed methodology is investigated with multiple case studies of a real-world software measurement dataset. In the empirical study, simulated noise is injected into one or more attributes of a dataset that has no class noise. Based on a domain expert's inspection of the obtained results, the effectiveness of our technique for detecting noisy attributes is demonstrated.
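The noise-injection step of the empirical study can be illustrated with a minimal sketch. The abstract does not specify the noise model, so everything here is an assumption made for illustration: the function name `inject_attribute_noise`, its parameters, and the choice to replace a random fraction of one attribute's values with draws from a uniform distribution over that attribute's observed range. It is not PANDA and not the authors' procedure.

```python
import random


def inject_attribute_noise(rows, attr_index, fraction, seed=0):
    """Return (noisy_rows, noisy_indices) for one corrupted attribute.

    Hypothetical noise model (an assumption, not taken from the paper):
    a random `fraction` of the rows get the value at `attr_index`
    replaced by a uniform draw from that attribute's observed range.
    The input rows are left unmodified.
    """
    rng = random.Random(seed)

    # Observed range of the attribute that will be corrupted.
    values = [row[attr_index] for row in rows]
    low, high = min(values), max(values)

    # Choose which rows to corrupt, without replacement.
    n_noisy = round(fraction * len(rows))
    noisy_indices = set(rng.sample(range(len(rows)), n_noisy))

    noisy_rows = []
    for i, row in enumerate(rows):
        new_row = list(row)  # copy so the clean dataset is preserved
        if i in noisy_indices:
            new_row[attr_index] = rng.uniform(low, high)
        noisy_rows.append(new_row)
    return noisy_rows, noisy_indices


if __name__ == "__main__":
    # Toy dataset with two numeric attributes per instance.
    clean = [[float(i), float(2 * i)] for i in range(10)]
    noisy, corrupted = inject_attribute_noise(clean, attr_index=1, fraction=0.3)
    print(sorted(corrupted))
```

Because the corrupted row indices are returned alongside the data, any attribute-ranking method applied afterwards can be checked against the known location of the injected noise.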