We propose a new approach for estimating and replacing missing categorical data. The posterior probability that a missing attribute value belongs to each category is estimated using the simple (naive) Bayes method. We propose two alternative methods for replacing the missing value: the first substitutes the category with the highest estimated posterior probability; the second substitutes a category drawn at random with probability proportional to the estimated posterior distribution. We evaluate the approach on several data quality measures that are important for data warehousing and data mining. The experimental results demonstrate the effectiveness of the proposed approach.
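The two replacement methods described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`posterior`, `impute_max`, `impute_proportional`), the toy `records` data, and the Laplace smoothing constant `alpha` are all assumptions introduced here for illustration, and the abstract does not say how zero counts are handled.

```python
import random
from collections import Counter

def posterior(records, target, query, alpha=1.0):
    """Estimate P(target = c | observed values in query) with simple Bayes.

    records: complete records (list of dicts) used to estimate probabilities
    target:  name of the attribute whose value is missing
    query:   observed attribute values of the incomplete record
    alpha:   Laplace smoothing constant (an assumption, not from the source)
    """
    class_counts = Counter(r[target] for r in records)
    categories = sorted(class_counts)
    # Number of distinct values seen for each observed attribute (for smoothing).
    n_values = {a: len({r[a] for r in records}) for a in query}
    scores = {}
    for c in categories:
        # Smoothed prior P(c).
        score = (class_counts[c] + alpha) / (len(records) + alpha * len(categories))
        for attr, val in query.items():
            # Smoothed conditional P(attr = val | c), attributes assumed independent.
            match = sum(1 for r in records if r[target] == c and r[attr] == val)
            score *= (match + alpha) / (class_counts[c] + alpha * n_values[attr])
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

def impute_max(post):
    """Method 1: replace with the category of maximum estimated posterior."""
    return max(post, key=post.get)

def impute_proportional(post, rng=random):
    """Method 2: replace with a category drawn in proportion to the posterior."""
    cats = list(post)
    return rng.choices(cats, weights=[post[c] for c in cats], k=1)[0]

# Hypothetical toy data: "play" is the attribute to be imputed.
records = [
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no",  "play": "no"},
]
post = posterior(records, "play", {"outlook": "sunny", "windy": "no"})
```

With this toy data, `impute_max(post)` returns `"yes"`, while `impute_proportional(post)` returns `"yes"` most of the time and `"no"` occasionally. The first method always picks the single most likely category; the second reproduces the estimated distribution across many imputed records, which is the design difference the abstract describes.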