Assessing and improving the quality of the data stored in information systems are both important and difficult tasks. For a growing number of companies that rely on information as one of their most important assets, enforcing high data quality levels represents a strategic investment aimed at preserving the value of those assets. For a public administration or a government, good data quality translates into good service and good relationships with citizens. Achieving high quality standards, however, is a major undertaking because of the variety of ways in which errors can be introduced into a system and the difficulty of correcting them systematically. Data quality problems tend to fall into two categories. The first concerns inconsistency among systems, such as format, syntax, and semantic inconsistencies. The second concerns inconsistency with reality, as exemplified by missing, obsolete, and incorrect data values and by outliers.

In this paper, we describe a real-life case study on assessing and improving the quality of data in the Italian Public Administration. The domain of study is the taxpayers' data maintained by the Italian Ministry of Finances. In this context, we provide the Administration with a quantitative assessment of specific problems such as record duplication, address mismatch, and obsolescence; we suggest a set of guidelines for setting precise quality improvement goals; and we illustrate analysis techniques for achieving those goals. Our guidelines emphasize the importance of data flow analysis and of defining measurable quality indicators. The quality indicators we propose are generic and can describe a variety of data quality problems, thus offering a possible reference framework for practitioners. Finally, we investigate ways to partially automate the analysis of the causes of poor data quality.
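To make the idea of a measurable quality indicator concrete, the sketch below computes a simple record-duplication rate over a set of taxpayer-like records. This is not the paper's method: the record fields (`name`, `birth_date`) and the exact-match-after-normalization key are illustrative assumptions, and real merge/purge systems rely on far more robust approximate matching.

```python
from collections import Counter

def normalize(record):
    # Hypothetical matching key: trimmed, lowercased name plus birth date.
    # Real deduplication would use approximate (fuzzy) matching instead.
    return (record["name"].strip().lower(), record["birth_date"])

def duplication_rate(records):
    # Indicator: fraction of records that are surplus copies of another
    # record sharing the same normalized key.
    counts = Counter(normalize(r) for r in records)
    surplus = sum(c - 1 for c in counts.values())
    return surplus / len(records) if records else 0.0

records = [
    {"name": "Maria Rossi ", "birth_date": "1970-03-12"},
    {"name": "maria rossi", "birth_date": "1970-03-12"},  # duplicate after normalization
    {"name": "Luca Bianchi", "birth_date": "1982-11-05"},
]
print(duplication_rate(records))  # 1 surplus record out of 3
```

An indicator of this form is generic in the sense the abstract describes: the same "surplus records / total records" ratio applies whatever the matching key, so it can be tracked over time to set and verify a quantitative improvement goal.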