Assessing and improving the quality of the data stored in information systems are both important and difficult tasks. For a growing number of companies that rely on information as one of their most important assets, enforcing high data quality levels represents a strategic investment aimed at preserving the value of those assets. For a public administration or a government, good data quality translates into good service and good relationships with citizens. Achieving high quality standards, however, is a major undertaking because of the variety of ways in which errors can be introduced into a system and the difficulty of correcting them systematically. Data quality problems tend to fall into two categories. The first concerns inconsistency among systems, such as format, syntax, and semantic inconsistencies. The second concerns inconsistency with reality, as exemplified by missing, obsolete, and incorrect data values and by outliers.

In this paper, we describe a real-life case study on assessing and improving the quality of data in the Italian Public Administration. The domain of study is the taxpayers' data maintained by the Italian Ministry of Finances. In this context, we provide the Administration with a quantitative assessment of specific problems such as record duplication, address mismatch, and obsolescence; we suggest a set of guidelines for setting precise quality improvement goals; and we illustrate analysis techniques for achieving those goals. Our guidelines emphasize the importance of data flow analysis and of defining measurable quality indicators. The quality indicators we propose are generic and can describe a variety of data quality problems, thus offering a possible reference framework for practitioners. Finally, we investigate ways to partially automate the analysis of the causes of poor data quality.
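To make the idea of a measurable quality indicator concrete, the sketch below computes a simple record-duplication rate over a set of taxpayer-like records. This is not the paper's method: the record fields (`name`, `birth_date`) and the exact-match-after-normalization key are illustrative assumptions, and real merge/purge systems rely on far more robust approximate matching.

```python
from collections import Counter

def normalize(record):
    # Hypothetical matching key: trimmed, lowercased name plus birth date.
    # Real deduplication would use approximate (fuzzy) matching instead.
    return (record["name"].strip().lower(), record["birth_date"])

def duplication_rate(records):
    # Indicator: fraction of records that are surplus copies of another
    # record sharing the same normalized key.
    counts = Counter(normalize(r) for r in records)
    surplus = sum(c - 1 for c in counts.values())
    return surplus / len(records) if records else 0.0

records = [
    {"name": "Maria Rossi ", "birth_date": "1970-03-12"},
    {"name": "maria rossi", "birth_date": "1970-03-12"},  # duplicate after normalization
    {"name": "Luca Bianchi", "birth_date": "1982-11-05"},
]
print(duplication_rate(records))  # 1 surplus record out of 3
```

An indicator of this form is generic in the sense the abstract describes: the same "surplus records / total records" ratio applies whatever the matching key, so it can be tracked over time to set and verify a quantitative improvement goal.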