Automatic accuracy assessment via hashing in multiple-source environment

Authors:
Jingyu Han;Dawei Jiang;Lingjuan Li
Affiliations:
School of Computing, P.O. Box 139, Nanjing University of Posts and Telecommunications, 210003 Nanjing, China;School of Computing, National University of Singapore, Singapore 119077, Singapore;School of Computing, P.O. Box 139, Nanjing University of Posts and Telecommunications, 210003 Nanjing, China
Venue:
Expert Systems with Applications: An International Journal
Year:
2010

Abstract

Accuracy is one of the most important data quality dimensions, and its assessment is a key issue in data management. Most current studies analyze the accuracy dimension qualitatively, and that analysis depends heavily on expert knowledge. Little work has addressed how to quantify accuracy automatically. Based on the Jensen-Shannon divergence (JSD) measure, we propose that the accuracy of data can be quantified automatically by comparing the data with the closest approximation of its entity in the available context. To identify the closest approximation quickly in large-scale data sources, locality-sensitive hashing (LSH) is employed to extract it at multiple levels, namely the column, record, and field levels. Our approach not only gives each data source an objective accuracy score very quickly, as long as a context member is available, but also avoids laborious human interaction. As an automatic accuracy assessment solution for multiple-source environments, our approach stands out, especially for large-scale data sources. Theory and experiments show that our approach performs well in obtaining metadata on the accuracy dimension.
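
The comparison step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how values are turned into distributions or how the divergence becomes a score, so the character-bigram representation, the function names `distribution`, `jsd`, and `accuracy_score`, and the mapping `1 - JSD` are all assumptions made for illustration. The LSH lookup that would supply the reference value is omitted; here the reference is passed in directly.

```python
import math
from collections import Counter


def distribution(text, n=2):
    """Relative frequencies of character n-grams (an assumed representation)."""
    grams = [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]
    counts = Counter(grams)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def jsd(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the result lies in [0, 1]."""
    keys = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2 over the union of both supports.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mixture(a):
        # KL(A || M); terms with zero probability contribute nothing.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


def accuracy_score(value, reference):
    """Hypothetical score: 1 minus the divergence from the reference value."""
    return 1.0 - jsd(distribution(value), distribution(reference))


if __name__ == "__main__":
    # A misspelled field value compared with its assumed closest approximation.
    print(accuracy_score("Nanjnig", "Nanjing"))
    print(accuracy_score("Nanjing", "Nanjing"))
```

Under these assumptions, a value identical to its reference scores 1, and the score falls toward 0 as the two n-gram distributions share less of their support.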