Data cleaning in Microsoft SQL Server 2005

Authors:
Surajit Chaudhuri;Kris Ganjam;Venky Ganti;Rahul Kapoor;Vivek Narasayya;Theo Vassilakis
Affiliations:
Microsoft Research, Redmond, WA;Microsoft Research, Redmond, WA;Microsoft Research, Redmond, WA;Microsoft Research, Redmond, WA;Microsoft Research, Redmond, WA;Microsoft Research, Redmond, WA
Venue:
Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Year:
2005

Citing 2
Cited 8

The merge/purge problem for large databases
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data

Robust and efficient fuzzy match for online data cleaning
Proceedings of the 2003 ACM SIGMOD international conference on Management of data

Domain-independent data cleaning via analysis of entity-relationship graph
ACM Transactions on Database Systems (TODS)

A deferred cleansing method for RFID data analytics
VLDB '06 Proceedings of the 32nd international conference on Very large data bases

Adaptive graphical approach to entity resolution
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries

Incorporating string transformations in record matching
Proceedings of the 2008 ACM SIGMOD international conference on Management of data

Data fusion
ACM Computing Surveys (CSUR)

Declarative XML data cleaning with XClean
CAiSE'07 Proceedings of the 19th international conference on Advanced information systems engineering

Self-tuning in graph-based reference disambiguation
DASFAA'07 Proceedings of the 12th international conference on Database systems for advanced applications

Adaptive Connection Strength Models for Relationship-Based Entity Resolution
Journal of Data and Information Quality (JDIQ) - Special Issue on Entity Resolution

Abstract

When collecting and combining data from various sources into a data warehouse, ensuring high data quality and consistency becomes a significant, often expensive, challenge. Common data quality problems include inconsistent data conventions among sources, such as different abbreviations or synonyms; data entry errors, such as spelling mistakes; and missing, incomplete, outdated, or otherwise incorrect attribute values. These data defects generally manifest themselves as foreign-key mismatches and approximately duplicate records, both of which make further data mining and decision support analyses either impossible or suspect. We demonstrate two new data cleansing operators, Fuzzy Lookup and Fuzzy Grouping, which address these problems in a scalable and domain-independent manner. These operators are implemented within Microsoft SQL Server 2005 Integration Services. Our demo will explain their functionality and highlight multiple real-world scenarios in which they can be used to achieve high data quality.
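
The two problems named in the abstract, foreign-key mismatches and approximately duplicate records, can be illustrated with a small Python sketch. This is not the implementation used in SQL Server Integration Services: the similarity measure here is the standard library's `difflib.SequenceMatcher`, swapped in purely for illustration, and the function names `fuzzy_lookup` and `fuzzy_group`, the greedy grouping strategy, and the 0.8 threshold are all assumptions of this sketch.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; a stand-in measure, not the one SSIS uses."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_lookup(value, reference, threshold=0.8):
    """Return the reference value closest to `value`, or None if none reaches the threshold.

    This mimics repairing a foreign-key mismatch: a dirty input value is
    mapped to the best-matching key in a clean reference table.
    """
    best = max(reference, key=lambda r: similarity(value, r), default=None)
    if best is not None and similarity(value, best) >= threshold:
        return best
    return None


def fuzzy_group(values, threshold=0.8):
    """Greedily cluster approximate duplicates.

    Each value joins the first group whose representative (its first member)
    is similar enough; otherwise it starts a new group.
    """
    groups = []
    for v in values:
        for g in groups:
            if similarity(v, g[0]) >= threshold:
                g.append(v)
                break
        else:
            groups.append([v])
    return groups
```

Under these assumptions, `fuzzy_lookup("Microsft Corporation", ["Microsoft Corporation", "Oracle Corporation", "IBM"])` resolves the misspelled value to "Microsoft Corporation", and `fuzzy_group(["Jon Smith", "John Smith", "Jane Doe"])` places the first two names in one group and the third in its own. The pairwise comparison is quadratic in the worst case, so it shows only the intended behavior and says nothing about how the operators achieve scalability.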