A grammar-based entity representation framework for data cleaning

Authors:
Arvind Arasu; Raghav Kaushik
Affiliations:
Microsoft Research, Redmond, WA, USA; Microsoft Research, Redmond, WA, USA
Venue:
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Year:
2009

Abstract

Fundamental to data cleaning is the need to account for multiple data representations. We propose a formal framework that can be used to reason about and manipulate data representations. The framework is declarative and combines elements of a generative grammar with database querying. It also incorporates actions in the spirit of programming language compilers. This framework has multiple applications such as parsing and data normalization. Data normalization is interesting in its own right in preparing data for analysis as well as in pre-processing data for further cleansing. We empirically study the utility of the framework over several real-world data cleaning scenarios and find that with the right normalization, often the need for further cleansing is minimized.
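
To make the idea of combining grammar rules, table lookups, and actions more concrete, the following is a minimal illustrative sketch and not the authors' framework. It assumes a toy address grammar (`Address -> Number Name+ StreetType`), a hypothetical lookup table `STREET_TYPES` standing in for a database relation of known representations, and hypothetical helper names (`parse_number`, `parse_street_type`, `normalize_address`). Each rule's action returns the normalized form of what it matched.

```python
# Hypothetical relation of known representations: variant -> canonical form.
# In a real system this could be a database table; here it is a plain dict.
STREET_TYPES = {
    "st": "Street", "st.": "Street", "street": "Street",
    "ave": "Avenue", "ave.": "Avenue", "avenue": "Avenue",
}


def parse_number(tokens, i):
    """Rule: Number -> digit token. Action: return the token unchanged."""
    if i < len(tokens) and tokens[i].isdigit():
        return tokens[i], i + 1
    return None


def parse_street_type(tokens, i):
    """Rule: StreetType -> token found in STREET_TYPES. Action: canonical form."""
    if i < len(tokens):
        canonical = STREET_TYPES.get(tokens[i].lower())
        if canonical is not None:
            return canonical, i + 1
    return None


def normalize_address(text):
    """Rule: Address -> Number Name+ StreetType. Action: join normalized parts.

    Returns the normalized string, or None if the text does not parse.
    """
    tokens = text.split()
    head = parse_number(tokens, 0)
    if head is None:
        return None
    number, i = head

    # Name+ needs at least one token; the final token is reserved for StreetType.
    last = len(tokens) - 1
    if last <= i:
        return None
    tail = parse_street_type(tokens, last)
    if tail is None:
        return None

    name = " ".join(t.capitalize() for t in tokens[i:last])
    return f"{number} {name} {tail[0]}"


if __name__ == "__main__":
    # Two representations of the same entity normalize to one string.
    print(normalize_address("1 main st"))
    print(normalize_address("1 Main Street"))
```

Under these assumptions, different surface representations of the same address map to a single normalized string, which is the sense in which normalization can reduce the need for further cleansing.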