Fundamental to data cleaning is the need to account for multiple data representations. We propose a formal framework that can be used to reason about and manipulate data representations. The framework is declarative and combines elements of a generative grammar with database querying. It also incorporates actions in the spirit of programming language compilers. This framework has multiple applications, such as parsing and data normalization. Data normalization is of interest in its own right, both for preparing data for analysis and for pre-processing data before further cleaning. We empirically study the utility of the framework over several real-world data cleaning scenarios and find that, with the right normalization, the need for further cleaning is often minimized.
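To make the idea of grammar rules with attached actions more concrete, the following is a minimal illustrative sketch in Python. It is not the framework proposed here. The rule set, the `MONTHS` lookup table (standing in for a database relation), and the function `normalize_date` are all hypothetical names chosen for the example. Each rule pairs a pattern describing one representation of a date with an action that builds a common normalized form.

```python
import re

# Hypothetical reference table, standing in for a database relation that
# the rules can consult while parsing. Deliberately incomplete.
MONTHS = {"jan": 1, "january": 1, "feb": 2, "february": 2, "mar": 3, "march": 3}

# Each rule pairs a pattern (playing the role of a production) with an
# action that builds a (year, month, day) tuple, loosely in the spirit of
# a semantic action in a compiler.
RULES = [
    # ISO style, e.g. 2008-03-07
    (re.compile(r"(\d{4})-(\d{1,2})-(\d{1,2})$"),
     lambda m: (int(m.group(1)), int(m.group(2)), int(m.group(3)))),
    # US style, e.g. 3/7/2008
    (re.compile(r"(\d{1,2})/(\d{1,2})/(\d{4})$"),
     lambda m: (int(m.group(3)), int(m.group(1)), int(m.group(2)))),
    # Month name, e.g. March 7, 2008 (the month is looked up in MONTHS)
    (re.compile(r"([A-Za-z]+)\.? (\d{1,2}), (\d{4})$"),
     lambda m: (int(m.group(3)), MONTHS.get(m.group(1).lower()), int(m.group(2)))),
]


def normalize_date(text):
    """Return the date as 'YYYY-MM-DD', or None if no rule applies."""
    text = text.strip()
    for pattern, action in RULES:
        match = pattern.match(text)
        if match:
            year, month, day = action(match)
            if month is None:  # month name missing from the lookup table
                continue
            return f"{year:04d}-{month:02d}-{day:02d}"
    return None
```

Under these assumptions, `normalize_date("3/7/2008")` and `normalize_date("March 7, 2008")` both map to the same string, so two records that differ only in date representation compare equal after normalization. The sketch does not validate that the resulting date exists in the calendar.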