HIL: a high-level scripting language for entity integration

Authors:
Mauricio Hernández;Georgia Koutrika;Rajasekar Krishnamurthy;Lucian Popa;Ryan Wisnesky
Affiliations:
IBM Research -- Almaden;HP Labs;IBM Research -- Almaden;IBM Research -- Almaden;Harvard University
Venue:
Proceedings of the 16th International Conference on Extending Database Technology
Year:
2013

Abstract

We introduce HIL, a high-level scripting language for entity resolution and integration. HIL aims to provide the core logic for complex data processing flows that aggregate facts from large collections of structured or unstructured data into clean, unified entities. Such flows typically include many stages of processing that start from the outcome of information extraction and continue with entity resolution, mapping and fusion. A HIL program captures the overall integration flow through a combination of SQL-like rules that link, map, fuse and aggregate entities. A salient feature of HIL is the use of logical indexes in its data model to facilitate the modular construction and aggregation of complex entities. Another feature is the presence of a flexible, open type system that allows HIL to handle input data that is irregular, sparse or partially known. As a result, HIL can accurately express complex integration tasks, while still being high-level and focused on the logical entities (rather than the physical operations). Compilation algorithms translate the HIL specification into efficient run-time queries that can execute in parallel on Hadoop. We show how our framework is applied to real-world integration of entities in the financial domain, based on public filings archived by the U.S. Securities and Exchange Commission (SEC). Furthermore, we apply HIL to a larger-scale scenario that performs fusion of data from hundreds of millions of Twitter messages into tens of millions of structured entities.
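
The role of a logical index can be illustrated with a small sketch. The code below is not HIL syntax and is not taken from the paper; it is a plain Python approximation, under the assumption that a logical index behaves like a map from an entity key to a set of facts that one rule populates and another rule reads. The record fields (`cik`, `name`, `company`, `title`) and the function names are hypothetical. The optional `title` field stands in for the sparse, partially known input that the open type system is meant to tolerate.

```python
from collections import defaultdict

# Hypothetical extracted facts. Field names are illustrative only;
# the second record has no "title", mimicking sparse input.
filings = [
    {"cik": "001", "name": "Jane Doe", "company": "Acme Corp", "title": "CEO"},
    {"cik": "001", "name": "Jane Doe", "company": "Beta Inc"},
    {"cik": "002", "name": "John Roe", "company": "Acme Corp", "title": "CFO"},
]


def build_positions_index(records):
    """Populate an index: person key -> set of (company, title) facts."""
    index = defaultdict(set)
    for rec in records:
        # rec.get returns None when the optional field is absent.
        index[rec["cik"]].add((rec["company"], rec.get("title")))
    return index


def build_person_entities(records, positions):
    """Build one entity per key, reading aggregated facts from the index."""
    entities = {}
    for rec in records:
        key = rec["cik"]
        if key not in entities:
            entities[key] = {
                "name": rec["name"],
                # Sort for a stable order; a missing title sorts as "".
                "positions": sorted(
                    positions[key], key=lambda p: (p[0], p[1] or "")
                ),
            }
    return entities


positions = build_positions_index(filings)
people = build_person_entities(filings, positions)
for key, person in sorted(people.items()):
    print(key, person)
```

In this sketch the index is filled by one function and consumed by another, so the two steps can be written and changed independently. That separation is the property the abstract attributes to logical indexes, though the actual HIL semantics and compilation to Hadoop queries are not modeled here.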