We introduce HIL, a high-level scripting language for entity resolution and integration. HIL aims to provide the core logic for complex data processing flows that aggregate facts from large collections of structured or unstructured data into clean, unified entities. Such flows typically include many stages of processing that start from the outcome of information extraction and continue with entity resolution, mapping, and fusion. A HIL program captures the overall integration flow through a combination of SQL-like rules that link, map, fuse, and aggregate entities. A salient feature of HIL is the use of logical indexes in its data model to facilitate the modular construction and aggregation of complex entities. Another feature is a flexible, open type system that allows HIL to handle input data that is irregular, sparse, or partially known. As a result, HIL can accurately express complex integration tasks while remaining high-level and focused on the logical entities rather than the physical operations. Compilation algorithms translate the HIL specification into efficient run-time queries that can execute in parallel on Hadoop. We show how our framework is applied to real-world integration of entities in the financial domain, based on public filings archived by the U.S. Securities and Exchange Commission (SEC). Furthermore, we apply HIL to a larger-scale scenario that fuses data from hundreds of millions of Twitter messages into tens of millions of structured entities.
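To make the idea of logical indexes more concrete, the following is a rough Python sketch. It is not HIL syntax and not the authors' implementation. It models an index as a map from a key to a set of values that one rule populates and another rule reads when assembling an entity. The record fields (`cik`, `name`, `title`, `company`), the sample data, and the function names are all hypothetical, chosen only for illustration. The optional `title` field stands in for the sparse, partially known input that the open type system is meant to tolerate.

```python
from collections import defaultdict

# Hypothetical extracted records; "title" is missing in one of them
# to mimic sparse, partially known input.
filings = [
    {"cik": "001", "name": "Jane Doe", "title": "CEO", "company": "Acme"},
    {"cik": "001", "name": "Jane Doe", "company": "Beta Corp"},
    {"cik": "002", "name": "John Roe", "title": "CFO", "company": "Acme"},
]

# Illustrative stand-in for a logical index: person key -> set of positions.
positions = defaultdict(set)


def populate_positions(records):
    """Rule 1: add one (company, title) pair per record to the index."""
    for r in records:
        # r.get tolerates a missing "title" and stores None instead.
        positions[r["cik"]].add((r["company"], r.get("title")))


def build_people(records):
    """Rule 2: create one entity per key and fuse in the indexed values."""
    people = {}
    for r in records:
        people.setdefault(r["cik"], {"name": r["name"]})
    for cik, person in people.items():
        # Sort by company only, so a None title is never compared.
        person["positions"] = sorted(positions[cik], key=lambda p: p[0])
    return people


populate_positions(filings)
people = build_people(filings)
```

In this sketch the two rules are independent: further rules could add to `positions` from other sources without any change to `build_people`, which is one way to read the modular construction the abstract describes.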