Declarative information extraction using datalog with embedded extraction predicates

Authors:
Warren Shen; AnHai Doan; Jeffrey F. Naughton; Raghu Ramakrishnan
Affiliations:
University of Wisconsin-Madison; University of Wisconsin-Madison; University of Wisconsin-Madison; Yahoo! Research
Venue:
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Year:
2007



Abstract

In this paper we argue that developing information extraction (IE) programs using Datalog with embedded procedural extraction predicates is a good way to proceed. First, compared to current ad-hoc composition using, e.g., Perl or C++, Datalog provides a cleaner and more powerful way to compose small extraction modules into larger programs. Thus, writing IE programs this way retains and enhances the important advantages of current approaches: programs are easy to understand, debug, and modify. Second, once we write IE programs in this framework, we can apply query optimization techniques to them. This gives programs that, when run over a variety of data sets, are more efficient than any monolithic program because they are optimized based on the statistics of the data on which they are invoked. We show how optimizing such programs raises challenges specific to text data that cannot be accommodated in the current relational optimization framework, then provide initial solutions. Extensive experiments over real-world data demonstrate that optimization is indeed vital for IE programs and that we can effectively optimize IE programs written in this proposed framework.
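
To make the idea concrete, here is a minimal Python sketch, not the authors' system or syntax. It emulates one Datalog-style rule whose body calls procedural extraction predicates, then evaluates it with two plans that return the same answers. All names (`extract_sentences`, `extract_person`, `extract_topic`, `plan_naive`, `plan_filtered`, the `talks` rule, and the sample text) are hypothetical and chosen only for illustration; the toy regular expressions stand in for real extractors.

```python
import re
from typing import List, Tuple

# Counts invocations of the (assumed expensive) person extractor.
calls = {"extract_person": 0}

def extract_sentences(doc: str) -> List[str]:
    """Hypothetical extraction predicate: split a document into sentences."""
    return [s for s in re.split(r"(?<=[.!?])\s+", doc.strip()) if s]

def extract_person(sentence: str) -> List[str]:
    """Hypothetical extraction predicate: two consecutive capitalised words."""
    calls["extract_person"] += 1
    return [m.group(0)
            for m in re.finditer(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", sentence)]

def extract_topic(sentence: str) -> List[str]:
    """Hypothetical extraction predicate: the phrase following 'talk on'."""
    return [m.group(1).strip()
            for m in re.finditer(r"talk on ([a-z ]+)", sentence)]

# Rule being emulated (illustrative notation only):
#   talks(p, t) :- docs(d), extract_sentences(d, s),
#                  extract_person(s, p), extract_topic(s, t).

def plan_naive(docs: List[str]) -> List[Tuple[str, str]]:
    """Evaluate the rule body left to right, running every extractor."""
    return [(p, t)
            for d in docs
            for s in extract_sentences(d)
            for p in extract_person(s)
            for t in extract_topic(s)]

def plan_filtered(docs: List[str]) -> List[Tuple[str, str]]:
    """Same rule, but a cheap text test runs before the person extractor.

    A sentence without the substring 'talk on' can never satisfy
    extract_topic, so skipping it cannot change the result.
    """
    out = []
    for d in docs:
        for s in extract_sentences(d):
            if "talk on" not in s:
                continue
            for p in extract_person(s):
                for t in extract_topic(s):
                    out.append((p, t))
    return out

DOCS = ["Alice Smith gave a talk on query optimization. "
        "Bob Jones was in the audience."]

if __name__ == "__main__":
    for plan in (plan_naive, plan_filtered):
        calls["extract_person"] = 0
        print(plan.__name__, plan(DOCS), calls["extract_person"])
```

Because the rule states only what to extract, an optimizer is free to pick either plan. Which one is cheaper depends on the data, for example on how many sentences contain the filter phrase, which is the kind of statistic the abstract refers to.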