Efficient Information Extraction over Evolving Text Data

Authors:
Fei Chen;AnHai Doan;Jun Yang;Raghu Ramakrishnan
Affiliations:
University of Wisconsin-Madison;University of Wisconsin-Madison;Duke University;Yahoo! Research
Venue:
ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Year:
2008

Citing 0
Cited 13

Information Extraction
Foundations and Trends in Databases

High-performance information extraction with AliBaba
Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology

Information extraction challenges in managing unstructured data
ACM SIGMOD Record

Purple SOX extraction management system
ACM SIGMOD Record

A web of concepts
Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

Efficiently incorporating user feedback into information extraction and integration programs
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data

Optimizing complex extraction programs over evolving text data
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data

From information to knowledge: harvesting entities and relationships from web sources
Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

I4E: interactive investigation of iterative information extraction
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

Searching the web of objects
ICOODB'10 Proceedings of the Third international conference on Objects and databases

Chapter 2: next generation web search
Search Computing

INDREX: in-database distributional relation extraction
Proceedings of the sixteenth international workshop on Data warehousing and OLAP

Using semantics to process legal document updates
Proceedings of the sixth international workshop on Exploiting semantic annotations in information retrieval

Abstract

Most current information extraction (IE) approaches have considered only static text corpora, over which IE typically needs to be applied only once. Many real-world text corpora, however, are dynamic: they evolve over time, and to keep extracted information up to date, we often must apply IE repeatedly to consecutive corpus snapshots. We describe Cyclex, an approach that efficiently executes such repeated IE by recycling previous IE efforts. Specifically, given a current corpus snapshot U, Cyclex identifies text portions of U that also appear in the previous corpus snapshot V. Since Cyclex has already executed IE over V, it can recycle the IE results for these portions, combining them with the results of executing IE over the remaining parts of U to produce the complete IE results for U. Realizing Cyclex raises many challenges, including modeling information extractors, exploring the trade-off between runtime and completeness in identifying overlapping text, and making informed, cost-based decisions between redoing IE from scratch and recycling previous IE results. We describe initial solutions to these challenges, and experiments over two real-world data sets that demonstrate the utility of our approach.
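
The recycling idea in the abstract can be illustrated with a minimal sketch. This is not the Cyclex implementation: it assumes, purely for illustration, that each snapshot is a list of lines, that the extractor looks at one line at a time (so a result for an unchanged line is safe to reuse), and that overlap is detected by exact line equality. The names `extract_emails` and `recycle_extract` are hypothetical, and the sketch ignores the cost-based choice between recycling and re-extracting that the abstract mentions.

```python
import re
from typing import Callable, List, Tuple

# A mention is (start offset, end offset, matched text), relative to its line.
Mention = Tuple[int, int, str]

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_emails(line: str) -> List[Mention]:
    """Toy extractor (hypothetical): find e-mail-like strings in a single line."""
    return [(m.start(), m.end(), m.group()) for m in EMAIL.finditer(line)]


def recycle_extract(
    prev_lines: List[str],
    prev_results: List[List[Mention]],
    cur_lines: List[str],
    extractor: Callable[[str], List[Mention]],
) -> Tuple[List[List[Mention]], int]:
    """Extract over the current snapshot, reusing results from the previous one.

    Returns the per-line results for cur_lines and the number of times the
    extractor actually had to run.
    """
    # Index the previous snapshot's results by line content.
    cache = dict(zip(prev_lines, prev_results))
    results: List[List[Mention]] = []
    calls = 0
    for line in cur_lines:
        if line not in cache:
            # Line is new in this snapshot: run the extractor and remember
            # the result, so repeated new lines are also extracted only once.
            cache[line] = extractor(line)
            calls += 1
        results.append(cache[line])
    return results, calls


if __name__ == "__main__":
    v = ["Contact alice@example.org for details.", "No address here."]
    v_results = [extract_emails(line) for line in v]
    u = ["Contact alice@example.org for details.", "New maintainer: bob@example.com."]
    u_results, calls = recycle_extract(v, v_results, u, extract_emails)
    print(u_results, calls)
```

Exact line matching is the cheapest way to find overlap but also the least complete, since a single changed character forces re-extraction of the whole line. The abstract's trade-off between runtime and completeness in identifying overlapping text concerns exactly this kind of choice.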