Information extraction challenges in managing unstructured data

Authors:
AnHai Doan;Jeffrey F. Naughton;Raghu Ramakrishnan;Akanksha Baid;Xiaoyong Chai;Fei Chen;Ting Chen;Eric Chu;Pedro DeRose;Byron Gao;Chaitanya Gokhale;Jiansheng Huang;Warren Shen;Ba-Quy Vuong
Affiliations:
University of Wisconsin-Madison (all authors)
Venue:
ACM SIGMOD Record
Year:
2009


Abstract

Over the past few years, we have been trying to build an end-to-end system at Wisconsin to manage unstructured data, using extraction, integration, and user interaction. This paper describes the key information extraction (IE) challenges that we have run into, and sketches our solutions. We discuss in particular developing a declarative IE language, optimizing for this language, generating IE provenance, incorporating user feedback into the IE process, developing a novel wiki-based user interface for feedback, best-effort IE, pushing IE into RDBMSs, and more. Our work suggests that IE in managing unstructured data can open up many interesting research challenges, and that these challenges can greatly benefit from the wealth of work on managing structured data that has been carried out by the database community.