Provenance-based dictionary refinement in information extraction

Authors:
Sudeepa Roy;Laura Chiticariu;Vitaly Feldman;Frederick R. Reiss;Huaiyu Zhu
Affiliations:
University of Washington, Seattle, WA, USA;IBM Research-Almaden, San Jose, CA, USA;IBM Research-Almaden, San Jose, CA, USA;IBM Research-Almaden, San Jose, CA, USA;IBM Research-Almaden, San Jose, CA, USA
Venue:
Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Year:
2013

Citing 29
Cited 0

Choosing subsets with maximum weighted average

Journal of Algorithms
Snowball: extracting relations from large plain-text collections

DL '00 Proceedings of the fifth ACM conference on Digital libraries
Information Retrieval

Information Retrieval
On propagation of deletions and annotations through views

Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Computers and Intractability: A Guide to the Theory of NP-Completeness

Computers and Intractability: A Guide to the Theory of NP-Completeness
Building a large annotated corpus of English: the penn treebank

Computational Linguistics - Special issue on using large corpora: II
Named Entity recognition without gazetteers

EACL '99 Proceedings of the ninth conference on European chapter of the Association for Computational Linguistics
Exploiting dictionaries in named entity extraction: combining semi-Markov extraction processes and data integration methods

Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Introduction to the CoNLL-2003 shared task: language-independent named entity recognition

CONLL '03 Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4
Speech and Language Processing (2nd Edition)

Speech and Language Processing (2nd Edition)
Provenance semirings

Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Declarative information extraction using datalog with embedded extraction predicates

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Toward best-effort information extraction

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
SystemT: a system for declarative information extraction

ACM SIGMOD Record
An Algebraic Approach to Rule-Based Information Extraction

ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Efficiently incorporating user feedback into information extraction and integration programs

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Provenance in Databases: Why, How, and Where

Foundations and Trends in Databases
XAR: An Integrated Framework for Information Extraction

CSIE '09 Proceedings of the 2009 WRI World Congress on Computer Science and Information Engineering - Volume 04
Methods for domain-independent information extraction from the web: an experimental comparison

AAAI'04 Proceedings of the 19th national conference on Artifical intelligence
Bootstrapping named entity recognition with automatically generated gazetteer lists

EACL '06 Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop
TextRunner: open information extraction on the web

NAACL-Demonstrations '07 Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations
Harvesting relational tables from lists on the web

Proceedings of the VLDB Endowment
Computing query probability with incidence algebras

Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Domain adaptation of rule-based annotators for named-entity recognition tasks

EMNLP '10 Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing
Automatic rule refinement for information extraction

Proceedings of the VLDB Endowment
Maximizing conjunctive views in deletion propagation

Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Tracing data errors with view-conditioned causality

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
Robust disambiguation of named entities in text

EMNLP '11 Proceedings of the Conference on Empirical Methods in Natural Language Processing
Unsupervised named-entity recognition: generating gazetteers and resolving ambiguity

AI'06 Proceedings of the 19th international conference on Advances in Artificial Intelligence: Canadian Society for Computational Studies of Intelligence

Quantified Score

Hi-index	0.00

Visualization

Abstract

Dictionaries of terms and phrases (e.g. common person or organization names) are integral to information extraction systems that extract structured information from unstructured text. Using noisy or unrefined dictionaries may lead to many incorrect results even when highly precise and sophisticated extraction rules are used. In general, the results of the system are dependent on dictionary entries in arbitrary complex ways, and removal of a set of entries can remove both correct and incorrect results. Further, any such refinement critically requires laborious manual labeling of the results. In this paper, we study the dictionary refinement problem and address the above challenges. Using provenance of the outputs in terms of the dictionary entries, we formalize an optimization problem of maximizing the quality of the system with respect to the refined dictionaries, study complexity of this problem, and give efficient algorithms. We also propose solutions to address incomplete labeling of the results where we estimate the missing labels assuming a statistical model. We conclude with a detailed experimental evaluation using several real-world extractors and competition datasets to validate our solutions. Beyond information extraction, our provenance-based techniques and solutions may find applications in view-maintenance in general relational settings.