Dictionaries of terms and phrases (e.g., common person or organization names) are integral to information extraction systems that extract structured information from unstructured text. Noisy or unrefined dictionaries can produce many incorrect results even when the extraction rules are highly precise and sophisticated. In general, the results of the system depend on dictionary entries in arbitrarily complex ways, and removing a set of entries can remove both correct and incorrect results. Further, any such refinement requires laborious manual labeling of the results. In this paper, we study the dictionary refinement problem and address these challenges. Using the provenance of the outputs in terms of the dictionary entries, we formalize an optimization problem of maximizing the quality of the system with respect to the refined dictionaries, study the complexity of this problem, and give efficient algorithms. We also propose solutions for incomplete labeling of the results, in which we estimate the missing labels under a statistical model. We conclude with a detailed experimental evaluation using several real-world extractors and competition datasets to validate our solutions. Beyond information extraction, our provenance-based techniques and solutions may find applications in view maintenance in general relational settings.
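To make the optimization problem concrete, the following is a minimal sketch on toy data, not the algorithm from the paper. It assumes that each labeled result carries the set of dictionary entries it depends on, that a result survives only if none of those entries is removed, and that quality is measured by the F1 score; the abstract fixes none of these choices. The function names, the example entries, and the exhaustive search are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def f1_after_removal(results, removed):
    """F1 of the results that survive after deleting `removed` entries.

    results: list of (provenance, is_correct) pairs, where provenance is a
    frozenset of dictionary entries (assumed conjunctive: a result survives
    only if every entry in its provenance is kept).
    """
    total_correct = sum(1 for _, ok in results if ok)
    surviving = [ok for entries, ok in results if not (entries & removed)]
    true_pos = sum(surviving)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(surviving)
    recall = true_pos / total_correct
    return 2 * precision * recall / (precision + recall)


def best_refinement(results, max_removed):
    """Exhaustively try removing up to `max_removed` entries (toy sizes only)."""
    entries = sorted(set().union(*(e for e, _ in results)))
    best = (f1_after_removal(results, frozenset()), frozenset())
    for k in range(1, max_removed + 1):
        for combo in combinations(entries, k):
            removed = frozenset(combo)
            score = f1_after_removal(results, removed)
            if score > best[0]:
                best = (score, removed)
    return best


# Hypothetical labeled results: (entries the result depends on, is it correct?)
results = [
    (frozenset({"john"}), True),
    (frozenset({"john"}), True),
    (frozenset({"april"}), False),
    (frozenset({"april"}), False),
    (frozenset({"april"}), True),
    (frozenset({"smith", "john"}), True),
    (frozenset({"bank"}), False),
]

score, removed = best_refinement(results, max_removed=2)
print(sorted(removed), round(score, 3))
```

On this toy input, removing "april" discards two incorrect results but also one correct one, which is the trade-off the abstract describes. The exhaustive search is exponential in the number of entries and is shown only to state the objective.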