Datasets for generic relation extraction

Authors:
B. Hachey; C. Grover; R. Tobin
Affiliations:
Language Technology Group, Macquarie University, NSW 2109, Australia (email: bhachey@cmcrc.com); Informatics Forum, 10 Crichton Street, Edinburgh, EH8 9AB, Scotland (email: c.grover@ed.ac.uk, r.tobin@ed.ac.uk)
Venue:
Natural Language Engineering
Year:
2012

Abstract

A vast amount of usable electronic data is in the form of unstructured text. The relation extraction task aims to identify useful information in text (e.g. PersonW works for OrganisationX, GeneY encodes ProteinZ) and recode it in a format, such as a relational database or RDF triplestore, that can be used more effectively for querying and automated reasoning. A number of resources have been developed for training and evaluating automatic systems for relation extraction in different domains. However, comparative evaluation is impeded by the fact that these corpora use different markup formats and different notions of what constitutes a relation. We describe the preparation of corpora for comparative evaluation of relation extraction across domains, based on the publicly available ACE 2004, ACE 2005 and BioInfer data sets. We present a common document type that uses token standoff and includes detailed linguistic markup, while maintaining all information in the original annotation. The subsequent reannotation process normalises the two data sets so that they comply with a notion of relation that is intuitive, simple and informed by the semantic web. For the ACE data, we describe an automatic process that converts many relations involving nested, nominal entity mentions to relations involving non-nested, named or pronominal entity mentions. For example, the first entity is mapped from 'one' to 'Amidu Berry' in the membership relation described in 'Amidu Berry, one half of PBS'. Moreover, we describe a comparably reannotated version of the BioInfer corpus that flattens nested relations, maps part-whole to part-part relations and maps n-ary to binary relations. Finally, we summarise experiments that compare approaches to generic relation extraction, a knowledge discovery task that uses minimally supervised techniques to achieve maximally portable extractors. These experiments illustrate the utility of the corpora.
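
The 'Amidu Berry, one half of PBS' example can be made concrete with a minimal sketch of token standoff and mention remapping. Everything below is an illustrative assumption and not the corpus format or the authors' conversion process: the `Mention` and `Relation` classes, the mention identifiers, the `same_entity` table linking the nominal mention to the named one, and the tokenisation are all hypothetical. The sketch shows entity mentions pointing at token indices rather than containing text, and a relation being rewritten from the nominal mention 'one' to the named mention 'Amidu Berry' before it is output as a subject-predicate-object triple.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    id: str
    start: int  # index of the first token (inclusive)
    end: int    # index of the last token (inclusive)
    kind: str   # "named", "nominal" or "pronominal"


@dataclass(frozen=True)
class Relation:
    type: str
    arg1: str  # id of the first mention
    arg2: str  # id of the second mention


# Token standoff: annotations refer to positions in this list.
tokens = ["Amidu", "Berry", ",", "one", "half", "of", "PBS"]

mentions = {
    "m1": Mention("m1", 0, 1, "named"),    # Amidu Berry
    "m2": Mention("m2", 3, 3, "nominal"),  # one
    "m3": Mention("m3", 6, 6, "named"),    # PBS
}

# Hypothetical link saying that the nominal mention m2 refers to the
# same entity as the named mention m1 (assumed to be given).
same_entity = {"m2": "m1"}


def mention_text(mention: Mention) -> str:
    """Recover the surface string of a mention from its token span."""
    return " ".join(tokens[mention.start:mention.end + 1])


def remap(relation: Relation) -> Relation:
    """Replace nominal arguments with a linked mention where one exists."""
    def resolve(mention_id: str) -> str:
        if mentions[mention_id].kind == "nominal" and mention_id in same_entity:
            return same_entity[mention_id]
        return mention_id

    return Relation(relation.type, resolve(relation.arg1), resolve(relation.arg2))


def as_triple(relation: Relation) -> tuple:
    """Render a relation as a (subject, predicate, object) triple of strings."""
    return (
        mention_text(mentions[relation.arg1]),
        relation.type,
        mention_text(mentions[relation.arg2]),
    )


original = Relation("membership", "m2", "m3")
print(as_triple(original))         # ('one', 'membership', 'PBS')
print(as_triple(remap(original)))  # ('Amidu Berry', 'membership', 'PBS')
```

Because mentions store only token indices, further layers of markup could refer to the same tokens without altering them, which is the general appeal of a standoff design.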