Incremental information extraction using tree-based context representations

  • Authors: Christian Siefkes
  • Affiliations: Berlin-Brandenburg Graduate School in Distributed Information Systems, Database and Information Systems Group, Freie Universität Berlin, Berlin, Germany
  • Venue: CICLing'05: Proceedings of the 6th International Conference on Computational Linguistics and Intelligent Text Processing
  • Year: 2005

Abstract

The purpose of information extraction (IE) is to find desired pieces of information in natural language texts and store them in a form suitable for automatic processing. Providing annotated training data to adapt a trainable IE system to a new domain requires a considerable amount of work. To address this, we explore incremental learning: training documents are annotated sequentially by a user and immediately incorporated into the extraction model. The system can thus support the user by proposing extractions based on the current extraction model, reducing the user's workload over time. We introduce an approach that models IE as a token classification task and allows incremental training. To provide sufficient information to the token classifiers, we use rich, tree-based context representations of each token as feature vectors. These representations combine the heuristically deduced document structure with linguistic and semantic information. We treat the resulting feature vectors as ordered and combine proximate features into more expressive joint features, called “Orthogonal Sparse Bigrams” (OSB). Our results indicate that this setup makes it possible to employ IE in an incremental fashion without a serious performance penalty.
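
The OSB combination step can be pictured with a short sketch. The Python snippet below is a minimal, hypothetical illustration of how ordered context features inside a small sliding window might be paired into joint features with skip markers; the window size, the `<skip>` placeholder, and the example feature names are illustrative assumptions rather than details taken from the paper.

```python
from typing import List


def osb_features(features: List[str], window: int = 5) -> List[str]:
    """Combine proximate ordered features into joint OSB-style features.

    Each feature is paired with every later feature inside the window;
    skipped positions are marked so the distance between the two
    component features is preserved in the joint feature.
    """
    joint = []
    for i, head in enumerate(features):
        for dist in range(1, window):
            j = i + dist
            if j >= len(features):
                break
            # "<skip>" markers encode how far apart the two features are.
            joint.append(" ".join([head] + ["<skip>"] * (dist - 1) + [features[j]]))
    return joint


if __name__ == "__main__":
    # Hypothetical context features for one token.
    ctx = ["word=Berlin", "pos=NNP", "capitalized", "in=heading"]
    for f in osb_features(ctx, window=4):
        print(f)
```

Because each joint feature records the distance between its two components, a token classifier can distinguish immediately adjacent features from those that are further apart in the ordered feature vector.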