Proceedings of the Workshop on Information Extraction Beyond The Document

  • Authors:
  • Mary Elaine Califf;Mark A. Greenwood;Mark Stevenson;Roman Yangarber

  • Affiliations:
  • Illinois State University;University of Sheffield;University of Sheffield;University of Helsinki

  • Venue:
  • IEBeyondDoc '06 Proceedings of the Workshop on Information Extraction Beyond The Document
  • Year:
  • 2006

Abstract

Traditional approaches to the development and evaluation of Information Extraction (IE) systems have relied on relatively small collections of up to a few hundred documents tagged with detailed semantic annotations. While this paradigm has enabled rapid advances in IE technology, it remains constrained by its dependence on annotated documents and does not make use of the information available in large corpora. Alternative approaches, which make use of large text collections and inter-document information, are now beginning to emerge, in parallel with a growing interest in learning from unlabelled data in AI in general. For example, some systems learn extraction patterns by exploiting information about their distribution across corpora; others exploit the redundancy of the Internet by assuming that facts with multiple mentions are more reliable. These approaches require large amounts of unannotated text, which is generally easy to obtain, and employ unsupervised or minimally supervised learning algorithms, as well as related techniques such as co-training and active learning. They are complementary to the established IE paradigm based on supervised training, and together they form a cohesive emerging trend in recent research. They constitute the focus of this workshop.