Information Extraction from the Web: System and Techniques

  • Authors:
  • Luo Xiao;Dieter Wissmann;Michael Brown;Stefan Jablonski

  • Affiliations:
  • CT SE 5, Siemens AG, Erlangen, Germany. Luo.Xiao@siemens.de;CT SE 5, Siemens AG, Erlangen, Germany. Dieter.Wissmann@siemens.de;Global Transactions, Ltd., Berlin, Germany. Mike@GTCT.com;Department of Computer Sciences VI, University of Erlangen-Nuremberg, Germany. Stefan.Jablonski@informatik.uni-erlangen.de

  • Venue:
  • Applied Intelligence
  • Year:
  • 2004

Abstract

Information Extraction (IE) systems that can exploit the vast source of textual information that is the internet would provide a revolutionary step forward in delivering large volumes of content cheaply and precisely, thus enabling a wide range of new knowledge-driven applications and services. However, despite this enormous potential, few IE systems have successfully made the transition from laboratory to commercial application. The reason may be a purely practical one: building usable, scalable IE systems requires bringing together a range of different technologies, as well as providing clear and reproducible guidelines on how to collectively configure and deploy those technologies.

This paper is an attempt to address these issues. It focuses on two primary goals. Firstly, we show that an information extraction system intended for real-world applications and different domains can be built from autonomous, corporate components (agents). Such a system has several advanced properties: clear separation of the different extraction tasks and steps, portability to multiple application domains, trainability, extensibility, etc. Secondly, we show that machine learning, and in particular learning in different ways and at different levels, can be used to build practical IE systems. We show that carefully selecting the right machine learning technique for the right task, combined with selective sampling, can reduce the human effort required to annotate examples for building such systems.
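
The abstract does not say how selective sampling is carried out, so the following is only a minimal sketch of one common variant, entropy-based uncertainty sampling, and should not be read as the authors' method. It assumes a model that already yields a label distribution for each unlabeled example; the function names, document ids, and probabilities are all hypothetical and chosen for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a predicted label distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def select_for_annotation(predictions, batch_size):
    """Return the ids of the `batch_size` most uncertain examples.

    `predictions` maps an example id to the current model's predicted
    label distribution (a list of probabilities summing to 1).
    """
    ranked = sorted(
        predictions,
        key=lambda example_id: entropy(predictions[example_id]),
        reverse=True,  # highest entropy (least confident) first
    )
    return ranked[:batch_size]


# Hypothetical model output for four unlabeled text snippets.
predictions = {
    "doc1": [0.98, 0.02],  # model is confident: little to gain from a label
    "doc2": [0.55, 0.45],
    "doc3": [0.70, 0.30],
    "doc4": [0.50, 0.50],  # model is maximally unsure
}

# Only these examples would be handed to a human annotator.
chosen = select_for_annotation(predictions, batch_size=2)
```

Under these assumptions, annotators label only the examples the current model is least sure about, which is the sense in which selective sampling can reduce annotation effort.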