Adaptive information extraction from text by rule induction and generalisation

  • Author: Fabio Ciravegna
  • Affiliation: Department of Computer Science, University of Sheffield, Sheffield, UK
  • Venue: IJCAI'01: Proceedings of the 17th International Joint Conference on Artificial Intelligence, Volume 2
  • Year: 2001

Abstract

(LP)² is a covering algorithm for adaptive Information Extraction (IE) from text. It induces symbolic rules that insert SGML tags into texts, learning from examples in a user-defined tagged corpus. Training proceeds in two steps: first a set of tagging rules is learned; then additional rules are induced to correct mistakes and imprecision in the tagging. Induction is performed by bottom-up generalization of examples in the training corpus, and shallow Natural Language Processing (NLP) knowledge is used in the generalization process. The algorithm has proved successful on two fronts. Scientifically, experiments on two publicly available corpora report excellent results with respect to the current state of the art. In terms of applications, a successful industrial IE tool has been based on (LP)²; real-world applications have been developed, and licenses have been released to external companies for building further applications. This paper presents (LP)², its experimental results and applications, and discusses the role of shallow NLP in rule induction.
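
The abstract gives no implementation detail, so the following is only a minimal Python sketch of the general idea of covering-style, bottom-up rule induction for tag insertion. It is not the paper's algorithm. Every name in it (`shape`, `matches`, `generalisations`, `induce`, the `window` and `max_error` parameters) and the toy corpus are assumptions made for illustration. The coarse `shape` classes stand in for shallow NLP knowledge, and only the first training step (learning tagging rules) is shown; the second step, inducing correction rules, is omitted.

```python
from itertools import product

def shape(tok):
    """Hypothetical stand-in for shallow NLP knowledge: a coarse lexical class."""
    if tok.isdigit():
        return "DIGIT"
    return "CAP" if tok[:1].isupper() else "LOW"

def matches(rule, tokens, i):
    """True if every condition of `rule` holds around the gap before tokens[i]."""
    for offset, (kind, value) in rule:
        j = i + offset
        if j < 0 or j >= len(tokens):
            return False
        actual = tokens[j] if kind == "word" else shape(tokens[j])
        if actual != value:
            return False
    return True

def seed(tokens, i, window):
    """Most specific rule: the exact words in a window around the insertion point."""
    return [(o, tokens[i + o]) for o in range(-window, window)
            if 0 <= i + o < len(tokens)]

def generalisations(conditions):
    """Keep each condition as a word test, relax it to a shape test, or drop it."""
    options = [[(o, ("word", w)), (o, ("shape", shape(w))), None]
               for o, w in conditions]
    for combo in product(*options):
        rule = tuple(c for c in combo if c is not None)
        if rule:
            yield rule

def score(rule, corpus):
    """Return (correctly covered insertion points, number of wrong firings)."""
    covered, wrong = set(), 0
    for d, (tokens, gold) in enumerate(corpus):
        for i in range(len(tokens)):
            if matches(rule, tokens, i):
                if i in gold:
                    covered.add((d, i))
                else:
                    wrong += 1
    return covered, wrong

def induce(corpus, window=1, max_error=0.0):
    """Covering loop: pick an uncovered example, generalise it, remove what it covers."""
    remaining = {(d, i) for d, (_, gold) in enumerate(corpus) for i in gold}
    rules = []
    while remaining:
        d, i = min(remaining)
        best, best_new = None, set()
        for rule in generalisations(seed(corpus[d][0], i, window)):
            covered, wrong = score(rule, corpus)
            new = covered & remaining
            if wrong / (len(covered) + wrong) <= max_error and len(new) > len(best_new):
                best, best_new = rule, new
        if best is None:
            remaining.discard((d, i))  # no acceptable rule; skip this example
        else:
            rules.append(best)
            remaining -= best_new
    return rules

# Toy corpus: (tokens, indices before which a tag should be inserted).
corpus = [
    ("the seminar starts at 3 pm".split(), {4}),
    ("the talk starts at 11 am".split(), {4}),
    ("room 5 opens at noon".split(), set()),
]
rules = induce(corpus)
```

On this toy corpus the loop ends with a single rule: the word "at" followed by a token of shape `DIGIT`. The fully lexical seed covers only one example, while dropping either condition makes the rule fire wrongly in the third sentence, so the partially generalised rule is the one retained.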