An adaptive information extraction system based on wrapper induction with POS tagging

  • Authors:
  • Rinaldo Lima; Bernard Espinasse; Fred Freitas

  • Affiliations:
  • Cidade Universitária, Recife, PE, Brazil; Domaine Universitaire de St Jérôme, Marseille Cedex, France; Cidade Universitária, Recife, PE, Brazil

  • Venue:
  • Proceedings of the 2010 ACM Symposium on Applied Computing
  • Year:
  • 2010

Abstract

Information Extraction (IE) performs two important tasks: identifying specific pieces of information in documents and storing them for future use. This work proposes an adaptive IE system based on Boosted Wrapper Induction (BWI), a supervised wrapper induction algorithm. However, some authors have shown that boosting techniques have difficulty processing natural language texts, which motivated coupling Part-of-Speech (POS) tagging with the BWI algorithm in our proposed system. To evaluate its performance, several experiments were carried out on three standard corpora. The results suggest that combining POS tagging with BWI yields a small performance gain of 3 to 5% over the original BWI algorithm on unstructured texts. These results place our system among the best comparable IE systems that use POS tagging, according to a comparison presented and discussed in the article.