Information extraction from semi-structured resources: a two-phase finite state transducers approach

  • Authors: Vesna Pajić; Gordana Pavlović Lažetić; Miloš Pajić

  • Affiliations: Faculty of Agriculture, University of Belgrade, Belgrade, Republic of Serbia; Faculty of Mathematics, University of Belgrade, Belgrade, Republic of Serbia; Faculty of Agriculture, University of Belgrade, Belgrade, Republic of Serbia

  • Venue: CIAA'11 Proceedings of the 16th international conference on Implementation and application of automata
  • Year: 2011

Abstract

The paper presents a new method, based on finite state transducers, for extracting information from semi-structured resources. The method has two clearly distinguished phases. The first, a pre-processing phase, relies strongly on analysis of the document structure and is used to locate records of data in the text. The second phase applies finite state transducers created for extracting information. The transducers can be modified to reach a preferred level of efficiency and can be reused for extracting information from other pre-processed documents. We conclude that even untagged text can be treated as semi-structured, provided its structure can be successfully pre-processed. As a result, we extracted data from free-form encyclopedia text and created a fully structured database of genotype and phenotype characteristics of organisms.
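
The two-phase idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record separator (a blank line), the field names `gram_stain` and `motility`, the transition table, and the sample text are all assumptions made up for illustration. Phase 1 locates records; phase 2 runs a small deterministic finite state transducer over the word tokens of each record and emits field/value pairs.

```python
import re

def split_records(text):
    """Phase 1 (assumed rule): records are separated by blank lines."""
    return [r.strip() for r in re.split(r"\n\s*\n", text) if r.strip()]

# Phase 2: a toy finite state transducer (hypothetical, not from the paper).
# Maps (state, input token) -> (next state, output field or None).
TRANSITIONS = {
    ("start", "gram"): ("after_gram", None),
    ("after_gram", "positive"): ("start", "gram_stain"),
    ("after_gram", "negative"): ("start", "gram_stain"),
    ("start", "motile"): ("start", "motility"),
    ("start", "nonmotile"): ("start", "motility"),
}

def transduce(record):
    """Run the transducer over one record and collect the emitted fields."""
    state = "start"
    fields = {}
    for token in re.findall(r"[a-z]+", record.lower()):
        key = (state, token)
        if key not in TRANSITIONS:
            # No transition from the current state: retry from the start state.
            key = ("start", token)
        state, field = TRANSITIONS.get(key, ("start", None))
        if field is not None:
            fields[field] = token  # emit the token as the field's value
    return fields

# Made-up sample text standing in for pre-processable encyclopedia entries.
SAMPLE = (
    "Escherichia coli. Gram-negative rods. Motile.\n"
    "\n"
    "Staphylococcus aureus. Gram-positive cocci. Nonmotile."
)

database = [transduce(record) for record in split_records(SAMPLE)]
```

Because the transducer only sees records produced by the first phase, the same transition table could in principle be reused on any other document that is pre-processed into the same record shape, which is the reuse property the abstract describes.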