Information extraction from semi-structured resources: a two-phase finite state transducers approach

Authors:
Vesna Pajić; Gordana Pavlović Lažetić; Miloš Pajić
Affiliations:
Faculty of Agriculture, University of Belgrade, Belgrade, Republic of Serbia; Faculty of Mathematics, University of Belgrade, Belgrade, Republic of Serbia; Faculty of Agriculture, University of Belgrade, Belgrade, Republic of Serbia
Venue:
CIAA'11 Proceedings of the 16th international conference on Implementation and application of automata
Year:
2011

Abstract

The paper presents a new method for extracting information from semi-structured resources, based on finite state transducers. The method has two clearly distinguished phases. The first, a pre-processing phase, relies heavily on analysis of the document structure and is used to locate data records in the text. The second phase uses finite state transducers built to extract information from those records. The transducers can be modified to reach a preferred level of efficiency, and they can be reused to extract information from other pre-processed documents. We conclude that even untagged text can be treated as semi-structured, provided that its structure can be successfully pre-processed. Applying the method, we extracted data from free-form encyclopedia text and created a fully structured database of genotype and phenotype characteristics of organisms.
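
The two phases described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The sample text, the record layout (blank-line-separated blocks with the organism name on the first line), the field names `gram_stain` and `optimum_temp_c`, and the transition table are all assumptions invented for this illustration. Phase 1 locates records from the document structure; phase 2 runs a small deterministic transducer over the tokens of each record and emits field values.

```python
import re
from typing import Dict, List

# Hypothetical sample text, invented for this sketch (not taken from the paper).
SAMPLE = """Escherichia coli
Gram negative rods. Temperature optimum 37 C.

Bacillus subtilis
Gram positive rods. Temperature optimum 30 C."""


def split_records(text: str) -> List[str]:
    """Phase 1 (pre-processing): locate records using document structure.

    Assumption: records are blocks separated by one or more blank lines.
    """
    return [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]


def classify(token: str) -> str:
    """Map a token to the input symbol class read by the transducer."""
    t = token.lower()
    if t == "gram":
        return "GRAM"
    if t in ("positive", "negative"):
        return "SIGN"
    if t == "optimum":
        return "OPT"
    if t.isdigit():
        return "NUM"
    return "OTHER"


# (state, input class) -> (next state, output field or None)
TRANSITIONS = {
    ("scan", "GRAM"): ("after_gram", None),
    ("after_gram", "SIGN"): ("scan", "gram_stain"),
    ("scan", "OPT"): ("after_opt", None),
    ("after_opt", "NUM"): ("scan", "optimum_temp_c"),
}


def run_transducer(record: str) -> Dict[str, str]:
    """Phase 2: read tokens, follow transitions, emit field values."""
    state = "scan"
    output: Dict[str, str] = {}
    for token in re.findall(r"[A-Za-z]+|\d+", record):
        key = (state, classify(token))
        if key not in TRANSITIONS:
            # No transition from the current state: retry from "scan".
            key = ("scan", classify(token))
        state, field = TRANSITIONS.get(key, ("scan", None))
        if field is not None:
            output[field] = token.lower()
    return output


def extract(text: str) -> List[Dict[str, str]]:
    """Run both phases and return one dictionary per located record."""
    rows = []
    for record in split_records(text):
        name, _, body = record.partition("\n")
        row = {"organism": name.strip()}
        row.update(run_transducer(body))
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in extract(SAMPLE):
        print(row)
```

Because the transition table is plain data, it can be edited or swapped without touching the pre-processing step, which loosely mirrors the abstract's point that the transducers can be modified and reused on other pre-processed documents.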