Towards a database for genotype-phenotype association research: mining data from encyclopaedia
International Journal of Data Mining and Bioinformatics
The paper presents a new method, based on finite state transducers, for extracting information from semi-structured resources. The method has two clearly distinguished phases. The first, a pre-processing phase, relies heavily on analysis of the document structure and is used to locate data records in the text. The second phase applies finite state transducers built to extract information from those records. The transducers can be modified to achieve the desired efficiency and can be reused to extract information from other pre-processed documents. We conclude that even untagged text can be treated as semi-structured, provided that its structure can be successfully pre-processed. Using this method, we extracted data from free-form encyclopedia text and created a fully structured database of the genotype and phenotype characteristics of organisms.
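The two phases described above can be illustrated with a minimal Python sketch. It is not the implementation from the paper: the blank-line record separator, the token patterns, the state names, the field names and the sample text are all assumptions invented for illustration, and the transducer is a deliberately tiny table-driven one.

```python
import re

# Invented sample text, standing in for free-form encyclopedia entries.
SAMPLE = """Escherichia coli is a Gram-negative,
rod-shaped bacterium.

Staphylococcus aureus is a Gram-positive, sphere-shaped bacterium."""


def locate_records(text):
    """Phase 1 (pre-processing): locate data records in the text.

    Assumption: one record per block, blocks separated by blank lines.
    """
    blocks = re.split(r"\n\s*\n", text)
    return [" ".join(b.split()) for b in blocks if b.strip()]


# Input symbol classes for the transducer (hypothetical patterns).
PATTERNS = [
    ("GRAM", re.compile(r"^Gram-(positive|negative)$")),
    ("SHAPE", re.compile(r"^(rod|sphere|spiral)-shaped$")),
]


def classify(token):
    """Map a token to (input symbol class, value to emit)."""
    for name, pattern in PATTERNS:
        match = pattern.match(token)
        if match:
            return name, match.group(1)
    return "WORD", token


# Transition table: (state, input class) -> (next state, output field).
# An output field of None means the token is consumed without output.
TRANSITIONS = {
    ("genus", "WORD"): ("species", "organism"),
    ("species", "WORD"): ("body", "organism"),
    ("body", "GRAM"): ("body", "gram_stain"),
    ("body", "SHAPE"): ("body", "shape"),
    ("body", "WORD"): ("body", None),
}


def transduce(record):
    """Phase 2: run the transducer over one record, return extracted fields."""
    state, fields = "genus", {}
    for raw in record.split():
        token = raw.strip(".,;")
        if not token:
            continue
        symbol, value = classify(token)
        if (state, symbol) not in TRANSITIONS:
            break  # no transition defined: stop reading this record
        state, field = TRANSITIONS[(state, symbol)]
        if field == "organism":
            # The organism name is built from two consecutive tokens.
            fields["organism"] = (fields.get("organism", "") + " " + value).strip()
        elif field is not None:
            fields[field] = value
    return fields


def extract(text):
    """Apply both phases and return one dictionary per record."""
    return [transduce(record) for record in locate_records(text)]
```

Calling `extract(SAMPLE)` yields one dictionary per located record, which could then be loaded into a database table. Because the extraction rules live in the `TRANSITIONS` table rather than in the control flow, the same transducer can be adjusted or reused on other pre-processed documents, which mirrors the reuse property claimed above.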