Effect of OCR-errors on the transformation of semi-structured text data into relational database

Authors:
Kolyo Z. Onkov
Affiliations:
Agricultural University, Plovdiv, Bulgaria
Venue:
Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data
Year:
2009

Abstract

Paper-based guides and reference books in pharmacology, veterinary medicine, and crop protection are often presented as semi-structured text data. "Key words", for instance the names of diseases and drugs, and the relationships between them are of great importance for obtaining useful information (advice, instructions, etc.). Defining these relationships is a significant problem when the aim is to transform a relatively large amount of semi-structured text data into an intelligent computer-based system. The paper briefly presents the detection and correction of OCR errors in the process of transforming a Bulgarian crop protection reference book into a relational database. The resulting solution substantially changes the form in which the data are presented and accessed, without changing the essence of the data itself.