Language modelling for the needs of OCR of medical texts

Authors:
Maciej Piasecki; Grzegorz Godlewski
Affiliations:
Institute of Applied Informatics, Wrocław University of Technology, Wrocław, Poland (both authors)
Venue:
ISBMDA'06 Proceedings of the 7th international conference on Biological and Medical Data Analysis
Year:
2006

Abstract

This paper discusses different methods of constructing language models for a corpus of medical texts written in an inflective language, namely Polish. The main result is a proposed method of language modelling which sequentially combines tri-grams of morphological base forms with tri-grams of words. Introducing base form tri-grams improved the overall performance of the combined model, measured as the improvement in the accuracy of handwriting OCR, as well as its ability to generalise. The latter was shown by using corpora of two different types for training and for testing. Detailed results of tests run on a large corpus of real-life medical language are discussed in the paper. An experimental system for OCR of handwritten epicrises that uses the proposed model is presented. The proposed language model decreases the overall error of the system by 64.2% (51% when the training and test corpora are of different types).
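
The abstract does not say how the two tri-gram models are combined "sequentially", so the following Python sketch shows only one possible reading, under stated assumptions: a base form tri-gram model first prunes the candidate word sequences produced by the recogniser, and a word tri-gram model then picks the best survivor. The class `TrigramModel`, the function `rerank`, the `keep` parameter, the add-one smoothing, the toy corpus and the lemma dictionary are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import product


class TrigramModel:
    """Add-one smoothed tri-gram model (smoothing choice is an assumption)."""

    def __init__(self, sentences):
        self.tri, self.bi, self.vocab = Counter(), Counter(), set()
        for sent in sentences:
            toks = ["<s>", "<s>"] + list(sent) + ["</s>"]
            self.vocab.update(toks)
            for i in range(2, len(toks)):
                self.tri[tuple(toks[i - 2:i + 1])] += 1
                self.bi[tuple(toks[i - 2:i])] += 1

    def logprob(self, sent):
        toks = ["<s>", "<s>"] + list(sent) + ["</s>"]
        v = len(self.vocab) + 1  # one extra slot for unseen tokens
        return sum(
            math.log((self.tri[tuple(toks[i - 2:i + 1])] + 1)
                     / (self.bi[tuple(toks[i - 2:i])] + v))
            for i in range(2, len(toks))
        )


def rerank(candidates, lemma_of, base_model, word_model, keep=2):
    """Two-stage selection: base form tri-grams prune, word tri-grams decide."""
    sequences = [list(s) for s in product(*candidates)]
    # Stage 1: score each sequence by its base forms and keep the best few.
    survivors = sorted(
        sequences,
        key=lambda s: base_model.logprob([lemma_of.get(w, w) for w in s]),
        reverse=True,
    )[:keep]
    # Stage 2: choose among the survivors with the word-level model.
    return max(survivors, key=word_model.logprob)


# Toy data (hypothetical): two training sentences and a small lemma dictionary.
corpus = [["pacjent", "przyjęty", "do", "szpitala"],
          ["pacjentka", "przyjęta", "do", "szpitala"]]
lemma_of = {"przyjęty": "przyjąć", "przyjęta": "przyjąć", "szpitala": "szpital"}

word_model = TrigramModel(corpus)
base_model = TrigramModel([[lemma_of.get(w, w) for w in s] for s in corpus])

# Per-position word candidates, as a handwriting recogniser might return them.
candidates = [["pacjent", "patent"], ["przyjęty", "przyjęta"], ["do"], ["szpitala"]]
best = rerank(candidates, lemma_of, base_model, word_model)
print(best)  # ['pacjent', 'przyjęty', 'do', 'szpitala']
```

In this toy run the base form model discards the sequences starting with "patent", and the word model then prefers the inflected form that agrees with "pacjent". The motivation for a base form stage in an inflective language is that base forms are far less sparse than inflected word forms, so their tri-gram counts are more reliable on a limited corpus.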