Identification of Embedded Mathematical Expressions in Scanned Documents

  • Authors:
  • Utpal Garain;B. B. Chaudhuri;A. Ray Chaudhuri

  • Affiliations:
  • Indian Statistical Institute, India;Indian Statistical Institute, India;Jadavpur University, India

  • Venue:
  • ICPR '04 Proceedings of the Pattern Recognition, 17th International Conference on (ICPR'04) Volume 1 - Volume 01
  • Year:
  • 2004
  • Mathematical Symbol Indexing

    AI*IA '09: Proceedings of the XIth International Conference of the Italian Association for Artificial Intelligence Reggio Emilia on Emergent Perspectives in Artificial Intelligence

Quantified Score

Hi-index 0.00

Visualization

Abstract

Efficient extraction of mathematical expressions is considered as an important pre-processing step to apply existing OCR systems to convert scientific papers into their electronic format. In this correspondence, a technique for extracting embedded (or in-line) expressions has been presented. The proposed method for expression extraction initially invokes an existing OCR to recognize the input document. Several features including word n-grams (a statistical analysis of a corpus of scientific documents reveals that the word level n-gram profile for sentences containing embedded expressions is quite different from that of the sentences without any expression) are computed on sentence level to spot sentences containing expressions. Expression zones are pin pointed by exploiting OCR inability to handle expressions and by using some common typographical aspects followed in typing mathematical expressions. Experimental results on a considerable size of dataset show high efficiency of the proposed technique.