Distinguishing Mathematics Notation from English Text using Computational Geometry

Authors:
Derek M. Drake; Henry S. Baird
Affiliations:
Lehigh University, Bethlehem, PA, USA; Lehigh University, Bethlehem, PA, USA
Venue:
ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition
Year:
2005

Citing 3
Segmentation of page images using the area Voronoi diagram. Computer Vision and Image Understanding, Special issue on document image understanding and retrieval.
Automated Segmentation of Math-Zones from Document Images. ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition, Volume 2.
Pattern Classification (2nd Edition).

Cited 1
A Unified Algorithm for Identification of Various Tabular Structures from Document Images. International Journal of Digital Library Systems.


Abstract

We describe a trainable method for distinguishing mathematics notation from natural language (here, English) in images of textlines. The method relies on computational geometry alone, with no assistance from symbol recognition. Its input is a "neighbor graph" extracted from a bilevel image of an isolated textline by the method of Kise [8]: a pruned form of the Delaunay triangulation of the locations of the black connected components. The method first classifies each vertex and, separately, each edge of the neighbor graph as belonging to math or English; these results are then combined to classify the entire textline. All three classifiers (vertex, edge, and textline) are automatically trainable. Features for the vertex and edge classifiers were selected semi-manually from a large pool of candidates in a process driven by training data; this stage could potentially be fully automated. In experiments on images scanned from books and on synthetically generated images, this methodology converged in three iterations to a textline classifier with an error rate of less than one percent.
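
The combining step can be illustrated with a minimal Python sketch. It is not the authors' classifier: the abstract does not say how vertex and edge results are combined, so the scoring rule (the average of the math fractions among vertices and among edges), the threshold-based training, and the names `LabeledGraph`, `math_score`, `train_threshold`, and `classify_textline` are all assumptions made for illustration. The per-vertex and per-edge labels are taken as given, as if produced by two already-trained classifiers.

```python
from dataclasses import dataclass


@dataclass
class LabeledGraph:
    """Hypothetical container for one textline's neighbor graph after labeling."""
    vertex_labels: list  # one "math"/"english" label per connected component
    edge_labels: list    # one "math"/"english" label per neighbor-graph edge


def math_score(graph):
    """Average of the math fractions among vertices and among edges (assumed rule)."""
    def fraction(labels):
        if not labels:
            return 0.0
        return sum(1 for lab in labels if lab == "math") / len(labels)
    return 0.5 * (fraction(graph.vertex_labels) + fraction(graph.edge_labels))


def train_threshold(graphs, truths):
    """Pick the score cutoff that makes the fewest errors on labeled training lines."""
    scores = [math_score(g) for g in graphs]
    distinct = sorted(set(scores))
    # Candidate cutoffs: midpoints between neighboring scores, plus both extremes.
    midpoints = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    candidates = [0.0] + midpoints + [1.0 + 1e-9]

    def errors(cut):
        return sum((s >= cut) != (t == "math") for s, t in zip(scores, truths))

    return min(candidates, key=errors)


def classify_textline(graph, threshold):
    """Label a whole textline from its vertex and edge labels."""
    return "math" if math_score(graph) >= threshold else "english"
```

Under these assumptions, training reduces to calling `train_threshold` on a list of labeled graphs with their ground-truth line labels, and a new line is classified by passing its labeled graph and the learned threshold to `classify_textline`. A single scalar threshold keeps the sketch short; any trainable classifier over the same two fractions would fit the description equally well.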