A language for specifying and comparing table recognition strategies

  • Authors:
  • Dorothea Blostein;James R. Cordy;Richard Zanibbi

  • Affiliations:
  • Queen's University at Kingston (Canada);Queen's University at Kingston (Canada);Queen's University at Kingston (Canada)

  • Venue:
  • A language for specifying and comparing table recognition strategies
  • Year:
  • 2005

Abstract

Table recognition algorithms may be described by models of table location and structure, and decisions made relative to these models. These algorithms are usually defined informally as a sequence of decisions with supporting data observations and transformations. In this investigation, we formalize these algorithms as strategies in an imitation game, where the goal of the game is to match table interpretations from a chosen procedure as closely as possible. The chosen procedure may be a person or persons producing ‘ground truth,’ or an algorithm. To describe table recognition strategies we have defined the Recognition Strategy Language (RSL). RSL is a simple functional language for describing strategies as sequences of abstract decision types whose results are determined by any suitable decision method. RSL defines and maintains interpretation trees, a simple data structure for describing recognition results. For each interpretation in an interpretation tree, we annotate hypothesis histories which capture the creation, revision, and rejection of individual hypotheses, such as the logical type and structure of regions. We present a proof-of-concept using two strategies from the literature. We demonstrate how RSL allows strategies to be specified at the level of decisions rather than algorithms, and we compare results of our strategy implementations using new techniques. In particular, we introduce historical recall and precision metrics. Conventional recall and precision characterize hypotheses accepted after a strategy has finished. Historical recall and precision provide additional information by describing all generated hypotheses, including any rejected in the final result.
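
The difference between conventional and historical recall and precision can be illustrated with a minimal Python sketch. It is not RSL and not the authors' implementation. The `HypothesisHistory` class, the function names, and the sample data are hypothetical. It also assumes one plausible reading of the abstract: the conventional metrics are computed over the hypotheses still accepted at the end, and the historical metrics over every hypothesis ever generated, with both compared against the same ground truth.

```python
from dataclasses import dataclass, field


@dataclass
class HypothesisHistory:
    """Hypothetical record of hypotheses a strategy generated and later rejected."""
    generated: set = field(default_factory=set)
    rejected: set = field(default_factory=set)

    def propose(self, hypothesis):
        # A (re)proposed hypothesis counts as generated and currently accepted.
        self.generated.add(hypothesis)
        self.rejected.discard(hypothesis)

    def reject(self, hypothesis):
        # Rejection keeps the hypothesis in the history; it is only marked.
        if hypothesis in self.generated:
            self.rejected.add(hypothesis)

    @property
    def accepted(self):
        # Hypotheses that survive in the final result.
        return self.generated - self.rejected


def recall(hypotheses, truth):
    # Fraction of ground-truth items found among the hypotheses.
    return len(hypotheses & truth) / len(truth) if truth else 1.0


def precision(hypotheses, truth):
    # Fraction of hypotheses that are in the ground truth.
    return len(hypotheses & truth) / len(hypotheses) if hypotheses else 1.0


# Assumed sample data: (logical type, region id) pairs.
truth = {("cell", "r1"), ("cell", "r2"), ("header", "r3")}

history = HypothesisHistory()
for h in [("cell", "r1"), ("cell", "r2"), ("header", "r3"), ("cell", "r4")]:
    history.propose(h)
history.reject(("cell", "r2"))  # a correct hypothesis, wrongly rejected
history.reject(("cell", "r4"))  # an incorrect hypothesis, rightly rejected

conventional = (recall(history.accepted, truth), precision(history.accepted, truth))
historical = (recall(history.generated, truth), precision(history.generated, truth))

print("conventional recall, precision:", conventional)
print("historical recall, precision:", historical)
```

In this made-up run, conventional recall is 2/3 because a correct hypothesis was rejected, while historical recall is 1.0, which shows that the strategy did generate every correct hypothesis at some point. Historical precision drops to 0.75 because the rejected incorrect hypothesis still counts among those generated.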