TSV-LR: topological signature vector-based lexicon reduction for fast recognition of pre-modern Arabic subwords

  • Authors:
  • Youssouf Chherawala; Robert Wisnovsky; Mohamed Cheriet

  • Affiliations:
  • Synchromedia Laboratory, École de Technologie Supérieure, Montreal (QC), Canada; McGill University, Montreal (QC), Canada; Synchromedia Laboratory, École de Technologie Supérieure, Montreal (QC), Canada

  • Venue:
  • Proceedings of the 2011 Workshop on Historical Document Imaging and Processing
  • Year:
  • 2011

Abstract

Automatic recognition of Arabic words is a challenging task, and its complexity increases as the lexicon grows. In pre-modern documents the vocabulary is unconstrained, so a lexicon-reduction strategy is needed to lower the computational cost of recognition. This paper proposes a novel lexicon-reduction method for Arabic subwords based on the topology and geometry of their shapes. First, the topological and geometrical information of a subword shape is extracted from its skeleton and encoded into a graph. The graph is then converted into a topological signature vector (TSV), which preserves the graph structure. The lexicon is reduced by computing the TSV distance between a query shape and the shapes of the lexicon subwords and keeping the i nearest subwords. The value of i is chosen to meet a predetermined lexicon-reduction accuracy. The proposed framework has been tested on a database of pre-modern Arabic subword shapes, with promising results.
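
The reduction step described in the abstract (rank lexicon entries by TSV distance to the query, keep the i nearest, and pick i to reach a target accuracy) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the distance measure, so Euclidean distance is assumed here, TSVs are assumed to be fixed-length numeric tuples that have already been computed, and the names `reduce_lexicon` and `select_i` are hypothetical.

```python
import math


def reduce_lexicon(query_tsv, lexicon, i):
    """Return the labels of the i lexicon entries nearest to the query.

    lexicon is a list of (label, tsv) pairs. Euclidean distance between
    TSVs is an assumption made for this sketch.
    """
    ranked = sorted(lexicon, key=lambda entry: math.dist(query_tsv, entry[1]))
    return [label for label, _ in ranked[:i]]


def select_i(validation, lexicon, target_accuracy):
    """Return the smallest i that meets the target reduction accuracy.

    validation is a list of (true_label, tsv) pairs whose labels all
    occur in the lexicon. Reduction accuracy is taken to be the fraction
    of validation queries whose true label survives the reduction.
    """
    ranks = []
    for true_label, tsv in validation:
        ranked = reduce_lexicon(tsv, lexicon, len(lexicon))
        ranks.append(ranked.index(true_label) + 1)  # 1-based rank of the truth
    ranks.sort()
    # Number of validation queries that must keep their true label.
    needed = max(1, math.ceil(target_accuracy * len(ranks)))
    return ranks[needed - 1]
```

A larger i keeps more candidates and so raises the chance that the correct subword survives, at the price of a weaker reduction; `select_i` makes that trade-off explicit by choosing the smallest i that satisfies the accuracy target on held-out data.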