An interactive system to extract structured text from a geometrical representation

  • Authors:
  • Benoit Poirier; Michel Dagenais

  • Venue:
  • ICDAR '97 Proceedings of the 4th International Conference on Document Analysis and Recognition
  • Year:
  • 1997

Abstract

The proliferation of electronic document formats impedes the dissemination and management of documents: indexing and navigation require a common format that carries structural information. Some formats make it easy to decode and preserve document structure, but often the only easily obtainable representation is PostScript, in which only the geometrical information remains. Even if an organization is willing to move all its document-producing activities to a structure-preserving format such as HTML, its existing documents still need to be converted. This paper addresses the difficult problem of extracting the structure of a document from a geometrical representation. An interactive tool has been developed that extracts document content and structure from a geometric representation (PostScript). It successfully analyzes several documents produced with different tools and outputs the structural information in the HyperText Markup Language (HTML). The end user is presented with the extracted document structure and can interactively modify it if needed. The tool is easily extended to recognize new constructs and is aimed at organizations that need to convert numerous documents for searching and browsing on intranets or on the Internet.
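
To make the idea of recovering structure from geometry concrete, the following is a minimal sketch and not the authors' tool or algorithm. It assumes the text has already been pulled out of the page description as positioned runs; the `Fragment` record, the function names, and the thresholds (`tolerance`, `heading_ratio`, `paragraph_gap`) are all hypothetical choices made for illustration. The heuristic is deliberately simple: runs on the same baseline form a line, lines set noticeably larger than the median font size are treated as headings, and a large vertical gap starts a new paragraph.

```python
from dataclasses import dataclass
from html import escape
from statistics import median


@dataclass
class Fragment:
    """A positioned text run (hypothetical record, for illustration only)."""
    text: str
    x: float          # left edge, in points
    y: float          # baseline, in points, measured from the page bottom
    font_size: float  # in points


def group_lines(fragments, tolerance=2.0):
    """Group runs whose baselines lie within `tolerance` points into lines."""
    lines = []
    # Visit runs from the top of the page down (larger y first).
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0].y - frag.y) <= tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Within each line, read left to right.
    return [sorted(line, key=lambda f: f.x) for line in lines]


def to_html(fragments, heading_ratio=1.2, paragraph_gap=1.5):
    """Emit <h2>/<p> markup from geometry alone, using assumed thresholds."""
    lines = group_lines(fragments)
    if not lines:
        return ""
    body_size = median(f.font_size for f in fragments)
    out, paragraph, prev_y = [], [], None

    def flush():
        # Close the paragraph being accumulated, if any.
        if paragraph:
            out.append("<p>" + escape(" ".join(paragraph)) + "</p>")
            paragraph.clear()

    for line in lines:
        text = " ".join(f.text for f in line)
        size = max(f.font_size for f in line)
        y = line[0].y
        if size >= heading_ratio * body_size:
            # Noticeably larger than body text: treat as a heading.
            flush()
            out.append("<h2>" + escape(text) + "</h2>")
        else:
            # A large vertical gap since the previous line starts a new paragraph.
            if paragraph and prev_y - y > paragraph_gap * size:
                flush()
            paragraph.append(text)
        prev_y = y
    flush()
    return "\n".join(out)


page = [
    Fragment("Introduction", 72, 700, 14),
    Fragment("formats", 180, 680, 10),
    Fragment("The proliferation of", 72, 680, 10),
    Fragment("impedes dissemination.", 72, 668, 10),
    Fragment("A second paragraph.", 72, 640, 10),
]
print(to_html(page))
```

A real converter would need far more than this (multi-column layout, lists, tables, fonts without reliable sizes), which is one reason the abstract stresses letting the end user correct the extracted structure interactively.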