GroundTruth tools & technology: applications in real world

  • Authors:
  • Vinay Saxena;Sherif Yacoub

  • Affiliations:
  • Hewlett-Packard TSG, TX;Hewlett-Packard Labs, Spain

  • Venue:
  • Proceedings of the 2005 ACM symposium on Document engineering
  • Year:
  • 2005

Abstract

Creating digital archives from paper-based documents is gaining popularity. Automated systems and frameworks for document analysis have been developed, but they still fall short of the required accuracy in tasks such as text and article identification. Rendering problems, such as missing graphical components, wrong reading order in multi-column journals and magazines, missing indentation, broken text lines, and hyphenation issues, are mainly due to poor layout information extracted from the scanned document during the OCR process. Also lacking are tools that take the output of these processes and create highly accurate content, with associated metadata, from the original. In the current context, the term "Ground Truth" refers to the process (automatic and manual collectively) by which we ensure that the end result is highly accurate, complete, rich-text content (articles, papers, etc.) generated from the original scanned version. We present PerfectDoc, a suite of tools for manual GroundTruthing. The suite consists of tools to create highly accurate GroundTruth, GT editors, and tools that take this data and deliver output suitable for web-based viewing.