Meta-algorithmic systems for document classification

  • Authors: Steven J. Simske, David W. Wright, Margaret Sturgill

  • Affiliations: Hewlett-Packard Labs (all authors)

  • Venue: Proceedings of the 2006 ACM Symposium on Document Engineering
  • Year: 2006

Abstract

To address cost and regulatory concerns, many businesses are converting paper-based elements of their workflows into fully electronic flows that use the content of the documents. Scanning document contents into workflows, however, is a manual, error-prone, and costly process, especially when data extraction requires high accuracy. These manual costs are a primary barrier to widespread adoption of distributed capture solutions for business-critical workflows such as insurance claims, medical records, or loan applications. Software solutions using artificial intelligence and natural language processing techniques are emerging to address these needs, but each has its own strengths and weaknesses, and none has demonstrated a high level of accuracy across the many unstructured document types included in these business-critical workflows. This paper describes how to overcome many of these limitations by intelligently combining multiple approaches to document classification using meta-algorithmic design patterns. These patterns explore the error space across multiple engines and provide improved, "emergent" results in comparison with voting schemes and with the output of any individual engine. The paper first considers the results of the individual engines along with traditional combinatorial techniques such as voting, and then describes prototype results for a variety of novel meta-algorithmic patterns that reduce individual document error rates by up to 13% and reduce system error rates by up to 38%.
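
For readers unfamiliar with the voting baseline the abstract compares against, the following is a minimal sketch of combining the outputs of several classification engines. It is an illustration only and not the authors' meta-algorithmic patterns: the function names, the tie-breaking rule, the example labels, and the per-engine weights are all assumptions introduced here.

```python
from collections import Counter, defaultdict


def plurality_vote(predictions):
    """Return the label predicted by the most engines.

    Ties are broken in favour of the earliest engine in the list
    (an assumption made for this sketch).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def weighted_vote(predictions, weights):
    """Return the label with the largest total engine weight.

    `weights` holds one hypothetical reliability score per engine,
    for example each engine's accuracy on held-out documents.
    """
    if len(predictions) != len(weights) or not predictions:
        raise ValueError("need one weight per prediction")
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return max(scores, key=scores.get)


# Three hypothetical engines classify the same scanned document.
engine_outputs = ["invoice", "claim", "claim"]
print(plurality_vote(engine_outputs))
print(weighted_vote(engine_outputs, [0.9, 0.3, 0.4]))
```

In this example the two schemes disagree: plurality voting follows the two engines that agree, while weighted voting follows the single engine assumed to be most reliable. Disagreements of this kind are one way to see why the choice of combination rule matters.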