Meta-algorithmic systems for document classification

Authors:
Steven J. Simske;David W. Wright;Margaret Sturgill
Affiliations:
Hewlett-Packard Labs;Hewlett-Packard Labs;Hewlett-Packard Labs
Venue:
Proceedings of the 2006 ACM symposium on Document engineering
Year:
2006

Abstract

To address cost and regulatory concerns, many businesses are converting paper-based elements of their workflows into fully electronic flows that use the content of the documents. Scanning the document contents into workflows, however, is a manual, error-prone, and costly process, especially when the data extraction process requires high accuracy. These manual costs are a primary barrier to widespread adoption of distributed capture solutions for business-critical workflows such as insurance claims, medical records, or loan applications. Software solutions using artificial intelligence and natural language processing techniques are emerging to address these needs, but each has its own strengths and weaknesses, and none has demonstrated a high level of accuracy across the many unstructured document types included in these business-critical workflows. This paper describes how to overcome many of these limitations by intelligently combining multiple approaches for document classification using meta-algorithmic design patterns. These patterns explore the error space in multiple engines and provide improved and "emergent" results in comparison to voting schemes and to the output of any of the individual engines. This paper considers the results of the individual engines along with traditional combinatorial techniques such as voting, before describing prototype results for a variety of novel meta-algorithmic patterns that reduce individual document error rates by up to 13% and reduce system error rates by up to 38%.
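
For readers unfamiliar with the voting baseline that the abstract compares against, the following is a minimal Python sketch of plurality voting over the outputs of several classification engines. It is an illustration only and is not the paper's meta-algorithmic patterns, which the abstract does not specify. The function names (`plurality_vote`, `error_rate`), the three engines, the labels, and the tie-breaking rule (the earliest engine's label wins) are all assumptions made for this example, and the data is invented.

```python
from collections import Counter
from typing import Sequence


def plurality_vote(predictions: Sequence[str]) -> str:
    """Return the label chosen by the most engines.

    Assumed tie-break: the label proposed by the earliest engine wins.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    best = max(counts.values())
    # Walk the predictions in engine order so that ties are resolved
    # in favor of the first engine that proposed a top-scoring label.
    for label in predictions:
        if counts[label] == best:
            return label
    raise AssertionError("unreachable: a top-scoring label always exists")


def error_rate(predicted: Sequence[str], truth: Sequence[str]) -> float:
    """Fraction of documents whose predicted label differs from the truth."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    wrong = sum(p != t for p, t in zip(predicted, truth))
    return wrong / len(truth)


# Invented outputs of three hypothetical engines on four documents.
truth = ["claim", "loan", "medical", "medical"]
engine_a = ["claim", "loan", "claim", "medical"]
engine_b = ["claim", "claim", "medical", "medical"]
engine_c = ["loan", "loan", "medical", "loan"]

# Combine the engines document by document.
combined = [plurality_vote(votes) for votes in zip(engine_a, engine_b, engine_c)]

for name, output in [("A", engine_a), ("B", engine_b), ("C", engine_c),
                     ("vote", combined)]:
    print(f"{name}: error rate {error_rate(output, truth):.2f}")
```

In this toy data the engines make their errors on different documents, so the vote corrects each of them. When engines share the same errors, voting cannot recover, which is the kind of limitation the abstract says the meta-algorithmic patterns are meant to address.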