Continuous user feedback learning for data capture from business documents

  • Authors:
  • Marcel Hanke; Klemens Muthmann; Daniel Schuster; Alexander Schill; Kamil Aliyev; Michael Berger

  • Affiliations:
  • Computer Networks, Dept. of Computer Science, TU Dresden, Dresden, Germany (Hanke, Muthmann, Schuster, Schill); DocuWare AG, Germering, Germany (Aliyev, Berger)

  • Venue:
  • HAIS'12: Proceedings of the 7th International Conference on Hybrid Artificial Intelligent Systems, Volume Part II
  • Year:
  • 2012

Abstract

Automatically processing production documents requires document type detection as well as data capture, which finds the appropriate index data in a post-OCR representation of the document. Current learning-based methods perform quite well because many similar documents are created from the same template, but their machine learning models require intensive training and are hard to update frequently. We provide a method for continuously incorporating user feedback into a layout-based extraction process that supports immediate learning while limiting the size of the model. The method is evaluated on a tagged corpus of more than 5,000 business documents. It not only allows continuous re-training of the model, thus adapting it to new document templates, but also allows starting from scratch with an empty model, requiring less than 10% of the corpus as training documents to reach an accuracy measure of more than 80%.