Continuous user feedback learning for data capture from business documents

Authors:
Marcel Hanke; Klemens Muthmann; Daniel Schuster; Alexander Schill; Kamil Aliyev; Michael Berger
Affiliations:
Computer Networks, Dept. of Computer Science, TU Dresden, Dresden, Germany; Computer Networks, Dept. of Computer Science, TU Dresden, Dresden, Germany; Computer Networks, Dept. of Computer Science, TU Dresden, Dresden, Germany; Computer Networks, Dept. of Computer Science, TU Dresden, Dresden, Germany; DocuWare AG, Germering, Germany; DocuWare AG, Germering, Germany
Venue:
HAIS'12 Proceedings of the 7th international conference on Hybrid Artificial Intelligent Systems - Volume Part II
Year:
2012

Abstract

Automatically processing production documents requires document type detection as well as data capture, which finds the appropriate index data in a post-OCR representation of the document. Current learning-based methods perform quite well because many similar documents are created from the same template, but their machine learning models require intensive training and are hard to update frequently. We provide a method for continuously incorporating user feedback into a layout-based extraction process that both learns immediately and limits the size of the model. The method is evaluated on a tagged corpus of more than 5,000 business documents. It allows continuous re-training of the model, thus adapting it to new document templates, and also allows starting from scratch with an empty model, which requires less than 10% of the corpus as training documents to reach an accuracy measure of more than 80%.
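
The abstract names two requirements, immediate learning from user feedback and a bounded model size, without describing the mechanism. The Python sketch below only illustrates how those two requirements can coexist; it is not the authors' method. Every name in it (`Box`, `FeedbackModel`, `template_key`, `max_templates`) is hypothetical, and the least-recently-used eviction policy is an assumption chosen for simplicity.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Box:
    """Hypothetical position of a field value in normalized page coordinates."""
    x: float
    y: float


class FeedbackModel:
    """Illustrative bounded store: template key -> {field name -> last confirmed Box}.

    Assumption: documents from the same template share a key, and the model
    is capped by evicting the least recently used template.
    """

    def __init__(self, max_templates=1000):
        self.max_templates = max_templates
        self._templates = OrderedDict()

    def predict(self, template_key: str, field: str) -> Optional[Box]:
        """Return the stored position for a field, or None if nothing is known."""
        fields = self._templates.get(template_key)
        if fields is None:
            return None
        self._templates.move_to_end(template_key)  # mark template as recently used
        return fields.get(field)

    def feedback(self, template_key: str, field: str, corrected: Box) -> None:
        """Apply a user correction immediately, then enforce the size limit."""
        fields = self._templates.setdefault(template_key, {})
        fields[field] = corrected
        self._templates.move_to_end(template_key)
        while len(self._templates) > self.max_templates:
            self._templates.popitem(last=False)  # drop least recently used template
```

In this sketch a model created with `FeedbackModel()` starts empty, so `predict` returns `None` until the first correction arrives through `feedback`; the next document with the same template key then benefits from that correction without any batch re-training step.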