Sample-based collection and adjustment algorithm for metadata extraction parameter of flexible format document

Authors:
Toshiko Matsumoto;Mitsuharu Oba;Takashi Onoyama
Affiliations:
Research and Development Division, Hitachi Software Engineering Co., Ltd., Tokyo, Japan;Research and Development Division, Hitachi Software Engineering Co., Ltd., Tokyo, Japan;Research and Development Division, Hitachi Software Engineering Co., Ltd., Tokyo, Japan
Venue:
ICAISC'10 Proceedings of the 10th international conference on Artificial intelligence and soft computing: Part II
Year:
2010

Abstract

We propose an algorithm that automatically generates metadata extraction parameters. It first enumerates candidate parameters based on where metadata occurs in training documents, then examines those candidates to avoid side effects and to maximize effectiveness. This two-stage approach avoids an exponential explosion in computation while still allowing detailed optimization. An experiment on Japanese business documents shows that an automatically generated parameter extracts metadata as accurately as a manually adjusted one.
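
The two-stage idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say what a "parameter" looks like, so the sketch assumes a parameter is simply a keyword that precedes the metadata value on the same line. The names `Sample`, `enumerate_candidates`, `extract`, and `select_parameters` are hypothetical, and the greedy selection in stage 2 is one plausible reading of "maximize effectiveness".

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Sample:
    """One training document: its text lines and the true value of one field."""
    lines: Tuple[str, ...]
    value: str


def enumerate_candidates(samples):
    """Stage 1: collect keywords that directly precede the true value.

    Assumption: a candidate parameter is the text to the left of the
    labeled value on the line where that value occurs.
    """
    candidates = set()
    for s in samples:
        for line in s.lines:
            pos = line.find(s.value)
            if pos > 0:
                keyword = line[:pos].strip()
                if keyword:
                    candidates.add(keyword)
    return candidates


def extract(keyword, lines):
    """Return the text after the first occurrence of keyword, or None."""
    for line in lines:
        pos = line.find(keyword)
        if pos >= 0:
            return line[pos + len(keyword):].strip()
    return None


def select_parameters(samples):
    """Stage 2: reject candidates with side effects, then pick by coverage."""
    safe = {}
    for kw in enumerate_candidates(samples):
        results = [extract(kw, s.lines) for s in samples]
        # Side effect: the keyword fires on a document but yields a wrong value.
        if any(r is not None and r != s.value for r, s in zip(results, samples)):
            continue
        safe[kw] = {i for i, r in enumerate(results) if r is not None}

    chosen, covered = [], set()
    while True:
        # Greedy step: take the keyword that covers the most uncovered samples.
        best = max(sorted(safe), key=lambda k: len(safe[k] - covered), default=None)
        if best is None or not (safe[best] - covered):
            break
        chosen.append(best)
        covered |= safe[best]
    return chosen, covered


samples = [
    Sample(("Invoice No: A-100",), "A-100"),
    Sample(("Tel No: 03-1234", "Invoice No: B-200"), "B-200"),
    Sample(("No: C-300",), "C-300"),
]
chosen, covered = select_parameters(samples)
print(chosen, sorted(covered))
```

In this toy data, stage 1 produces the candidates "Invoice No:" and "No:". Stage 2 rejects "No:" because it would extract the telephone number from the second document, so only "Invoice No:" is kept and the third document stays uncovered. Because stage 1 only proposes keywords actually observed next to labeled values, stage 2 evaluates a small candidate set instead of every possible parameter combination.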