A Layout-Free Method for Extracting Elements from Document Images

Authors:
Tsukasa Kochi; Takashi Saitoh
Venue:
DAS '98 Selected Papers from the Third IAPR Workshop on Document Analysis Systems: Theory and Practice
Year:
1998

Abstract

SGML is a language for defining the layout structure of a document. Various attempts at generating SGML from a document image have not been successful. We focus on extracting some of the important layout elements by using a flexible matching strategy and easy model generation. Our proposed approach treats each extracted element as if it were independent. Segmented areas such as "title" or "author" are defined locally, which makes the system robust to shifting and noise. The system is also easy to operate. Because the system is not fully automatic, the user must supply typical models of each component. Our GUI presents the attributes of each segmented area alongside the original bitmap images, and the color-coded attributes make it easy to edit the extracted components. In experiments with 288 pages of test images, the proposed method was 95.6% correct across a wide range of documents. Using 145 pages of documents as a learning set, the system recognized 99.2% of feature sets from 148 various types of unknown documents.