Multi-page document analysis based on format consistency and clustering

Authors:
Liangcai Gao; Zhi Tang; Jing Fang; Xiaofan Lin
Affiliations:
Institute of Computer Science & Technology, Peking University, Beijing, 100871, China; Institute of Computer Science & Technology, Peking University, Beijing, 100871, China; Institute of Computer Science & Technology, Peking University, Beijing, 100871, China; Vobile Incorporation, Santa Clara, California, 95054, USA
Venue:
International Journal of Computer Applications in Technology
Year:
2010

Abstract

In multi-page documents, elements that belong to the same component usually share a regular format. We call this regularity 'document component intrinsic format consistency' (DCIFC). We present a new document analysis method based on DCIFC that complements traditional methods based on the visual characteristics of document elements. A key advantage is that DCIFC holds from document to document, so the method is not affected by layout variability, a major challenge in document analysis. The method uses clustering techniques to build statistical models and then applies those models to label document components. In this way, it adapts to a specific document by exploiting the format characteristics particular to its components. We apply the method to several document recognition tasks and show its superior performance.
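The abstract does not specify the clustering algorithm or the format features used, so the following is only a minimal sketch of the general idea: group elements by format similarity, treat each cluster centroid as a simple model, and label new elements by their nearest centroid. The `Element` fields, the `features` scaling, the distance threshold, the use of leader clustering, and the component names are all assumptions made for illustration, not the authors' method.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Element:
    """A text element with hypothetical format attributes."""
    text: str
    font_size: float
    bold: int      # 1 if bold, 0 otherwise
    indent: float  # left indent in points


def features(e):
    # Assumed format vector; bold and indent are rescaled so that they
    # are roughly comparable with font size.
    return (e.font_size, 4.0 * e.bold, e.indent / 10.0)


def cluster(elements, threshold=1.5):
    """Leader clustering: join the nearest cluster within `threshold`,
    otherwise start a new cluster."""
    clusters = []
    for e in elements:
        v = features(e)
        best = None
        for c in clusters:
            d = dist(v, c["centroid"])
            if d <= threshold and (best is None or d < best[0]):
                best = (d, c)
        if best is None:
            clusters.append({"centroid": v, "members": [e]})
        else:
            c = best[1]
            c["members"].append(e)
            n = len(c["members"])
            # Incremental update of the running mean.
            c["centroid"] = tuple(
                m + (x - m) / n for m, x in zip(c["centroid"], v)
            )
    return clusters


def label(element, clusters, names):
    """Return the name of the cluster whose centroid is nearest."""
    v = features(element)
    i = min(range(len(clusters)),
            key=lambda k: dist(v, clusters[k]["centroid"]))
    return names[i]


# Elements collected across the pages of one document (made-up data).
elements = [
    Element("1 Introduction", 14, 1, 0),
    Element("Body text of the first section.", 10, 0, 0),
    Element("2 Method", 14, 1, 0),
    Element("Body text of the second section.", 10, 0, 0),
    Element("Fig. 1. A caption.", 8, 0, 20),
]
clusters = cluster(elements)

# Cluster names are assigned by hand here; a real system would need
# some other way to tie clusters to component types.
names = {0: "heading", 1: "body", 2: "caption"}
print(label(Element("3 Results", 14, 1, 0), clusters, names))
```

Because the clusters are built from the document at hand, the centroids reflect that document's own formatting rather than a fixed layout template, which is the adaptivity the abstract describes.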