Abstract: We address the general problem of classifying machine-printed documents into genres. Layout is a critical factor in recognizing fine-grained genres, because the content features of such documents are similar. Document genre is determined from the layout structure detected in scanned binary images of the document pages, using no OCR results and minimal a priori knowledge of document logical structure. Our method uses attributed relational graphs (ARGs) to represent the layout structure of document instances and first-order random graphs (FORGs) to represent document genres. In this paper we develop our FORG-based genre classification method and present a comparative evaluation of our technique against a variety of statistical pattern classifiers. FORGs are capable of modeling the common layout structure within a document genre and are shown to significantly outperform traditional pattern classification techniques when fine-grained genre distinctions must be drawn.
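To make the idea of an attributed relational graph for page layout concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions, not the representation used in the paper: the `Region` class, the `spatial_relation` and `build_arg` helpers, the node attributes (area, width, height), and the coarse edge labels are all hypothetical choices made for this example. Nodes stand for layout regions detected on a page, and each directed edge carries an attribute describing how one region lies relative to another.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A layout region (e.g. a text block) given by its bounding box in pixels."""
    name: str
    x0: int
    y0: int
    x1: int
    y1: int

    @property
    def area(self) -> int:
        return (self.x1 - self.x0) * (self.y1 - self.y0)


def spatial_relation(a: Region, b: Region) -> str:
    """Hypothetical coarse edge attribute: where region b lies relative to a."""
    if b.y0 >= a.y1:
        return "below"
    if b.y1 <= a.y0:
        return "above"
    if b.x0 >= a.x1:
        return "right_of"
    if b.x1 <= a.x0:
        return "left_of"
    return "overlaps"


def build_arg(regions):
    """Return (node_attrs, edge_attrs) for a fully connected, directed ARG."""
    nodes = {
        r.name: {"area": r.area, "width": r.x1 - r.x0, "height": r.y1 - r.y0}
        for r in regions
    }
    edges = {
        (a.name, b.name): {"relation": spatial_relation(a, b)}
        for a in regions
        for b in regions
        if a is not b
    }
    return nodes, edges


# Assumed example page: a title above a two-column body.
page = [
    Region("title", 100, 50, 500, 100),
    Region("left_col", 50, 150, 290, 700),
    Region("right_col", 310, 150, 550, 700),
]
nodes, edges = build_arg(page)
```

In this sketch a document instance is reduced to attribute dictionaries on nodes and edges, with no use of OCR output. A genre model in the spirit of the abstract would then describe how such attributes vary across many instances of the same genre; that step is not shown here.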