Clustering document images using a bag of symbols representation

Authors:
Eugen Barbu;Pierre Heroux;Sebastien Adam;Eric Trupin
Affiliations:
CNRS FRE 2645 - Universite de Rouen, France;CNRS FRE 2645 - Universite de Rouen, France;CNRS FRE 2645 - Universite de Rouen, France;CNRS FRE 2645 - Universite de Rouen, France
Venue:
ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition
Year:
2005

Abstract

Document image classification is an important step in document image analysis. Based on classification results, we can tackle other tasks such as indexing, understanding, or navigation in document collections. Using a document representation and an unsupervised classification method, we may group documents that, from the user's point of view, constitute valid clusters. The semantic gap between a domain-independent document representation and the user's implicit representation can lead to unsatisfactory results. In this paper we describe document images based on frequently occurring symbols. This document description is created in an unsupervised manner and can be related to the domain knowledge. Using data mining techniques applied to a graph-based document representation, we find frequent and maximal subgraphs. For each document image, we construct a bag containing the frequent subgraphs found in it. This bag of "symbols" represents the description of a document. We present results obtained on a corpus of 60 graphical document images.
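
The bag-of-symbols idea in the abstract can be illustrated with a minimal Python sketch. It assumes the frequent-subgraph mining step has already run and has reduced each occurrence of a subgraph to a string label; the mining itself and the restriction to maximal subgraphs are not shown. The toy corpus, the label names, the `min_support` threshold, and the use of cosine similarity to compare bags are all illustrative assumptions, not details taken from the paper.

```python
from collections import Counter
from math import sqrt


def frequent_symbols(docs, min_support):
    """Return the symbols that occur in at least `min_support` documents."""
    support = Counter()
    for symbols in docs.values():
        support.update(set(symbols))  # count each symbol once per document
    return sorted(s for s, n in support.items() if n >= min_support)


def bag_of_symbols(symbols, vocabulary):
    """Count how often each vocabulary symbol occurs in one document."""
    counts = Counter(symbols)
    return [counts[s] for s in vocabulary]


def cosine_similarity(u, v):
    """Cosine similarity of two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical corpus: each document is the list of symbol occurrences
# (labels standing in for mined subgraphs) found in its image.
docs = {
    "plan_a": ["door", "door", "window", "stairs"],
    "plan_b": ["door", "window", "window"],
    "circuit_a": ["resistor", "resistor", "capacitor"],
    "circuit_b": ["resistor", "capacitor", "logo"],
}

# Keep only symbols that appear in at least two documents.
vocabulary = frequent_symbols(docs, min_support=2)

# One fixed-length count vector per document: its "bag of symbols".
bags = {name: bag_of_symbols(syms, vocabulary) for name, syms in docs.items()}

for name, bag in bags.items():
    print(name, bag)
print(cosine_similarity(bags["plan_a"], bags["plan_b"]))
```

Once every document is a fixed-length vector, any standard unsupervised clustering method can be applied to the bags. Here the two hypothetical floor plans end up close to each other and share no symbols with the two circuit diagrams.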