Document: a useful level for facing noisy data

Authors:
Hervé Déjean;Jean-Luc Meunier
Affiliations:
Xerox Research Centre Europe, Meylan, France;Xerox Research Centre Europe, Meylan, France
Venue:
AND '10 Proceedings of the fourth workshop on Analytics for noisy unstructured text data
Year:
2010

Abstract

In this paper we present a set of experiments on large digitized book collections showing that logical structures can be extracted with good quality when working at the document level. The proposed solution is a two-step method: first, specific logical elements are recognized by a given method; then, models for the recognized elements are generated by combining layout, content, and labeling information. Model inference is made possible by working at the document level, where document structures occur frequently. These inferred models, which combine several kinds of information, are used to correct noisy data, typically zoning, OCR, and labeling errors produced by earlier processing steps. The method is illustrated by the detection of two document structures, page numbers and chapter headings, both navigation elements required by digital libraries.
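
To make the idea of document-level correction concrete, the following is a minimal sketch for the page-number case. It is not the authors' implementation. It assumes a single Arabic-numbered sequence in which the printed number equals the page index plus a constant offset, and it stands in a simple majority vote for the model inference step. The function names `infer_offset` and `correct_page_numbers` and the sample input are hypothetical.

```python
from collections import Counter
from typing import List, Optional


def infer_offset(ocr_numbers: List[Optional[int]]) -> Optional[int]:
    """Return the most frequent (printed number - page index) offset, or None.

    Assumption: one numbering sequence covers the whole document, so the
    correct readings all share the same offset and outvote the OCR errors.
    """
    offsets = Counter(
        number - index
        for index, number in enumerate(ocr_numbers)
        if number is not None
    )
    if not offsets:
        return None
    return offsets.most_common(1)[0][0]


def correct_page_numbers(ocr_numbers: List[Optional[int]]) -> List[Optional[int]]:
    """Replace each per-page reading with the value the document-level model predicts."""
    offset = infer_offset(ocr_numbers)
    if offset is None:
        # No reading at all: nothing to infer, return the input unchanged.
        return list(ocr_numbers)
    return [index + offset for index in range(len(ocr_numbers))]


# Hypothetical OCR output for six consecutive page images:
# one misread (18 instead of 13) and one page where nothing was detected.
noisy = [11, 12, 18, 14, None, 16]
corrected = correct_page_numbers(noisy)
print(corrected)
```

A real book typically mixes several numbering sequences (for example Roman-numbered front matter followed by Arabic-numbered body pages), so a practical system would need to segment the document into sequences before voting. The sketch deliberately ignores that.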