A real-world noisy unstructured handwritten notebook corpus for document image analysis research

Authors:
Jin Chen; Daniel Lopresti; Bart Lamiroy
Affiliations:
Lehigh University, Bethlehem, PA; Lehigh University, Bethlehem, PA; Nancy Université-Loria, BP, Vandoeuvre Cedex, France
Venue:
Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data
Year:
2011

Abstract

Traditionally, document image analysis (DIA) is conducted on datasets that are prepared for research purposes. Many existing handwriting datasets, however, do not necessarily represent the range of problems we wish to solve in real life. In this work, we introduce a noisy, unstructured handwriting dataset intended to promote and evaluate robust document analysis algorithms for real-world challenges, with an emphasis on the process of building and curating the dataset. First, we explain the data acquisition process and characterize the dataset's critical features as noisy and unstructured. Then, we discuss a set of real-world scenarios that might benefit from using our notebook dataset. This is an ongoing activity: so far we have collected 18 handwritten notebooks from nine college students, for a total of 499 pages. We expect to collect over 100 notebooks, or about 3,000 pages, from at least 50 students. The dataset is available to the research community via the Lehigh document analysis and exploitation (DAE) platform.