We present a general approach for the hierarchical segmentation and labeling of document layout structures. The approach models document layout as a grammar and performs a global search for the optimal parse under a grammatical cost function. Our contribution is to use machine learning to discriminatively select features and set all parameters of the parsing process. As a result, and unlike many other approaches to layout analysis, ours adapts easily to a variety of document analysis problems: one need only specify the page grammar and provide a set of correctly labeled pages. We apply this technique to two document image analysis tasks: page layout structure extraction and mathematical expression interpretation. Experiments demonstrate that the learned grammars can be used to extract the document structure in 57 files from the UWIII document image database. We also show that the same framework can be used to automatically interpret printed mathematical expressions so as to recreate the original LaTeX.
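To make the idea of a global search for the minimum-cost parse concrete, the following is a minimal sketch and not the method from the paper. It assumes a one-dimensional sequence of text lines rather than two-dimensional page regions, a toy grammar (Page -> Title Body, Body -> Para Body or a single paragraph), and hand-set costs based on a hypothetical `font_size` feature where the paper would use learned features and parameters. All names (`BINARY_RULES`, `terminal_cost`, `best_parse`) are illustrative.

```python
import math
from functools import lru_cache

# Hypothetical toy grammar: (parent, (left, right)) -> rule cost.
BINARY_RULES = {
    ("Page", ("Title", "Body")): 0.0,
    ("Body", ("Para", "Body")): 0.1,
}


def terminal_cost(label, line):
    """Cost of labeling a single line with `label`.

    Hand-set for illustration; in a learned system these costs would come
    from trained feature weights rather than fixed thresholds.
    """
    is_large = line["font_size"] >= 18
    if label == "Title":
        return 0.0 if is_large else 5.0
    if label in ("Para", "Body"):  # a Body may consist of a single paragraph
        return 5.0 if is_large else 0.0
    return math.inf


def best_parse(lines, start_symbol="Page"):
    """Return (cost, tree) for the cheapest parse of `lines` from `start_symbol`."""

    @lru_cache(maxsize=None)
    def solve(symbol, i, j):
        # Cheapest derivation of lines[i:j] from `symbol`.
        best_cost, best_tree = math.inf, None
        if j - i == 1:
            best_cost = terminal_cost(symbol, lines[i])
            best_tree = (symbol, i)
        for (parent, (left, right)), rule_cost in BINARY_RULES.items():
            if parent != symbol:
                continue
            for k in range(i + 1, j):  # try every split point
                left_cost, left_tree = solve(left, i, k)
                right_cost, right_tree = solve(right, k, j)
                total = rule_cost + left_cost + right_cost
                if total < best_cost:
                    best_cost = total
                    best_tree = (symbol, left_tree, right_tree)
        return best_cost, best_tree

    return solve(start_symbol, 0, len(lines))


# Example input: one large-font line followed by two body-text lines.
example_lines = [{"font_size": 24}, {"font_size": 12}, {"font_size": 12}]
example_cost, example_tree = best_parse(example_lines)
```

Memoizing on `(symbol, i, j)` means each span is solved once per symbol, which is what makes an exhaustive search over all splits tractable in this simplified setting. Adapting the sketch to a new problem would only require changing the grammar and the cost function, which mirrors the adaptability claim in the abstract.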