We present and analyze efficient algorithms for the automated recognition and interpretation of layout structures in electronic documents. The key idea is to use the patterns in the distribution of white space in a document to recognize and interpret its components. The recognition algorithm divides the document into a hierarchy of logical elements; the interpretation algorithms classify these divisions as base-text, tables, indented lists, polygonal drawings, and graphs. We present experimental data and discuss an information access application. Our methodology allows the automatic markup of documents (for instance, in the SGML format) and the creation of multi-level indices and browsing tools for electronic libraries.
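The white-space idea can be illustrated with a small sketch. This is not the authors' algorithm: it is a simplified, hypothetical illustration that assumes a plain-text document using spaces only (no tabs), splits it into blocks at blank lines, and labels a block by looking for columns that are blank in every line. The function names (`split_blocks`, `whitespace_gaps`, `classify`), the `min_width` threshold, and the three labels are assumptions chosen for this example; drawings and graphs are not handled.

```python
from typing import List, Tuple


def split_blocks(lines: List[str]) -> List[List[str]]:
    """Split lines into blocks separated by one or more blank lines."""
    blocks: List[List[str]] = []
    current: List[str] = []
    for line in lines:
        if line.strip():
            current.append(line.rstrip())
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def whitespace_gaps(block: List[str], min_width: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges that are blank in every line.

    Leading indentation (a gap starting at column 0) is ignored, and gaps
    narrower than min_width are treated as ordinary word spacing.
    """
    width = max(len(line) for line in block)
    padded = [line.ljust(width) for line in block]
    blank = [all(row[c] == " " for row in padded) for c in range(width)]
    gaps: List[Tuple[int, int]] = []
    start = None
    for c, is_blank in enumerate(blank):
        if is_blank and start is None:
            start = c
        elif not is_blank and start is not None:
            if start > 0 and c - start >= min_width:
                gaps.append((start, c))
            start = None
    return gaps


def classify(block: List[str]) -> str:
    """Assign a coarse label to a block (assumed heuristic, not from the paper)."""
    if len(block) >= 2 and whitespace_gaps(block):
        return "table"  # shared vertical white-space rivers suggest columns
    if all(line.startswith(" ") for line in block):
        return "indented-list"
    return "base-text"


if __name__ == "__main__":
    doc = [
        "Plain paragraph text that runs",
        "across two lines.",
        "",
        "name    qty   price",
        "apple   3     0.50",
        "pear    12    0.25",
        "",
        "    an indented item",
        "    another item",
    ]
    for block in split_blocks(doc):
        print(classify(block), "|", block[0])
```

Applying `split_blocks` recursively inside each block, for example on the column gaps found by `whitespace_gaps`, would be one way to move from this flat segmentation toward the hierarchy of logical elements the abstract describes.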