We describe the architecture of a system for reading machine-printed documents in known, predefined tabular-data layout styles. In these tables, textual data are presented in record lines made up of fixed-width fields. Such tables often do not rely on line-art (ruled lines) to delimit fields, and in this respect differ crucially from fixed forms. The system handles multiple tables per page, identifies records within each table, segments records into fields, and recognizes characters within fields, constrained by field-specific contextual knowledge. Obstacles to good performance on tables include small print, tight line-spacing, poor-quality text (such as photocopies), and line-art or background patterns that touch the text. Precise skew correction and pitch estimation, together with high-performance OCR using neural nets, proved crucial in overcoming these obstacles. The most significant technical advances in this work appear to be the algorithms for identifying and segmenting records with a known layout, and the integration of these algorithms with a graphical user interface (GUI) for defining new layouts. The GUI has been ergonomically designed to make efficient and intuitive use of example images, so that the skill and manual effort required to retarget the system to new table layouts are held to a minimum. The system has been applied in this way to more than 400 distinct tabular layouts. Over the last three years the system has read over fifty million records with high accuracy.
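The idea of segmenting a record into fixed-width fields and constraining each field with contextual knowledge can be illustrated with a minimal sketch. It is not the system's actual algorithm: it assumes the record line has already been converted to text with one character per column (the system described works on page images and depends on skew correction and pitch estimation to get that far). The names `Field`, `LAYOUT`, `segment_record`, and `invalid_fields`, and the three-field layout itself, are hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str     # hypothetical field label
    start: int    # zero-based column of the field's first character
    width: int    # number of character columns the field occupies
    pattern: str  # regex the trimmed field text must match in full


# Hypothetical layout: 8-digit account, 12-column name, 9-column amount,
# with a single blank column between neighbouring fields.
LAYOUT = [
    Field("account", 0, 8, r"\d{8}"),
    Field("name", 9, 12, r"[A-Z ]+"),
    Field("amount", 22, 9, r"\d+\.\d{2}"),
]


def segment_record(line, layout):
    """Cut one record line into named fields using fixed column positions."""
    # Pad short lines so every slice covers the field's full width.
    padded = line.ljust(max(f.start + f.width for f in layout))
    return {f.name: padded[f.start:f.start + f.width].strip() for f in layout}


def invalid_fields(record, layout):
    """Return the names of fields whose text violates the field's pattern."""
    return [
        f.name
        for f in layout
        if re.fullmatch(f.pattern, record[f.name]) is None
    ]


# Build a sample record that follows the hypothetical layout.
sample = "12345678" + " " + "SMITH J".ljust(12) + " " + "104.50".rjust(9)
record = segment_record(sample, LAYOUT)
print(record)
print(invalid_fields(record, LAYOUT))
```

In this sketch a field that fails its pattern (for example, an account number in which a digit was misread as a letter) is only reported. A recognizer could instead use the same constraint to choose among candidate character readings, which is closer to the "field-specific contextual knowledge" mentioned above.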