Recursive X-Y cut using bounding boxes of connected components
ICDAR '95 Proceedings of the Third International Conference on Document Analysis and Recognition (Volume 2) - Volume 2
Named Entity recognition without gazetteers
EACL '99 Proceedings of the ninth conference on European chapter of the Association for Computational Linguistics
Visually guided bottom-up table detection and segmentation in web documents
Proceedings of the 15th international conference on World Wide Web
Towards domain-independent information extraction from web tables
Proceedings of the 16th international conference on World Wide Web
TableSeer: automatic table metadata extraction and searching in digital libraries
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Identifying table boundaries in digital documents via sparse line detection
Proceedings of the 17th ACM conference on Information and knowledge management
Spatial Relation Based Object Extraction from the World Wide Web
WI-IAT '08 Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 03
Table extraction using spatial reasoning on the CSS2 visual box model
AAAI'06 proceedings of the 21st national conference on Artificial intelligence - Volume 2
An Intelligent information segmentation approach to extract financial data for business valuation
Expert Systems with Applications: An International Journal
Analysis and taxonomy of column header categories for web tables
DAS '10 Proceedings of the 9th IAPR International Workshop on Document Analysis Systems
Interactive conversion of web tables
GREC'09 Proceedings of the 8th international conference on Graphics recognition: achievements, challenges, and evolution
An efficient pre-processing method to identify logical components from PDF documents
PAKDD'11 Proceedings of the 15th Pacific-Asia conference on Advances in knowledge discovery and data mining - Volume Part I
IEA/AIE'11 Proceedings of the 24th international conference on Industrial engineering and other applications of applied intelligent systems conference on Modern approaches in applied intelligence - Volume Part I
Advanced information retrieval from web pages
FDIA'07 Proceedings of the 1st BCS IRSG conference on Future Directions in Information Access
Feature-based object identification for web automation
Proceedings of the 28th Annual ACM Symposium on Applied Computing
Hi-index | 0.00 |
We describe a method to extract tabular data from web pages. Rather than just analyzing the DOM tree, we also exploit visual cues in the rendered version of the document to extract data from tables which are not explicitly marked with an HTML table element. To detect tables, we rely on a variant of the well-known X-Y cut algorithm as used in the OCR community. We implemented the system by directly accessing Mozilla's box model that contains the positional data for all HTML elements of a given web page.