A machine learning based approach for table detection on the web
Proceedings of the 11th international conference on World Wide Web
A Tutorial on Support Vector Machines for Pattern Recognition
Data Mining and Knowledge Discovery
Naive Bayesian Classifier Committees
ECML '98 Proceedings of the 10th European Conference on Machine Learning
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
ICDAR '97 Proceedings of the 4th International Conference on Document Analysis and Recognition
Detecting Tables in HTML Documents
DAS '02 Proceedings of the 5th International Workshop on Document Analysis Systems V
Recursive X-Y cut using bounding boxes of connected components
ICDAR '95 Proceedings of the Third International Conference on Document Analysis and Recognition (Volume 2) - Volume 2
Table extraction using conditional random fields
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Flexible Web Document Analysis for Delivery to Narrow-Bandwidth Devices
ICDAR '01 Proceedings of the Sixth International Conference on Document Analysis and Recognition
ICDAR '01 Proceedings of the Sixth International Conference on Document Analysis and Recognition
Applying the T-Recs Table Recognition System to the Business Letter Domain
ICDAR '01 Proceedings of the Sixth International Conference on Document Analysis and Recognition
Mining tables from large scale HTML texts
COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 1
A survey of table recognition: Models, observations, transformations, and inferences
International Journal on Document Analysis and Recognition
Learning to recognize tables in free text
ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics
Using visual cues for extraction of tabular data from arbitrary HTML documents
WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Shallow parsing with conditional random fields
NAACL '03 Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1
CONLL '03 Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4
TableSeer: automatic table metadata extraction and searching in digital libraries
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Identifying table boundaries in digital documents via sparse line detection
Proceedings of the 17th ACM conference on Information and knowledge management
Improving the Table Boundary Detection in PDFs by Fixing the Sequence Error of the Sparse Lines
ICDAR '09 Proceedings of the 2009 10th International Conference on Document Analysis and Recognition
Enhancing document structure analysis using visual analytics
Proceedings of the 2010 ACM Symposium on Applied Computing
Efficiently inducing features of conditional random fields
UAI'03 Proceedings of the Nineteenth conference on Uncertainty in Artificial Intelligence
Hi-index | 0.00 |
As the rapid growth of the scientific documents in digital libraries, the search demands for the documents as well as specific components increase dramatically. Accurately detecting the component boundary is of vital importance to all the further information extraction and applications. However, document component boundary detection (especially the table, figure, and equation) is a challenging problem because there is no standardized formats and layouts across diverse documents. This paper presents an efficient document preprocessing technique to improve the document component boundary detection performance by taking advantage of the nature of document lines. Our method easily simplifies the component boundary detection problem into the sparse line analysis problem with much less noise. We define eight document line label types and apply machine learning techniques as well as the heuristic rule-based method on identifying multiple document components. Combining with different heuristic rules, we extract the multiple components in a batch way by filtering out massive noises as early as possible. Our method focus on an important un-tagged document format - PDF documents. The experimental results prove the effectiveness of the sparse line analysis.