Most prior work on information extraction has focused on extracting information from the running text of digital documents. Often, however, the most important information in an article is presented in tabular form. If the data reported in tables can be extracted and stored in a database, it can be queried and joined with other data using a database management system. When preparing a data source for table search, accurate detection of the table boundary is crucial for the subsequent decomposition of the table structure. Table boundary detection and content extraction are challenging because tabular formats are not standardized across documents. In this paper, we propose a simple but effective preprocessing method that improves table boundary detection by exploiting the sparse-line property of table rows. Our method reduces the table boundary detection problem to a sparse-line analysis problem with much less noise. We design eight line label types and apply two machine learning techniques, Conditional Random Fields (CRF) and Support Vector Machines (SVM), to table boundary detection. The experimental results compare the performance of the machine learning methods with that of a heuristics-based method and demonstrate the effectiveness of sparse-line analysis for table boundary detection.
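To make the idea of sparse-line preprocessing concrete, the following is a minimal sketch and not the method evaluated in the paper. The abstract does not define when a line counts as sparse, so the criterion used here (a wide gap between words, or a line much shorter than the page width) is an assumption, as are the function names `is_sparse_line` and `candidate_regions`, the thresholds, and the premise that the document is available as plain text lines with their spacing preserved.

```python
import re


def is_sparse_line(line, min_gap=3, short_ratio=0.5, page_width=80):
    """Return True if a text line looks 'sparse'.

    Assumed heuristic (not taken from the paper): a line is sparse when it
    contains a run of at least `min_gap` whitespace characters between two
    words, or when it is shorter than `short_ratio * page_width` characters.
    """
    text = line.strip()
    if not text:
        return False  # blank lines carry no evidence either way
    wide_gap = re.search(r"\S\s{%d,}\S" % min_gap, text) is not None
    short = len(text) < short_ratio * page_width
    return wide_gap or short


def candidate_regions(lines, min_rows=2):
    """Group consecutive sparse lines into (first_index, last_index) pairs.

    Runs shorter than `min_rows` are dropped, since an isolated sparse line
    is more likely a heading or a caption than a table.
    """
    regions, start = [], None
    for i, line in enumerate(lines):
        if is_sparse_line(line):
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_rows:
                regions.append((start, i - 1))
            start = None
    if start is not None and len(lines) - start >= min_rows:
        regions.append((start, len(lines) - 1))
    return regions
```

In a pipeline of the kind the abstract describes, only the lines kept by a filter like this would be passed to a line classifier such as a CRF or an SVM, which then assigns the line labels. The filter itself only narrows the search space and makes no boundary decision.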