This paper addresses the problem of table extraction from plain text. Tables are one of the most common ways of presenting information, and table extraction has applications in information retrieval, knowledge acquisition, and text mining. Automatically extracting information from tables remains a challenge. Existing methods have mainly focused on extracting tables from web pages (formatted table extraction). To the best of our knowledge, table extraction from plain text has not yet received sufficient attention. In this paper, unformatted table extraction is formalized as two subtasks: unformatted table block detection and unformatted table row identification. We concentrate in particular on table extraction from Chinese documents. We propose to perform table extraction by combining machine learning methods with document structure. We first view the task as a classification problem and propose a statistical approach based on Naïve Bayes, for which we define a set of features. Next, we use document structure to improve detection performance. Experimental results indicate that the proposed methods significantly outperform the baseline methods for unformatted table extraction.
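To make the classification view concrete, the following is a minimal sketch of labelling plain-text lines as table rows with a Bernoulli Naïve Bayes model and then grouping consecutive rows into blocks. It is an illustration only, not the paper's implementation: the four layout features, the toy training lines, the class and function names (`line_features`, `BernoulliNB`, `table_blocks`), and the grouping step are all assumptions introduced here, and the abstract does not specify the actual feature set or how document structure is used.

```python
import math
import re


def line_features(line):
    """Hypothetical binary layout features for one line (not the paper's features)."""
    tokens = line.split()
    return {
        # two or more spaces between non-space characters, as in aligned columns
        "has_wide_gap": bool(re.search(r"\S {2,}\S", line)),
        "has_tab": "\t" in line,
        # at least two tokens that contain a digit
        "many_numbers": sum(any(c.isdigit() for c in t) for t in tokens) >= 2,
        # sentence-final punctuation (Latin or Chinese full stop)
        "ends_with_period": line.rstrip().endswith((".", "。")),
    }


class BernoulliNB:
    """Naive Bayes over binary features with add-one (Laplace) smoothing."""

    def fit(self, feature_dicts, labels):
        self.labels = sorted(set(labels))
        self.names = sorted(feature_dicts[0])
        self.prior, self.cond = {}, {}
        for y in self.labels:
            rows = [f for f, l in zip(feature_dicts, labels) if l == y]
            self.prior[y] = len(rows) / len(labels)
            # P(feature is true | class), smoothed so no probability is 0 or 1
            self.cond[y] = {
                name: (sum(r[name] for r in rows) + 1) / (len(rows) + 2)
                for name in self.names
            }
        return self

    def predict(self, feats):
        scores = {}
        for y in self.labels:
            score = math.log(self.prior[y])
            for name in self.names:
                p = self.cond[y][name]
                score += math.log(p if feats[name] else 1.0 - p)
            scores[y] = score
        return max(scores, key=scores.get)


def table_blocks(labels):
    """Group consecutive 'row' labels into (first_index, last_index) blocks."""
    blocks, start = [], None
    for i, y in enumerate(list(labels) + ["text"]):  # sentinel closes a trailing block
        if y == "row" and start is None:
            start = i
        elif y != "row" and start is not None:
            blocks.append((start, i - 1))
            start = None
    return blocks


# Toy training data, invented for this sketch.
TRAIN = [
    ("Name      Age   City", "row"),
    ("Alice     30    Beijing", "row"),
    ("Bob       25    Shanghai", "row"),
    ("Item\tQty\tPrice", "row"),
    ("Pen\t12\t3.50", "row"),
    ("This paper addresses table extraction from plain text.", "text"),
    ("We first view the task as a classification problem.", "text"),
    ("Existing methods focus on tables in web pages.", "text"),
    ("The results are discussed in the next section.", "text"),
]

model = BernoulliNB().fit(
    [line_features(line) for line, _ in TRAIN],
    [label for _, label in TRAIN],
)

document = [
    "We concentrate on Chinese documents.",
    "Carol     41    Shenzhen",
    "Dave      37    Guangzhou",
    "Results follow in the next part.",
]
predicted = [model.predict(line_features(line)) for line in document]
blocks = table_blocks(predicted)
```

In this sketch each line is classified independently, and a block is simply a maximal run of lines labelled as rows. A realistic system would need a much larger labelled corpus and features tuned to the target documents.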