A machine learning based approach for separating head from body in web-tables

Authors:
Sung-Won Jung;Hyuk-Chul Kwon
Affiliations:
Korean Language Processing Lab., Department of Computer Science and Engineering, Pusan National University, Busan, Korea;Korean Language Processing Lab., Department of Computer Science and Engineering, Pusan National University, Busan, Korea
Venue:
CICLing'06 Proceedings of the 7th international conference on Computational Linguistics and Intelligent Text Processing
Year:
2006

Abstract

This study aims to separate the head from the body (the data) in web-tables so that useful information can be extracted from them. To this end, web-tables must be converted into a machine-readable form: attribute-value pairs, whose relation mirrors that of head and body. Web-tables are used for document design as well as for structuring knowledge, and only meaningful tables can be used to extract information, so our previous work separated meaningful tables from decorative ones. To extract the semantic relations between the language contents of a meaningful table, this study separates the head from the body of such tables using machine learning. We (a) defined features based on observations of authors' editing habits and of the tables themselves, and (b) built a model with the C4.5 machine learning algorithm to separate the head from the body. The model achieved 86.2% accuracy in extracting the head from meaningful tables.
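As a rough illustration of step (b), C4.5 selects the feature to split on by its gain ratio (information gain divided by split information). The minimal sketch below computes that criterion on a toy set of table rows. The feature names `uses_th` and `is_bold` and the six example rows are hypothetical, invented here for illustration; they are not the features or the data used in the paper.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def gain_ratio(rows, labels, feature):
    """Gain ratio of one discrete feature, the split criterion used by C4.5."""
    total = len(rows)
    # Group the class labels by the value the feature takes in each row.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature], []).append(label)
    # Expected entropy of the labels after splitting on the feature.
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    info_gain = entropy(labels) - remainder
    # Split information penalises features that fragment the data.
    split_info = entropy([row[feature] for row in rows])
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: one dict per table row (assumed features, not the paper's).
rows = [
    {"uses_th": True, "is_bold": True},
    {"uses_th": True, "is_bold": False},
    {"uses_th": False, "is_bold": True},
    {"uses_th": False, "is_bold": False},
    {"uses_th": False, "is_bold": False},
    {"uses_th": False, "is_bold": False},
]
labels = ["head", "head", "body", "body", "body", "body"]

# The feature with the highest gain ratio would become the root test of the tree.
best = max(rows[0], key=lambda f: gain_ratio(rows, labels, f))
print(best, {f: round(gain_ratio(rows, labels, f), 3) for f in rows[0]})
```

In this toy data the assumed `uses_th` feature separates head rows from body rows perfectly, so it is chosen as the first split; a full C4.5 implementation would recurse on each partition and then prune the resulting tree.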