Hybrid approach to extracting information from web-tables

Authors:
Sung-won Jung;Mi-young Kang;Hyuk-chul Kwon
Affiliations:
Korean Language Processing Laboratory, Department of Computer Science Engineering, Pusan National University;Korean Language Processing Laboratory, Department of Computer Science Engineering, Pusan National University;Korean Language Processing Laboratory, Department of Computer Science Engineering, Pusan National University
Venue:
ICCPOL'06 Proceedings of the 21st international conference on Computer Processing of Oriental Languages: beyond the orient: the research challenges ahead
Year:
2006

Citing 4

Data mining: practical machine learning tools and techniques with Java implementations
A machine learning based approach for table detection on the web. Proceedings of the 11th international conference on World Wide Web
Mining tables from large scale HTML texts. COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 1
A Scalable Hybrid Approach for Extracting Head Components from Web Tables. IEEE Transactions on Knowledge and Data Engineering

Cited 2

Analysis and Interpretation of Semantic HTML Tables. WISM '09 Proceedings of the International Conference on Web Information Systems and Mining
Extracting Ontology Properties from the Web-Tables. International Journal of Systems and Service-Oriented Engineering

Abstract

This study concerns the extraction of information from tables in HTML documents. In our previous work, as a prerequisite for such extraction, we constructed algorithms that separate meaningful tables from decorative tables, because information can be extracted only from meaningful tables and a preponderance of decorative tables in the training data harms the learning result. To extract information, this study separates the head from the body in meaningful tables by extending the head-extraction algorithm constructed in our previous work, using a machine learning algorithm, C4.5, and sets up heuristics that extract the table schema from meaningful tables by analyzing their head(s). In addition, table information is extracted as triples by determining the relation between the data and the extracted table schema. We obtained 71.2% accuracy in extracting table schemata and information from the meaningful tables.
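
The final step described above, turning body cells into triples by relating them to the extracted schema, can be illustrated with a minimal Python sketch. This is not the authors' heuristics, which the abstract does not spell out. It assumes the head/body boundary is already known (in the paper it is found with C4.5), that stacked head cells are joined column by column to form the schema, and that the first column of each body row names the subject. The names `extract_triples` and `head_rows` are hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def extract_triples(rows: List[List[str]], head_rows: int = 1) -> List[Triple]:
    """Turn a simple grid table into (subject, attribute, value) triples.

    Illustrative assumptions: the first `head_rows` rows are the head, and
    the first cell of each body row is the subject of that row.
    """
    head, body = rows[:head_rows], rows[head_rows:]
    if not head:
        return []  # without a head there is no schema to relate data to

    # Build the schema: join the non-empty head cells of each column.
    n_cols = max(len(r) for r in head)
    schema = []
    for col in range(n_cols):
        parts = [r[col].strip() for r in head if col < len(r) and r[col].strip()]
        schema.append(" / ".join(parts))

    # Relate every non-empty body cell to its row subject and column attribute.
    triples: List[Triple] = []
    for row in body:
        if not row:
            continue
        subject = row[0].strip()
        for col, value in enumerate(row[1:], start=1):
            if col < n_cols and value.strip():
                triples.append((subject, schema[col], value.strip()))
    return triples


if __name__ == "__main__":
    table = [
        ["Model", "Price", "Weight"],
        ["A100", "$300", "1.2 kg"],
        ["B200", "$450", ""],
    ]
    for triple in extract_triples(table):
        print(triple)
```

Real web tables often have spanning cells, heads in the first column, or nested heads, so a fixed `head_rows` count is only a stand-in for the learned head/body separation and the schema heuristics the study describes.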