Mining table information on the internet

Authors:
Sung-won Jung;Gi-deuk Han;Hyuk-chul Kwon
Affiliations:
Korean Language Processing Lab. School of Electrical & Computer Engineering, Pusan National University, Busan, Korea;Korean Language Processing Lab. School of Electrical & Computer Engineering, Pusan National University, Busan, Korea;Korean Language Processing Lab. School of Electrical & Computer Engineering, Pusan National University, Busan, Korea
Venue:
IJCNLP'04 Proceedings of the First international joint conference on Natural Language Processing
Year:
2004

Abstract

When creating HTML documents, authors use various methods to convey their intentions clearly. Among these methods, this paper pays special attention to tables, which are commonly used in documents to make meaning clear and which are easy to recognize because web documents use tags to carry additional information. On the Internet, tables are used to structure knowledge as well as to lay out documents. We are therefore first interested in classifying tables into two types: meaningful tables and decorative tables. This is not easy, however, because HTML does not separate presentation from structure. This paper proposes a method of extracting meaningful tables using a modified k-means algorithm and compares it with other methods. The experimental results show that this classification of tables in web documents is promising.
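
To make the idea of separating meaningful from decorative tables concrete, the following is a minimal sketch and not the method of the paper. The abstract does not describe the features used or how k-means was modified, so this example assumes four hypothetical features per table (row count, cell count, share of header cells, and whether the table contains a nested table) and uses plain, unmodified k-means with k = 2. The names TableFeatureParser, table_features, and kmeans are illustrative only.

```python
from html.parser import HTMLParser


class TableFeatureParser(HTMLParser):
    """Collects simple, hypothetical features for each <table> in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []      # tables currently open (handles nesting)
        self.features = []   # one feature vector per table, in closing order

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self.stack:
                # The enclosing table contains a nested table.
                self.stack[-1]["nested"] = 1.0
            self.stack.append({"rows": 0, "cells": 0, "th": 0, "nested": 0.0})
        elif self.stack:
            table = self.stack[-1]
            if tag == "tr":
                table["rows"] += 1
            elif tag in ("td", "th"):
                table["cells"] += 1
                table["th"] += tag == "th"

    def handle_endtag(self, tag):
        if tag == "table" and self.stack:
            table = self.stack.pop()
            cells = max(table["cells"], 1)  # avoid division by zero
            self.features.append([
                float(table["rows"]),
                float(table["cells"]),
                table["th"] / cells,   # share of header cells
                table["nested"],       # 1.0 if it wraps another table
            ])


def table_features(html):
    """Return one feature vector per table; inner tables come before outer ones."""
    parser = TableFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.features


def kmeans(points, k=2, iters=20):
    """Plain k-means (assumption: NOT the modified variant from the paper)."""
    centroids = [list(p) for p in points[:k]]  # deterministic init: first k points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


page = ("<table><tr><td>"
        "<table><tr><th>a</th><th>b</th></tr><tr><td>1</td><td>2</td></tr></table>"
        "</td></tr></table>")
vectors = table_features(page)
labels, centroids = kmeans(vectors, k=2)
```

k-means only groups the tables into two clusters; deciding which cluster corresponds to meaningful tables would need a further step, such as inspecting the centroids or a few labeled examples. In practice the features would also need scaling so that raw counts do not dominate the distance.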