Mining table information on the internet

  • Authors:
  • Sung-won Jung; Gi-deuk Han; Hyuk-chul Kwon

  • Affiliations:
  • Korean Language Processing Lab., School of Electrical & Computer Engineering, Pusan National University, Busan, Korea (all authors)

  • Venue:
  • IJCNLP'04: Proceedings of the First International Joint Conference on Natural Language Processing
  • Year:
  • 2004

Abstract

When creating HTML documents, authors use various methods to convey their intentions clearly. Among these methods, this paper pays special attention to tables, because tables are commonly used in many documents to make meaning clear, and they are easy to recognize because web documents use tags to carry additional information. On the Internet, tables are used both to structure knowledge and to lay out documents. Thus, we are first interested in classifying tables into two types: meaningful tables and decorative tables. However, this is not easy because HTML does not separate presentation from structure. This paper proposes a method of extracting meaningful tables using a modified k-means and compares it with other methods. The experimental results show that classifying tables in web documents is promising.
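
As a rough illustration of the clustering idea in the abstract, the sketch below groups HTML tables into two clusters with plain k-means. It is not the paper's modified k-means, whose details the abstract does not give. The feature set (row count, number of th cells, average text length per cell) and the names table_features and kmeans are assumptions chosen for illustration only.

```python
import math
from html.parser import HTMLParser


class TableFeatureParser(HTMLParser):
    """Collects simple counts per top-level table (nested tables are folded in)."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._depth = 0
        self._cur = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._depth += 1
            if self._depth == 1:
                self._cur = {"rows": 0, "cells": 0, "th": 0, "chars": 0}
        elif self._cur is not None:
            if tag == "tr":
                self._cur["rows"] += 1
            elif tag in ("td", "th"):
                self._cur["cells"] += 1
                if tag == "th":
                    self._cur["th"] += 1

    def handle_endtag(self, tag):
        if tag == "table" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.tables.append(self._cur)
                self._cur = None

    def handle_data(self, data):
        if self._cur is not None:
            self._cur["chars"] += len(data.strip())


def table_features(html):
    """Return one hypothetical feature tuple per top-level table:
    (rows, th cells, average characters per cell)."""
    parser = TableFeatureParser()
    parser.feed(html)
    parser.close()
    return [
        (t["rows"], t["th"], t["chars"] / max(t["cells"], 1))
        for t in parser.tables
    ]


def kmeans(points, k=2, iters=20):
    """Standard k-means; centroids start at the first k points (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

With k=2, one cluster would ideally collect text-rich tables with header cells and the other sparse layout tables. Deciding which cluster counts as "meaningful" still requires inspection or labeled examples, and the features would normally be scaled before clustering.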