Detecting Tables in HTML Documents

  • Authors:
  • Yalin Wang;Jianying Hu

  • Venue:
  • DAS '02 Proceedings of the 5th International Workshop on Document Analysis Systems V
  • Year:
  • 2002

Abstract

The table is a commonly used presentation scheme for describing relational information. Table understanding on the web has many potential applications, including web mining, knowledge management, and web content summarization and delivery to narrow-bandwidth devices. Although tables in HTML documents are generally marked as <table> elements, a <table> element does not necessarily indicate the presence of a genuine relational table. Thus the important first step in table understanding in the web domain is the detection of genuine tables. In our earlier work we designed a basic rule-based algorithm to detect genuine tables in major news and corporate home pages as part of a web content filtering system. In this paper we investigate a machine learning based approach that is trainable and thus can be automatically generalized to any domain. Various features reflecting the layout as well as the content characteristics of tables are explored. The system is tested on a large database which consists of 1,393 HTML files collected from hundreds of different web sites from various domains and contains over 10,000 leaf <table> elements. Experiments were conducted using the cross validation method. The machine learning based approach outperformed the rule-based system and achieved an F-measure of 95.88%.
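
As an illustration of the term "leaf <table> element" used in the abstract, the following minimal sketch counts the <table> elements in an HTML string that contain no nested <table>. It assumes that "leaf" means exactly this; the abstract does not spell out the definition. The names `LeafTableCounter` and `count_leaf_tables` are hypothetical and are not taken from the paper. The sketch only enumerates candidate elements and makes no attempt to decide whether a table is genuine.

```python
from html.parser import HTMLParser


class LeafTableCounter(HTMLParser):
    """Counts <table> elements and those with no nested <table> (assumed 'leaf')."""

    def __init__(self):
        super().__init__()
        # One flag per currently open <table>: True once a nested table is seen.
        self._has_nested = []
        self.total = 0
        self.leaf = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.total += 1
            if self._has_nested:
                self._has_nested[-1] = True  # the enclosing table is not a leaf
            self._has_nested.append(False)

    def handle_endtag(self, tag):
        if tag == "table" and self._has_nested:
            # A table is counted as a leaf only when it closes without nesting.
            if not self._has_nested.pop():
                self.leaf += 1


def count_leaf_tables(html):
    """Return (total <table> elements, leaf <table> elements) for an HTML string."""
    parser = LeafTableCounter()
    parser.feed(html)
    parser.close()
    return parser.total, parser.leaf
```

Tables that are never closed in the markup are counted in the total but not as leaves, so real-world pages with malformed HTML would need a more tolerant parser.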