Web-scale table census and classification

  • Authors:
  • Eric Crestan; Patrick Pantel

  • Affiliations:
  • Yahoo! Labs, Sunnyvale, CA, USA; Microsoft Research, Redmond, WA, USA

  • Venue:
  • Proceedings of the fourth ACM international conference on Web search and data mining
  • Year:
  • 2011

Abstract

We report on a census of the types of HTML tables on the Web according to a fine-grained classification taxonomy describing the semantics that they express. For each relational table type, we describe open challenges for extracting semantic triples, i.e., knowledge, from such tables. We also present TabEx, a supervised framework for web-scale HTML table classification, and apply it to the task of classifying HTML tables into our taxonomy. We show empirical evidence, through a large-scale experimental analysis over a crawl of the Web, that its classification accuracy significantly exceeds that of several baselines. We present a detailed feature analysis and outline the most salient features for each table type.