Content extraction using diverse feature sets

  • Authors: Matthew E. Peters; Dan Lecocq
  • Affiliations: SEOmoz, Seattle, WA, USA (both authors)
  • Venue: Proceedings of the 22nd international conference on World Wide Web companion
  • Year: 2013

Abstract

The goal of content extraction or boilerplate detection is to separate the main content from navigation chrome, advertising blocks, copyright notices and the like in web pages. In this paper we explore a machine learning approach to content extraction that combines diverse feature sets and methods. Our main contributions are: a) preliminary results that show combining feature sets generally improves performance; and b) a method for including semantic information via id and class attributes applicable to HTML5. We also show that performance decreases on a new benchmark data set that better represents modern chrome.
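To make the idea of combining feature sets concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a page is split into block-level elements and that each block is described by two kinds of features: simple text statistics (text length, link density) and tokens taken from the block's id and class attributes. The names `BLOCK_TAGS`, `tokenize_attr`, `BlockFeatureParser`, and `block_features` are hypothetical, as is the choice of features; the abstract does not specify them.

```python
import re
from html.parser import HTMLParser

# Assumption: these tags delimit candidate blocks; the paper may segment differently.
BLOCK_TAGS = {"div", "p", "article", "section", "nav",
              "footer", "header", "aside", "li", "td"}


def tokenize_attr(value):
    """Split an id/class value such as 'main-content postBody' into lowercase tokens."""
    tokens = []
    for part in re.split(r"[^A-Za-z0-9]+", value or ""):
        # Also split camelCase words and separate runs of digits.
        pieces = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", part)
        tokens.extend(piece.lower() for piece in pieces)
    return tokens


class BlockFeatureParser(HTMLParser):
    """Collects text statistics and id/class tokens for each block element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._open = []       # stack of blocks that are still open
        self.blocks = []      # closed blocks, in closing order
        self._link_depth = 0  # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._link_depth += 1
        if tag in BLOCK_TAGS:
            attr = dict(attrs)
            tokens = tokenize_attr(attr.get("id")) + tokenize_attr(attr.get("class"))
            self._open.append({"tag": tag, "attr_tokens": tokens,
                               "text_len": 0, "link_len": 0})

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth > 0:
            self._link_depth -= 1
        if tag in BLOCK_TAGS and any(b["tag"] == tag for b in self._open):
            # Close unclosed inner blocks until the matching tag is reached.
            while self._open:
                block = self._open.pop()
                self.blocks.append(block)
                if block["tag"] == tag:
                    break

    def handle_data(self, data):
        n = len(data.strip())
        if n and self._open:
            # Text is credited to the innermost open block only.
            self._open[-1]["text_len"] += n
            if self._link_depth:
                self._open[-1]["link_len"] += n


def block_features(html):
    """Return one feature dict per block, merging both feature sets."""
    parser = BlockFeatureParser()
    parser.feed(html)
    parser.close()
    features = []
    # Include blocks that were never explicitly closed.
    for block in parser.blocks + parser._open[::-1]:
        feats = {
            "text_len": block["text_len"],
            "link_density": (block["link_len"] / block["text_len"]
                             if block["text_len"] else 0.0),
        }
        for token in block["attr_tokens"]:
            feats["attr=" + token] = 1  # binary indicator per id/class token
        features.append(feats)
    return features
```

Each resulting dict mixes numeric text features with binary attribute-token indicators, so a block with `class="nav footerLinks"` carries `attr=nav`, `attr=footer`, and `attr=links` alongside its link density. Dicts of this shape could then be vectorized and passed to any standard classifier trained on labeled content and boilerplate blocks; which classifier and which features the paper actually uses is not stated in the abstract.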