Content extraction using diverse feature sets

Authors:
Matthew E. Peters; Dan Lecocq
Affiliations:
SEOmoz, Seattle, WA, USA; SEOmoz, Seattle, WA, USA
Venue:
Proceedings of the 22nd international conference on World Wide Web companion
Year:
2013

Abstract

The goal of content extraction or boilerplate detection is to separate the main content from navigation chrome, advertising blocks, copyright notices and the like in web pages. In this paper we explore a machine learning approach to content extraction that combines diverse feature sets and methods. Our main contributions are: a) preliminary results that show combining feature sets generally improves performance; and b) a method for including semantic information via id and class attributes applicable to HTML5. We also show that performance decreases on a new benchmark data set that better represents modern chrome.
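As a rough illustration of contribution (b), the sketch below shows one way tokens from id and class attributes could be turned into features and merged with another feature set. It is not the authors' implementation: the helper names `id_class_tokens` and `combined_features`, the tokenization rule, and the `num_words` shallow text feature are all assumptions made for this example.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class IdClassTokens(HTMLParser):
    """Collect lowercase tokens from the id and class attributes of start tags."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("id", "class") and value:
                # Assumed tokenization: split on any run of non-alphanumerics,
                # so "nav-bar main_menu" yields nav, bar, main, menu.
                parts = re.split(r"[^a-z0-9]+", value.lower())
                self.tokens.extend(p for p in parts if p)


def id_class_tokens(html):
    """Return the id/class tokens found in an HTML fragment, in document order."""
    parser = IdClassTokens()
    parser.feed(html)
    parser.close()
    return parser.tokens


def combined_features(html_block, text):
    """Merge a hypothetical shallow text feature with id/class token counts."""
    features = {"num_words": len(text.split())}
    for token, count in Counter(id_class_tokens(html_block)).items():
        features["attr=" + token] = count
    return features


if __name__ == "__main__":
    print(combined_features('<div class="nav nav-item">', "Home About"))
```

A feature dictionary of this kind could then be passed to any standard classifier. The abstract does not specify which learner or which feature sets were combined, so that choice is left open here.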