The goal of content extraction, or boilerplate detection, is to separate the main content of a web page from navigation chrome, advertising blocks, copyright notices, and similar boilerplate. In this paper we explore a machine learning approach to content extraction that combines diverse feature sets and methods. Our main contributions are: a) preliminary results showing that combining feature sets generally improves performance; and b) a method, applicable to HTML5, for including semantic information via id and class attributes. We also show that performance decreases on a new benchmark data set that better represents modern chrome.
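
As a rough illustration of contribution b), the sketch below shows one way id and class attribute values could be turned into token counts that a classifier might consume. It is not the paper's method, which the abstract does not specify; the names `AttrTokenCollector` and `attr_token_features`, the camelCase splitting, and the tokenization rule are all assumptions made for this example.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumption: a "token" is any run of ASCII letters inside an attribute value.
TOKEN_RE = re.compile(r"[A-Za-z]+")


class AttrTokenCollector(HTMLParser):
    """Hypothetical helper: counts word tokens found in id and class attributes."""

    def __init__(self):
        super().__init__()
        self.tokens = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("id", "class") and value:
                # Insert a space at camelCase boundaries, e.g. "mainContent".
                spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", value)
                for tok in TOKEN_RE.findall(spaced):
                    self.tokens[tok.lower()] += 1


def attr_token_features(html):
    """Return a dict mapping each lowercase id/class token to its count."""
    parser = AttrTokenCollector()
    parser.feed(html)
    parser.close()
    return dict(parser.tokens)


if __name__ == "__main__":
    sample = '<div id="mainContent" class="article-body post"><p class="post">Hi</p></div>'
    print(attr_token_features(sample))
```

Counts like these could be concatenated with other feature sets before training, which is the kind of combination the abstract describes; how the authors actually weight or select such tokens is not stated here.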