DOM based content extraction via text density

  • Authors:
  • Fei Sun;Dandan Song;Lejian Liao

  • Affiliations:
  • Beijing Institute of Technology, Beijing, China;Beijing Institute of Technology, Beijing, China;Beijing Institute of Technology, Beijing, China

  • Venue:
  • Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
  • Year:
  • 2011

Quantified Score

Hi-index 0.00

Visualization

Abstract

In addition to the main content, most web pages also contain navigation panels, advertisements and copyright and disclaimer notices. This additional content, which is also known as noise, is typically not related to the main subject and may hamper the performance of web data mining, and hence needs to be removed properly. In this paper, we present Content Extraction via Text Density (CETD) a fast, accurate and general method for extracting content from diverse web pages, and using DOM (Document Object Model) node text density to preserve the original structure. For this purpose, we introduce two concepts to measure the importance of nodes: Text Density and Composite Text Density. In order to extract content intact, we propose a technique called DensitySum to replace Data Smoothing. The approach was evaluated with the CleanEval benchmark and with randomly selected pages from well-known websites, where various web domains and styles are tested. The average F1-scores with our method were 8.79% higher than the best scores among several alternative methods.