DOM based content extraction via text density

Authors:
Fei Sun;Dandan Song;Lejian Liao
Affiliations:
Beijing Institute of Technology, Beijing, China;Beijing Institute of Technology, Beijing, China;Beijing Institute of Technology, Beijing, China
Venue:
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Year:
2011

Abstract

In addition to the main content, most web pages also contain navigation panels, advertisements, and copyright and disclaimer notices. This additional content, also known as noise, is typically unrelated to the main subject and may hamper the performance of web data mining, so it needs to be removed properly. In this paper, we present Content Extraction via Text Density (CETD), a fast, accurate and general method that extracts content from diverse web pages and uses DOM (Document Object Model) node text density to preserve the original structure. For this purpose, we introduce two concepts to measure the importance of nodes: Text Density and Composite Text Density. To extract content intact, we propose a technique called DensitySum to replace Data Smoothing. The approach was evaluated on the CleanEval benchmark and on randomly selected pages from well-known websites, covering various web domains and styles. The average F1-scores with our method were 8.79% higher than the best scores among several alternative methods.
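
The abstract names Text Density but does not define it. As a rough illustration only, the sketch below assumes that a node's text density is the number of text characters in its subtree divided by the number of tags in that subtree, with a tag count of zero treated as one. The names `Node`, `TreeBuilder`, `parse` and `text_density` are hypothetical helpers, not the authors' implementation, and Composite Text Density and DensitySum are not attempted here.

```python
from html.parser import HTMLParser


class Node:
    """Minimal DOM-like node: tag name, parent, children, own text length."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.own_chars = 0  # characters of text directly inside this node


class TreeBuilder(HTMLParser):
    """Builds a simplified element tree (assumes reasonably well-formed HTML)."""

    VOID = {"br", "img", "hr", "meta", "link", "input"}
    SKIP = {"script", "style"}  # their text is not visible content

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in self.VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name, if any.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in self.SKIP:
            self.current.own_chars += len(data.strip())


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def char_count(node):
    return node.own_chars + sum(char_count(c) for c in node.children)


def tag_count(node):
    return len(node.children) + sum(tag_count(c) for c in node.children)


def text_density(node):
    # Assumed definition: characters per tag in the subtree (tags >= 1).
    return char_count(node) / max(tag_count(node), 1)


page = ('<div id="main"><p>' + "x" * 100 + "</p></div>"
        "<div><a>Home</a><a>News</a><a>Help</a></div>")
main_block, nav_block = parse(page).children
print(text_density(main_block), text_density(nav_block))  # 100.0 4.0
```

Under this assumed definition, the article-like block scores far higher than the link-heavy navigation block, which is the kind of contrast a density-based extractor would rely on to separate content from noise.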