Extracting informative textual parts from web pages containing user-generated content

Authors:
Nikolaos Pappas; Georgios Katsimpras; Efstathios Stamatatos
Affiliations:
Idiap Research Institute, Rue Marconi, Martigny, Switzerland; University of the Aegean, Karlovassi, Greece; University of the Aegean, Karlovassi, Greece
Venue:
Proceedings of the 12th International Conference on Knowledge Management and Knowledge Technologies
Year:
2012

Abstract

The vast amount of user-generated content on the Web has increased the need for automatic processing of the content in web pages. The segmentation of web pages and the removal of noise (non-informative segments) are important pre-processing steps in a variety of applications such as sentiment analysis, text summarization and information retrieval. Currently, these two tasks tend to be handled either separately or together without regard for the diversity of web corpora or for web page type detection. We present a unified approach that provides robust identification of informative textual parts in web pages along with accurate type detection. The proposed algorithm takes into account visual and non-visual characteristics of a web page and is able to remove noisy parts from three major categories of pages that contain user-generated content (News, Blogs, Discussions). Based on a human-annotated corpus covering diverse topics, domains and templates, we demonstrate the learning abilities of our algorithm and examine both its effectiveness in extracting the informative textual parts and its use as a rule-based classifier for web page type detection in a realistic web setting.
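To make the two pre-processing steps named in the abstract concrete, the sketch below segments a page into text blocks and drops blocks that look like noise. It is not the algorithm proposed in the paper: it swaps in a generic link-density heuristic, ignores visual characteristics entirely, and does no page type detection. The names `BLOCK_TAGS`, `BlockSegmenter`, `informative_blocks`, and the thresholds `min_words` and `max_link_density` are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumed set of tags that start or end a segment (illustrative, not from the paper).
BLOCK_TAGS = {"div", "p", "li", "td", "article", "section", "nav", "footer", "header"}


class BlockSegmenter(HTMLParser):
    """Split a page into text blocks and count the words that sit inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # finished blocks as (words, link_word_count)
        self._words = []        # words of the block being built
        self._link_words = 0    # how many of those words are anchor text
        self._in_link = 0       # nesting depth inside <a>
        self._skip = 0          # nesting depth inside <script>/<style>

    def _flush(self):
        # Close the current block, keeping it only if it has any text.
        if self._words:
            self.blocks.append((self._words, self._link_words))
        self._words, self._link_words = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)

    def close(self):
        super().close()
        self._flush()


def informative_blocks(html, min_words=5, max_link_density=0.5):
    """Return the text of blocks that pass the (assumed) noise thresholds."""
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    kept = []
    for words, link_words in parser.blocks:
        link_density = link_words / len(words)
        if len(words) >= min_words and link_density <= max_link_density:
            kept.append(" ".join(words))
    return kept
```

On a page with a navigation bar, two paragraphs of running text, and a short footer, this heuristic would be expected to keep only the two paragraphs, since the navigation block consists entirely of anchor text and the footer is too short. Real pages with user-generated content are far more varied, which is the gap the paper's learned, type-aware approach is meant to address.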