Distilling Informative Content from HTML News Pages

Authors:
Cai-Nicolas Ziegler;Christian Vogele;Maximilian Viermetz
Venue:
WI-IAT '09 Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 01
Year:
2009

Abstract

Information overload afflicts not only the Web as a whole but also its component molecules, the Web documents it contains. In particular, HTML news pages have become aggregates of numerous information blocks, such as advertisements, link lists, disclaimers and terms of use, or reader comments. As a result, only a small fraction of the textual content is dedicated to the actual news article. This amalgamation of relevant content with page clutter poses considerable problems for applications that make use of such news information, such as search engines. We present an approach for automatically separating relevant from irrelevant textual content. Our system extracts linguistic and structural features from merged text segments and then applies various classifiers. We have conducted empirical analyses to compare our approach's classification performance with a human gold standard and with two benchmark systems.
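
The abstract does not list the concrete features or classifiers, so the following is only a minimal sketch of the general pipeline it describes: split a page into text segments, compute a few linguistic and structural features per segment, and classify each segment as article text or clutter. Everything specific here is an assumption made for illustration and not the authors' method: the set of block-level tags, the way inline text is merged into one segment per block, the features (word count, link density, average sentence length), and the hand-written threshold rule that stands in for a trained classifier.

```python
import re
from html.parser import HTMLParser

# Assumption: these tags delimit text segments; the paper's rules may differ.
BLOCK_TAGS = {"p", "div", "td", "li", "ul", "ol", "table", "tr",
              "h1", "h2", "h3", "body"}
SKIP_TAGS = {"script", "style"}


class BlockSplitter(HTMLParser):
    """Collects (text, anchor_chars) pairs, one per block-level segment."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._text = []
        self._anchor_chars = 0
        self._in_anchor = 0
        self._skip = 0

    def flush(self):
        # Merge the inline pieces seen so far into a single segment.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._anchor_chars))
        self._text = []
        self._anchor_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_anchor += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_anchor = max(0, self._in_anchor - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return  # ignore script and style bodies
        self._text.append(data)
        if self._in_anchor:
            self._anchor_chars += len(data.strip())


def features(text, anchor_chars):
    """Hypothetical feature set: two linguistic, one structural."""
    words = text.split()
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "words": len(words),
        "avg_sentence_len": len(words) / max(1, len(sentences)),
        "link_density": min(1.0, anchor_chars / len(text)),
    }


def looks_like_article(f):
    # Stand-in for a trained classifier; thresholds are arbitrary.
    return f["words"] >= 15 and f["link_density"] < 0.3


def extract_article(html):
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    splitter.flush()  # emit any trailing text outside a block tag
    return [t for t, a in splitter.blocks
            if looks_like_article(features(t, a))]
```

Calling `extract_article(page_source)` returns the segments that the rule accepts, in document order. In a setup closer to the one the abstract outlines, the feature dictionaries would be fed to classifiers trained on human-labeled segments instead of a fixed threshold.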