Boilerplate detection using shallow text features

Authors:
Christian Kohlschütter;Peter Fankhauser;Wolfgang Nejdl
Affiliations:
L3S Research Center / Leibniz Universität Hannover, Hannover, Germany;L3S Research Center / Leibniz Universität Hannover, Hannover, Germany;L3S Research Center / Leibniz Universität Hannover, Hannover, Germany
Venue:
Proceedings of the third ACM international conference on Web search and data mining
Year:
2010

Abstract

In addition to the actual content, Web pages contain navigational elements, templates, and advertisements. This boilerplate text is typically unrelated to the main content and may degrade search precision, so it needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal on retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach with straightforward heuristics, achieving remarkable detection accuracy.
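
To make the idea of classifying text elements by shallow text features concrete, here is a minimal sketch. It is not the classifier from the paper: the two features (word count and link density), the `TextBlock` type, the function names, and both thresholds are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """Hypothetical container for one text element of a Web page."""
    text: str          # all visible text in the block
    anchor_text: str   # the portion of that text that sits inside links


def num_words(s: str) -> int:
    """Count whitespace-separated tokens."""
    return len(s.split())


def link_density(block: TextBlock) -> float:
    """Fraction of the block's words that are link text (0.0 for empty blocks)."""
    total = num_words(block.text)
    return num_words(block.anchor_text) / total if total else 0.0


def is_boilerplate(block: TextBlock,
                   min_words: int = 10,
                   max_link_density: float = 0.33) -> bool:
    """Flag very short or link-heavy blocks as boilerplate.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    return (num_words(block.text) < min_words
            or link_density(block) > max_link_density)


# Three hand-made blocks: a navigation bar, an article sentence, a link list.
nav = TextBlock("Home News Sports Contact", "Home News Sports Contact")
article = TextBlock(
    "The committee published its final report on Tuesday after months "
    "of public hearings and review",
    "",
)
related = TextBlock(
    "Related stories: budget vote delayed again, council elects new chair, "
    "city park reopens next week",
    "budget vote delayed again, council elects new chair, "
    "city park reopens next week",
)
```

With these inputs, `is_boilerplate` flags `nav` (too short) and `related` (mostly link text) while keeping `article`. Both features need only a single pass over the text of each block, which is in the spirit of the low cost the abstract claims for shallow features.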