In addition to the actual content, Web pages consist of navigational elements, templates, and advertisements. This boilerplate text is typically unrelated to the main content, may deteriorate search precision, and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal on retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach with straightforward heuristics, achieving remarkable detection accuracy.
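
The idea of classifying individual text elements from shallow features can be illustrated with a minimal sketch. The abstract does not name the features or any thresholds, so everything below is an assumption made for illustration: the `TextBlock` structure, the two features (word count and link density), the rule in `is_boilerplate`, and its cut-off values are hypothetical and are not the paper's trained classifier.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """One text element of a page (hypothetical structure for this sketch)."""
    text: str            # all visible text of the element
    link_text: str = ""  # the portion of that text found inside anchor tags


def shallow_features(block: TextBlock) -> dict:
    """Compute two cheap, purely textual features for a block."""
    num_words = len(block.text.split())
    num_link_words = len(block.link_text.split())
    # Fraction of the block's words that are link text; 0.0 for empty blocks.
    link_density = num_link_words / num_words if num_words else 0.0
    return {"num_words": num_words, "link_density": link_density}


def is_boilerplate(block: TextBlock,
                   min_words: int = 10,
                   max_link_density: float = 0.33) -> bool:
    """Illustrative rule: short or link-heavy blocks count as boilerplate.

    The thresholds are assumed values, not figures from the paper.
    """
    features = shallow_features(block)
    return (features["num_words"] < min_words
            or features["link_density"] > max_link_density)


# Usage: a navigation bar versus a sentence of article text.
nav = TextBlock(text="Home News Sports Contact",
                link_text="Home News Sports Contact")
article = TextBlock(
    text=("The committee published its final report on Tuesday after months "
          "of hearings and set out detailed recommendations for the next "
          "budget cycle")
)
```

In this sketch the navigation block is flagged because it is short and consists entirely of link text, while the article sentence passes both checks. A learned classifier over such features would replace the fixed thresholds.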