Feeding the world: a comprehensive dataset and analysis of a real world snapshot of web feeds

Authors:
Sandro Reichert;David Urbansky;Klemens Muthmann;Philipp Katz;Matthias Wauer;Alexander Schill
Affiliations:
Institute of Systems Architecture, Dresden, Germany;Institute of Systems Architecture, Dresden, Germany;Institute of Systems Architecture, Dresden, Germany;Institute of Systems Architecture, Dresden, Germany;Institute of Systems Architecture, Dresden, Germany;Institute of Systems Architecture, Dresden, Germany
Venue:
Proceedings of the 13th International Conference on Information Integration and Web-based Applications and Services
Year:
2011

Abstract

Web feeds allow users to retrieve new content from pages on the World Wide Web. Feeds are offered by a multitude of web pages, ranging from conventional news sites to pages with user-generated content such as wikis, forums, or personal blogs. They notify interested readers of new content and are therefore interesting for information retrieval tasks. Unfortunately, no comprehensive dataset of feeds is publicly available, which makes it difficult for researchers to work with this kind of data and, more importantly, to compare their results on a common dataset. In this work we present an extensive real-world dataset of 200,000 diversified feeds, together with an analysis of it. The dataset was collected over a period of four weeks, yielding over 54 million entries and 100 GB of compressed data. One important outcome of the analysis is that feeds show different activity patterns, which aggregators such as feed reader software should consider to improve their polling strategies. The dataset has been made publicly available for use by research communities around the world.
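
To illustrate what taking activity patterns into account could mean for an aggregator, the following is a minimal sketch, not the method used in the paper. It assumes the aggregator keeps the publication timestamps of a feed's recent entries and derives the next polling interval from the mean gap between them. The function name `next_poll_interval` and the bounds `min_interval` and `max_interval` are hypothetical and chosen only for this example.

```python
from datetime import datetime, timedelta


def next_poll_interval(entry_times,
                       min_interval=timedelta(minutes=5),
                       max_interval=timedelta(days=1)):
    """Estimate how long to wait before polling a feed again.

    Hypothetical helper: uses the mean gap between the feed's recent
    entry timestamps, clamped to [min_interval, max_interval].
    """
    times = sorted(entry_times)
    if len(times) < 2:
        # Too little history to estimate an update rate; poll rarely.
        return max_interval
    mean_gap = (times[-1] - times[0]) / (len(times) - 1)
    return max(min_interval, min(mean_gap, max_interval))


# A feed that posted once per hour versus one that posted once per week.
start = datetime(2011, 1, 1)
hourly = [start + timedelta(hours=i) for i in range(5)]
weekly = [start + timedelta(weeks=i) for i in range(5)]
print(next_poll_interval(hourly))
print(next_poll_interval(weekly))
```

Under these assumptions, the hourly feed is polled about once per hour, while the weekly feed is capped at the maximum interval of one day. A real aggregator would likely need a richer model, since the abstract reports that feeds show several different activity patterns rather than a single steady rate.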