RetriBlog: a framework for creating blog crawlers

Authors:
Rafael Ferreira;Rinaldo Lima;Jean Melo;Evandro Costa;Fred Freitas;Henrique Pacca
Affiliations:
Federal University of Pernambuco, Recife, Brazil;Federal University of Pernambuco, Recife, Brazil;Federal University of Alagoas, Maceió, Brazil;Federal University of Alagoas, Maceió, Brazil;Federal University of Pernambuco, Recife, Brazil;Federal University of Alagoas, Maceió, Brazil
Venue:
Proceedings of the 27th Annual ACM Symposium on Applied Computing
Year:
2012

Abstract

Blogs are becoming an important social tool. Through blogs, bloggers share their likes and dislikes, express their opinions, report news, and form groups around particular subjects. The information available in the blogosphere can therefore help in the creation of interesting applications in various domains, such as e-learning, e-commerce, and e-government. However, due to the increasing number of blogs posted every day on the Web and the dynamic nature of the blogosphere, the tasks of collecting and extracting relevant information from blogs have become hard and time-consuming. In this paper, we use techniques from both the information retrieval and information extraction fields to deal with this problem. Since blogs have many points of variability, it is necessary to provide applications that can be easily adapted. We present the RetriBlog system, a framework for the development of blog crawlers that deals with the variations in blogs. This paper presents the details of RetriBlog and an evaluation of the proposed algorithms.