iRobot: an intelligent crawler for web forums

Authors:
Rui Cai;Jiang-Ming Yang;Wei Lai;Yida Wang;Lei Zhang
Affiliations:
Microsoft Research Asia, Beijing, China (all authors)
Venue:
Proceedings of the 17th international conference on World Wide Web
Year:
2008

Abstract

This paper studies Web forum crawling, a fundamental step in many Web applications such as search engines and Web data mining. As a typical form of user-created content (UCC), Web forums have become an important resource on the Web because of the rich information contributed by millions of Internet users every day. However, forum crawling is not trivial: forum sites have deep link structures, large numbers of duplicate pages, and many invalid pages caused by login failures. We propose and build a prototype of an intelligent forum crawler, iRobot, which understands the content and structure of a forum site and then decides which traversal paths to follow among the different kinds of pages. To do this, we first randomly sample (download) a few pages from the target forum site, then use page content layout as the feature for grouping the sampled pages and reconstructing the forum's sitemap. We then select an optimal crawling path that traverses only informative pages and skips invalid and duplicate ones. Extensive experiments on several forums show the performance of the system in three respects: 1) Effectiveness: compared to a generic crawler, iRobot significantly reduces the number of duplicate and invalid pages fetched; 2) Efficiency: at the small cost of pre-sampling a few pages to learn the necessary knowledge, iRobot saves substantial network bandwidth and storage because it fetches only informative pages from a forum site; and 3) Long threads that are divided into multiple pages can be re-concatenated and archived as whole threads, which greatly helps further indexing and data mining.
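
The grouping step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the layout features, so the sketch assumes a crude fingerprint (the set of root-to-node tag paths in a page), and every name in it (`layout_signature`, `group_by_layout`, `select_informative`, the `is_invalid` predicate) is hypothetical. How iRobot actually detects invalid pages is not stated in the abstract, so that decision is left to a caller-supplied placeholder.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never take a closing tag; they are recorded but not pushed.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths seen in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def layout_signature(html):
    """Assumed layout feature: a frozenset of tag paths (ignores repetition)."""
    parser = TagPathCollector()
    parser.feed(html)
    parser.close()
    return frozenset(parser.paths)


def group_by_layout(pages):
    """Group sampled pages (url -> html) that share the same signature."""
    groups = defaultdict(list)
    for url, html in pages.items():
        groups[layout_signature(html)].append(url)
    return {sig: sorted(urls) for sig, urls in groups.items()}


def select_informative(groups, is_invalid):
    """Keep only groups whose signature the placeholder predicate accepts."""
    return {sig: urls for sig, urls in groups.items() if not is_invalid(sig)}
```

Because the signature is a set of paths, two thread pages with different numbers of posts fall into the same group, while a login page with a form lands in its own group and can be skipped. A real crawler would need a more tolerant similarity measure than exact signature equality.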