FoCUS: learning to crawl web forums

Authors:
Jingtian Jiang; Nenghai Yu; Chin-Yew Lin
Affiliations:
University of Science and Technology of China, Hefei, China; University of Science and Technology of China, Hefei, China; Microsoft Research Asia, Beijing, China
Venue:
Proceedings of the 21st international conference companion on World Wide Web
Year:
2012

Abstract

In this paper, we present FoCUS (Forum Crawler Under Supervision), a supervised web-scale forum crawler. The goal of FoCUS is to trawl only relevant forum content from the web with minimal overhead. Forum threads contain the information content that forum crawlers target. Although forums have different layouts or styles and are powered by different forum software packages, they always have similar implicit navigation paths, connected by specific URL types, that lead users from entry pages to thread pages. Based on this observation, we reduce the web forum crawling problem to a URL type recognition problem and show how to learn accurate and effective regular expression patterns of implicit navigation paths from an automatically created training set, using aggregated results from weak page type classifiers. Robust page type classifiers can be trained from as few as 5 annotated forums and applied to a large set of unseen forums. Our test results show that FoCUS achieved over 98% effectiveness and 97% coverage on a large set of test forums powered by over 150 different forum software packages.
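
To make the reduction to URL type recognition concrete, the following minimal Python sketch shows how a crawler might use regular expression patterns to decide which links to follow. It is an illustration only, not the FoCUS implementation: the patterns are hand-written for an imaginary forum rather than learned, and the names `URL_PATTERNS`, `classify_url`, and `select_links`, as well as the URL type labels, are assumptions introduced here.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical patterns for one imaginary forum. FoCUS learns such patterns
# automatically; these are hand-written purely for illustration.
URL_PATTERNS = {
    "index": re.compile(r"^/forum/\d+/?$"),
    "thread": re.compile(r"^/thread/\d+/?$"),
    "page_flip": re.compile(r"^/(?:forum|thread)/\d+/page/\d+/?$"),
}


def classify_url(path: str) -> Optional[str]:
    """Return the URL type of a path, or None if no pattern matches."""
    for url_type, pattern in URL_PATTERNS.items():
        if pattern.match(path):
            return url_type
    return None


def select_links(paths: List[str]) -> List[Tuple[str, str]]:
    """Keep only links whose URL type lies on the navigation path."""
    selected = []
    for path in paths:
        url_type = classify_url(path)
        if url_type is not None:
            selected.append((path, url_type))
    return selected
```

Under these assumed patterns, `select_links(["/forum/12", "/member/login", "/thread/345/page/2"])` keeps the first and third links and drops the login link, which is the sense in which recognizing URL types lets a crawler avoid irrelevant pages.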