A Generalized Links and Text Properties Based Forum Crawler

  • Authors:
  • Amit Sachan;Wee-Yong Lim;Vrizlynn L. L. Thing

  • Venue:
  • WI-IAT '12 Proceedings of the 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
  • Year:
  • 2012

Abstract

Web forums have become a major source for information gathering and mining because of their large volume of user-generated content, and crawling is necessary to collect that information. However, a generic web crawler cannot crawl web forums efficiently or effectively because they contain many redundant and duplicate pages. In addition, there is a crawling relationship among the useful pages that needs to be considered. Efficient crawling therefore requires eliminating redundant and duplicate pages and understanding this crawling relationship. Existing work on forum crawling relies on methods based on visual pattern recognition, which makes it extremely computationally expensive. In this paper, we propose a novel lightweight crawling method that uses the text and link properties of pages in web forums. Theoretical analysis and experimental results show the effectiveness and efficiency of the proposed method.
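
To make the idea of duplicate pages concrete, the following minimal sketch shows one generic way a crawler might use link properties to skip forum URLs that point to the same content. It is an illustration only and is not the method proposed in the paper, which the abstract does not detail. The names `NOISE_PARAMS`, `canonical_url`, and `filter_duplicates`, the listed query parameters, and the example URLs are all assumptions made for this example.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical query parameters that often change a forum URL
# without changing the page content (session IDs, sort order, ...).
NOISE_PARAMS = {"sid", "sessionid", "sort", "order", "highlight"}


def canonical_url(url: str) -> str:
    """Return a normalized form of the URL for duplicate detection."""
    parts = urlsplit(url)
    # Drop noisy parameters and sort the rest so parameter order does not matter.
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key.lower() not in NOISE_PARAMS
    )
    # The fragment is discarded: it addresses a position inside the same page.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


def filter_duplicates(urls):
    """Keep the first URL seen for each canonical form, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = canonical_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


if __name__ == "__main__":
    candidates = [
        "http://forum.example/viewtopic.php?t=5&sid=abc",
        "http://forum.example/viewtopic.php?sid=xyz&t=5",
        "http://forum.example/viewtopic.php?t=6#p10",
    ]
    for url in filter_duplicates(candidates):
        print(url)
```

In this sketch the first two candidate URLs differ only in a session identifier and parameter order, so only the first is kept. A real forum crawler would need site-specific evidence, such as the text properties mentioned in the abstract, to decide which parameters are safe to ignore.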