Complete-Thread extraction from web forums

  • Authors:
  • Fanghuai Hu;Tong Ruan;Zhiqing Shao

  • Affiliations:
  • Department of Computer Science and Engineering, East China University of Science and Technology, China (all authors)

  • Venue:
  • APWeb'12 Proceedings of the 14th Asia-Pacific international conference on Web Technologies and Applications
  • Year:
  • 2012

Abstract

This paper proposes an algorithm that automatically extracts all meta-information of threads from various forums. The algorithm has two steps: thread extraction from board pages, and detailed information extraction from thread pages. In the thread extraction step, board pages are divided into five types according to their structure, and a corresponding extraction algorithm and model is proposed for each type. In the second step, a method is first applied to identify the content of the original post. The remaining un-extracted fields of the original post, which are always located around the content, are then matched by regular patterns, and a model is trained to extract the reply posts. Experiments show that the proposed algorithm is accurate and effective.
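
As an illustration of the regular-pattern matching described in the second step, the following is a minimal sketch, not the authors' implementation. It assumes the thread page has already been reduced to a list of text lines and that the index of the original post's content is known. The patterns, the function name extract_nearby_fields, and the window parameter are all hypothetical; the abstract does not specify the actual patterns or fields.

```python
import re

# Hypothetical patterns: the abstract does not list the actual regular patterns.
DATE_RE = re.compile(r"\b(\d{4}-\d{1,2}-\d{1,2})(?:\s+(\d{1,2}:\d{2}))?")
AUTHOR_RE = re.compile(r"(?:posted by|author)\s*[:：]?\s*(\S+)", re.IGNORECASE)


def extract_nearby_fields(lines, content_index, window=3):
    """Search the lines within `window` of the content line for an author and a date.

    Assumption: fields of the original post sit close to its content,
    so only a small neighbourhood of lines needs to be scanned.
    """
    start = max(0, content_index - window)
    end = min(len(lines), content_index + window + 1)
    fields = {}
    for i in range(start, end):
        if i == content_index:
            continue  # skip the content itself; only its surroundings are searched
        line = lines[i]
        if "author" not in fields:
            match = AUTHOR_RE.search(line)
            if match:
                fields["author"] = match.group(1)
        if "date" not in fields:
            match = DATE_RE.search(line)
            if match:
                fields["date"] = match.group(1)
    return fields


# Made-up sample page text, for illustration only.
sample = [
    "Board index",
    "Posted by: alice",
    "2012-04-11 09:30",
    "How do I parse forum pages reliably?",
    "Reply",
]
fields = extract_nearby_fields(sample, content_index=3)
```

Restricting the search to a small window around the content follows the abstract's observation that the remaining fields are always located near it; the window size used here is an arbitrary choice.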