Automatic Data Extraction from Web Discussion Forums

  • Authors:
  • Suke Li; Liyong Tang; Jianbin Hu; Zhong Chen

  • Venue:
  • FCST '09 Proceedings of the 2009 Fourth International Conference on Frontier of Computer Science and Technology
  • Year:
  • 2009
  • Comparable fora

    BUCC '11 Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web

Abstract

This paper presents an approach for automatically extracting information from web discussion forums. HTML tag paths built from an HTML DOM tree are used to generate the post extraction template. Visual text features and HTML structure information from the same page are combined to extract the author profile, posting date, and post content. Experimental results show that the approach is effective.
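
The abstract does not describe the template-generation algorithm in detail, so the following is only a minimal sketch of the general tag-path idea, not the authors' method. It assumes that forum posts appear as repeated sibling structures, so text nodes that share the same root-to-node tag path are likely to be the same field across posts. The names TagPathCollector, tag_paths, group_by_repeated_path, and SAMPLE_PAGE are hypothetical, and the sample markup is invented for illustration.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are kept off the path stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Records the root-to-node tag path of every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open tags, outermost first
        self.text_paths = []  # (tag path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.text_paths.append(("/".join(self.stack), text))


def tag_paths(html):
    """Return (tag path, text) pairs for all text nodes in the page."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.text_paths


def group_by_repeated_path(html, min_repeats=2):
    """Group text by tag path and keep paths that repeat on the page.

    Repeated paths are treated as candidate fields of a post template
    (an assumption of this sketch, not a claim about the paper).
    """
    groups = {}
    for path, text in tag_paths(html):
        groups.setdefault(path, []).append(text)
    return {p: t for p, t in groups.items() if len(t) >= min_repeats}


# Invented sample page: three posts with the same structure.
SAMPLE_PAGE = """
<html><body>
<h1>Thread title</h1>
<div class="post"><span>alice</span><p>First post</p></div>
<div class="post"><span>bob</span><p>Second post</p></div>
<div class="post"><span>carol</span><p>Third post</p></div>
</body></html>
"""

if __name__ == "__main__":
    for path, texts in group_by_repeated_path(SAMPLE_PAGE).items():
        print(path, texts)
```

In this sketch the one-off heading path is discarded, while the repeated paths under each post container survive as candidate author and content fields. A real extractor would also need the visual text features mentioned in the abstract to tell such fields apart; that part is not sketched here.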