An information extraction engine for web discussion forums

  • Authors:
  • Hanny Yulius Limanto;Nguyen Ngoc Giang;Vo Tan Trung;Jun Zhang;Qi He;Nguyen Quang Huy

  • Affiliations:
  • Nanyang Technological University, Nanyang Avenue, Singapore;Nanyang Technological University, Nanyang Avenue, Singapore;Nanyang Technological University, Nanyang Avenue, Singapore;Nanyang Technological University, Nanyang Avenue, Singapore;Nanyang Technological University, Nanyang Avenue, Singapore;Nanyang Technological University, Nanyang Avenue, Singapore

  • Venue:
  • WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
  • Year:
  • 2005

Abstract

In this poster, we present an information extraction engine for web-based forums. The engine analyzes the HTML files crawled from web forums, deduces the wrapper (template) of the pages, and extracts information about posts (e.g., author, title, content, and number of replies and views). Extraction is an important module of a forum search engine, since it helps the engine understand the content of a forum HTML page and facilitates ranking during retrieval. We discuss the system architecture of the extraction engine in the context of a forum search engine and present its components. We also briefly introduce the extraction process and discuss some implementation issues.
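
To make the idea of applying a wrapper concrete, the following is a minimal sketch and not the authors' implementation. It assumes the wrapper has already been deduced and can be expressed as a mapping from CSS class names to post fields; the class names (`post`, `post-author`, `post-title`, `post-body`) and the helper `extract_posts` are hypothetical, and the wrapper deduction step described in the abstract is not attempted here.

```python
from html.parser import HTMLParser

# Hypothetical wrapper: CSS class names (assumed for illustration, not taken
# from the poster) mapped to the post fields they contain.
WRAPPER = {"post-author": "author", "post-title": "title", "post-body": "content"}

# Elements that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PostExtractor(HTMLParser):
    """Collects one dict per post by applying a fixed class-to-field wrapper."""

    def __init__(self, wrapper, post_class="post"):
        super().__init__()
        self.wrapper = wrapper
        self.post_class = post_class
        self.posts = []
        self._field = None  # field currently being captured, if any
        self._depth = 0     # nesting depth inside the captured element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._field is not None:
            # Nested markup inside a field: track depth, keep capturing text.
            self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.post_class in classes:
            self.posts.append({})  # a new post record starts here
            return
        for cls in classes:
            if cls in self.wrapper and self.posts:
                self._field = self.wrapper[cls]
                self._depth = 1
                self.posts[-1][self._field] = ""
                return

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._field is None:
            return
        self._depth -= 1
        if self._depth == 0:
            # The field's element has closed: tidy whitespace and stop capturing.
            self.posts[-1][self._field] = self.posts[-1][self._field].strip()
            self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self.posts[-1][self._field] += data


def extract_posts(html, wrapper=WRAPPER):
    """Return a list of {field: text} dicts, one per post found in `html`."""
    parser = PostExtractor(wrapper)
    parser.feed(html)
    parser.close()
    return parser.posts
```

Calling `extract_posts` on a crawled page returns one dictionary per post, which a forum search engine could then index or use for ranking. A real engine would also need to infer the wrapper from the pages themselves rather than rely on a hand-written mapping.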