An information extraction engine for web discussion forums

Authors:
Hanny Yulius Limanto;Nguyen Ngoc Giang;Vo Tan Trung;Jun Zhang;Qi He;Nguyen Quang Huy
Affiliations:
Nanyang Technological University, Nanyang Avenue, Singapore;Nanyang Technological University, Nanyang Avenue, Singapore;Nanyang Technological University, Nanyang Avenue, Singapore;Nanyang Technological University, Nanyang Avenue, Singapore;Nanyang Technological University, Nanyang Avenue, Singapore;Nanyang Technological University, Nanyang Avenue, Singapore
Venue:
WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Year:
2005

Abstract

In this poster, we present an information extraction engine for web-based forums. The engine analyzes HTML files crawled from web forums, deduces the wrapper (template) of the pages, and extracts information about posts (e.g., author, title, content, and number of replies and views). Extraction is an important module of a forum search engine, since it helps the search engine understand the content of a forum HTML page and facilitates ranking during retrieval. We discuss the system architecture of the extraction engine in the context of a forum search engine and present its components. We also briefly introduce the extraction process and discuss some implementation issues.
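
To make the extraction step concrete, the following is a minimal sketch and not the engine described in the poster. The abstract does not specify the wrapper format or the deduction algorithm, so the sketch assumes a wrapper has already been deduced and can be written as a mapping from field names to HTML class names. The names WRAPPER, PostExtractor, extract_posts and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical wrapper (assumption): the class that marks a post container,
# plus the class that marks each field inside it.
WRAPPER = {
    "post": "post",
    "fields": {"author": "author", "title": "title", "content": "content"},
}

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PostExtractor(HTMLParser):
    """Collects one dict per post by applying a class-based wrapper."""

    def __init__(self, wrapper):
        super().__init__()
        self.post_class = wrapper["post"]
        self.class_to_field = {cls: name for name, cls in wrapper["fields"].items()}
        self.posts = []
        self.stack = []  # one marker per open tag: "post", a field name, or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        marker = None
        if self.post_class in classes:
            self.posts.append({})
            marker = "post"
        else:
            for cls in classes:
                if cls in self.class_to_field:
                    marker = self.class_to_field[cls]
                    break
        self.stack.append(marker)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or "post" not in self.stack:
            return
        # Attribute the text to the innermost enclosing field, if any.
        for marker in reversed(self.stack):
            if marker == "post":
                return
            if marker is not None:
                post = self.posts[-1]
                post[marker] = (post.get(marker, "") + " " + text).strip()
                return


def extract_posts(html, wrapper=WRAPPER):
    parser = PostExtractor(wrapper)
    parser.feed(html)
    parser.close()
    return parser.posts


# Made-up forum markup used only to exercise the sketch.
SAMPLE = """
<div class="post"><span class="author">alice</span>
<h2 class="title">Wrapper question</h2>
<div class="content">How is the <b>template</b> deduced</div></div>
<div class="post"><span class="author">bob</span>
<h2 class="title">Re: Wrapper question</h2>
<div class="content">By comparing pages.</div></div>
"""

if __name__ == "__main__":
    for post in extract_posts(SAMPLE):
        print(post)
```

A real forum page would rarely expose such clean class names, which is why the engine deduces the wrapper from the crawled pages instead of relying on a hand-written one as this sketch does.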