Complete-Thread extraction from web forums

Authors:
Fanghuai Hu;Tong Ruan;Zhiqing Shao
Affiliations:
Department of Computer Science and Engineering, East China University of Science and Technology, China;Department of Computer Science and Engineering, East China University of Science and Technology, China;Department of Computer Science and Engineering, East China University of Science and Technology, China
Venue:
APWeb'12 Proceedings of the 14th Asia-Pacific international conference on Web Technologies and Applications
Year:
2012

Abstract

This paper proposes an algorithm that automatically extracts all meta-information of threads from a variety of web forums. The algorithm has two steps: thread extraction from board pages, and detailed information extraction from thread pages. In the first step, board pages are divided into five types according to their structure, and a corresponding extraction algorithm or model is proposed for each type. In the second step, a dedicated method first identifies the content of the original post; the remaining fields of the original post, which are always located around the content, are then matched by regular patterns; finally, a trained model extracts the reply posts. Experiments show that the proposed algorithm is accurate and effective.
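
To make the second step more concrete, the following is a minimal Python sketch of how fields located around an already identified post content might be matched by patterns. It assumes that "regular patterns" refers to regular expressions. The function name `extract_nearby_fields`, the two patterns, and the `window` parameter are hypothetical illustrations; the abstract does not specify the actual patterns or field set used by the authors.

```python
import re

# Hypothetical patterns for illustration only; the paper's actual
# patterns are not given in the abstract.
DATE_RE = re.compile(r"\b\d{4}-\d{1,2}-\d{1,2}(?:\s+\d{1,2}:\d{2})?")
AUTHOR_RE = re.compile(r"(?:Posted by|Author)\s*:?\s*(\S+)", re.IGNORECASE)


def extract_nearby_fields(page_text, content, window=200):
    """Search the text just before and after the post content
    for an author name and a posting date."""
    start = page_text.find(content)
    if start == -1:
        # The content was not located, so there is no context to search.
        return {}
    end = start + len(content)
    before = page_text[max(0, start - window):start]
    after = page_text[end:end + window]
    context = before + "\n" + after

    fields = {}
    author = AUTHOR_RE.search(context)
    if author:
        fields["author"] = author.group(1)
    date = DATE_RE.search(context)
    if date:
        fields["date"] = date.group(0)
    return fields


page = (
    "Author: alice 2012-04-11 09:30\n"
    "How do I crawl a forum board page?\n"
    "Reply | Quote"
)
fields = extract_nearby_fields(page, "How do I crawl a forum board page?")
print(fields)
```

Restricting the search to a window around the content reflects the abstract's observation that the remaining fields are located near the content; the window size here is an arbitrary choice.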