Juicer: Scalable Extraction for Thread Meta-information of Web Forum

Authors:
Yan Guo;Yu Wang;Guodong Ding;Donglin Cao;Gang Zhang;Yi Lv
Affiliations:
Institute of Computing Technology, Chinese Academy of Sciences;Institute of Computing Technology, Chinese Academy of Sciences;Institute of Computing Technology, Chinese Academy of Sciences;Institute of Computing Technology, Chinese Academy of Sciences;Institute of Computing Technology, Chinese Academy of Sciences;State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences
Venue:
PAISI '09 Proceedings of the Pacific Asia Workshop on Intelligence and Security Informatics
Year:
2009



Abstract

In Web forums, the thread meta-information contained in the list of threads on a board page provides fundamental data for further forum mining. This paper describes a complete system named Juicer, which was developed as a subsystem of an industrial application that involves forum mining. Juicer's task is to extract thread meta-information from the board pages of a large number of large-scale online Web forums, which requires scalable extraction with high accuracy and speed and with minimal user effort for maintenance. Among the many existing approaches to information extraction, we could not find one that fully satisfies these requirements, so we present the simple, scalable extraction approach behind Juicer. Juicer consists of four modules: Template generation, Specifying labeling setting, Automatic extraction, and Label assignment. Both experiments and practical use show that Juicer satisfies the requirements.