Towards combining web classification and web information extraction: a case study

Authors:
Ping Luo; Fen Lin; Yuhong Xiong; Yong Zhao; Zhongzhi Shi
Affiliations:
HP Labs China, Beijing, China; Institute of Computing Technology, CAS, Beijing, China; HP Labs China, Beijing, China; HP Labs China, Beijing, China; Institute of Computing Technology, CAS, Beijing, China
Venue:
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2009

Abstract

Web content analysis often proceeds in two sequential, separate steps: Web classification, which identifies the target Web pages, and Web information extraction, which extracts the metadata contained in those pages. This decoupled strategy is highly ineffective because errors in Web classification propagate to Web information extraction and accumulate. In this paper we study the mutual dependencies between the two steps and propose to combine them in a single model based on Conditional Random Fields (CRFs). This model simultaneously recognizes the target Web pages and extracts the corresponding metadata. Systematic experiments in our project OfCourse for online course search show that the model significantly improves the F1 value of both steps. We believe that the model can be easily generalized to many Web applications.
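The contrast between the decoupled pipeline and joint inference can be illustrated with a toy sketch. This is not the model from the paper: the label names, scores, and the functions `viterbi`, `pipeline_decode`, and `joint_decode` are all hypothetical, and the scoring is a plain sum of hand-picked numbers rather than a trained CRF. It only shows how deciding the page label and the token labels together lets strong token-level evidence correct a weak page-level decision.

```python
# Toy illustration only: hand-picked scores, not a trained CRF.

def viterbi(emissions, transitions):
    """Return (best_score, best_path) for a linear chain.

    emissions: list of dicts mapping label -> score, one dict per token.
    transitions: dict mapping (prev_label, cur_label) -> score (default 0).
    """
    labels = list(emissions[0])
    best = {lab: (emissions[0][lab], [lab]) for lab in labels}
    for emit in emissions[1:]:
        new_best = {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[p][0] + transitions.get((p, cur), 0.0))
            score = best[prev][0] + transitions.get((prev, cur), 0.0) + emit[cur]
            new_best[cur] = (score, best[prev][1] + [cur])
        best = new_best
    return max(best.values(), key=lambda t: t[0])


def extract(page_label, emissions, transitions):
    """Label tokens given a page label; non-target pages carry no metadata."""
    if page_label != "course":
        # Assumption: on a non-target page every token must be labeled "O".
        emissions = [{"O": e["O"]} for e in emissions]
    return viterbi(emissions, transitions)


def pipeline_decode(page_scores, emissions, transitions):
    """Decoupled strategy: classify first, then extract."""
    page_label = max(page_scores, key=page_scores.get)
    _, path = extract(page_label, emissions, transitions)
    return page_label, path


def joint_decode(page_scores, emissions, transitions):
    """Joint strategy: pick the page label and token labels with the best total score."""
    candidates = []
    for page_label, page_score in page_scores.items():
        seq_score, path = extract(page_label, emissions, transitions)
        candidates.append((page_score + seq_score, page_label, path))
    _, page_label, path = max(candidates, key=lambda c: c[0])
    return page_label, path


# Hypothetical scores: the page classifier slightly prefers "other",
# but the first two tokens look strongly like a course title.
page_scores = {"course": 0.4, "other": 0.5}
emissions = [
    {"TITLE": 2.0, "O": 0.0},
    {"TITLE": 1.5, "O": 0.0},
    {"TITLE": 0.0, "O": 1.0},
]
transitions = {}

print(pipeline_decode(page_scores, emissions, transitions))
print(joint_decode(page_scores, emissions, transitions))
```

In this sketch the pipeline commits to the wrong page label and therefore extracts nothing, while the joint decoder recovers both the page label and the title tokens. Enumerating page labels is feasible here only because there are two of them.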