Stylistic and lexical co-training for web block classification

Authors:
Chee How Lee; Min-Yen Kan; Sandra Lai
Affiliations:
National University of Singapore; National University of Singapore; National University of Singapore
Venue:
Proceedings of the 6th annual ACM international workshop on Web information and data management
Year:
2004

Abstract

Many applications that use web data extract information from only a limited number of regions on a web page. As a result, dividing a web page into blocks and then classifying those blocks has become a preprocessing step. We introduce PARCELS, an open-source, co-trained approach that classifies blocks using separate stylistic and lexical views of the web page. Unlike previous work, PARCELS classifies fine-grained blocks. In addition to table-based layouts, the system handles real-world pages whose layout is based on divisions and spans, and it performs stylistic inference for pages that use Cascading Style Sheets. Our evaluation shows that co-training reduces the error rate by 28.5% compared with a single-view classifier and that our approach is comparable to other state-of-the-art systems.
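
The two-view idea in the abstract can be illustrated with a minimal sketch of a generic co-training loop. This is not the PARCELS implementation: the abstract does not say which learners or features the system uses, so the naive Bayes classifier, the token features, the block labels ("nav", "content"), and every function name below are assumptions chosen only for illustration. Each block is represented by a stylistic view and a lexical view; in every round, each view's classifier labels the unlabeled block it is most confident about and adds it to the shared training pool.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial naive Bayes over token lists (hypothetical learner)."""

    def fit(self, docs, labels):
        self.priors = Counter(labels)
        self.counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.counts[label].update(tokens)
        self.vocab = {t for c in self.counts.values() for t in c}
        return self

    def predict(self, tokens):
        """Return (best_label, posterior probability of best_label)."""
        total = sum(self.priors.values())
        log_scores = {}
        for label, prior in self.priors.items():
            # Laplace smoothing; the extra +1 reserves mass for unseen tokens.
            denom = sum(self.counts[label].values()) + len(self.vocab) + 1
            score = math.log(prior / total)
            for t in tokens:
                score += math.log((self.counts[label][t] + 1) / denom)
            log_scores[label] = score
        best = max(log_scores, key=log_scores.get)
        top = log_scores[best]
        norm = sum(math.exp(s - top) for s in log_scores.values())
        return best, 1.0 / norm


def train_views(labeled):
    """Train one classifier per view on (style, lexical, label) triples."""
    labels = [y for _, _, y in labeled]
    style_clf = NaiveBayes().fit([s for s, _, _ in labeled], labels)
    lex_clf = NaiveBayes().fit([w for _, w, _ in labeled], labels)
    return style_clf, lex_clf


def co_train(labeled, unlabeled, rounds=5, per_round=1):
    """Generic co-training: each view labels its most confident blocks."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        classifiers = train_views(labeled)
        for view, clf in enumerate(classifiers):
            scored = sorted(
                ((clf.predict(ex[view]), ex) for ex in pool),
                key=lambda pair: pair[0][1],
                reverse=True,
            )
            for (label, _), ex in scored[:per_round]:
                labeled.append((ex[0], ex[1], label))
                pool.remove(ex)
    return train_views(labeled)


def classify(style_clf, lex_clf, style_tokens, lexical_tokens):
    """Use the prediction of whichever view is more confident."""
    style_pred = style_clf.predict(style_tokens)
    lex_pred = lex_clf.predict(lexical_tokens)
    return style_pred[0] if style_pred[1] >= lex_pred[1] else lex_pred[0]


# Toy data (invented): (stylistic tokens, lexical tokens[, label]).
LABELED = [
    (["a", "ul", "small"], ["home", "about", "contact"], "nav"),
    (["p", "div", "normal"], ["the", "study", "shows", "results"], "content"),
]
UNLABELED = [
    (["a", "ul", "small"], ["products", "login"]),
    (["p", "div", "normal"], ["results", "of", "the", "study"]),
    (["a", "span", "small"], ["home", "login"]),
    (["p", "span", "normal"], ["the", "products", "study"]),
]

style_clf, lex_clf = co_train(LABELED, UNLABELED)
print(classify(style_clf, lex_clf, ["a", "ul", "small"], ["login", "contact"]))
print(classify(style_clf, lex_clf, ["p", "div", "normal"], ["the", "results"]))
```

Choosing the more confident view at prediction time is only one way to combine the two classifiers; multiplying the per-view posteriors is another common option, and the abstract does not state which, if either, PARCELS uses.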