Many applications that use web data extract information from only a limited number of regions on a web page. As a result, dividing a web page into blocks and then classifying those blocks has become a common preprocessing step. We introduce PARCELS, an open-source, co-trained approach that classifies blocks using two separate views of the web page: a stylistic view and a lexical view. Unlike previous work, PARCELS performs classification on fine-grained blocks. In addition to table-based layouts, the system handles real-world pages whose layout is built from divisions and spans, and it performs stylistic inference for pages that use Cascading Style Sheets (CSS). Our evaluation shows that the co-training process reduces the error rate by 28.5% compared with a single-view classifier, and that our approach is comparable to other state-of-the-art systems.
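To make the idea of co-training over two views concrete, the following is a minimal, generic sketch in Python. It is not the PARCELS implementation: the `NaiveBayes` class, the `train_views`, `co_train`, and `predict` functions, the `min_conf` threshold, the block labels (`nav`, `content`), and the toy style and text tokens are all hypothetical choices made for illustration. Each block is represented as a pair of token lists, one per view. In every round, a classifier trained on each view pseudo-labels the unlabeled blocks it is most confident about, and those blocks join the shared training set.

```python
import math
from collections import Counter


class NaiveBayes:
    """Tiny multinomial naive Bayes over token lists (add-one smoothing)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for tokens, label in zip(docs, labels):
            self.counts[label].update(tokens)
        self.vocab = set().union(*self.counts.values())
        return self

    def predict_proba(self, tokens):
        n = sum(self.prior.values())
        scores = {}
        for c in self.classes:
            total = sum(self.counts[c].values())
            score = math.log(self.prior[c] / n)
            for t in tokens:
                if t in self.vocab:  # tokens never seen in training are ignored
                    score += math.log((self.counts[c][t] + 1) / (total + len(self.vocab)))
            scores[c] = score
        top = max(scores.values())
        exp = {c: math.exp(s - top) for c, s in scores.items()}
        z = sum(exp.values())
        return {c: v / z for c, v in exp.items()}


def train_views(labeled):
    """Train one classifier per view (0 = stylistic, 1 = lexical)."""
    labels = [y for _, y in labeled]
    return [NaiveBayes().fit([x[v] for x, _ in labeled], labels) for v in (0, 1)]


def co_train(labeled, unlabeled, rounds=5, min_conf=0.6):
    """Grow the labeled set with each view's most confident pseudo-labels."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        picked = {}  # pool index -> pseudo-label
        for view, clf in enumerate(train_views(labeled)):
            best = {}  # label -> (confidence, pool index)
            for i, x in enumerate(pool):
                probs = clf.predict_proba(x[view])
                label = max(probs, key=probs.get)
                conf = probs[label]
                if conf >= min_conf and conf > best.get(label, (0.0, None))[0]:
                    best[label] = (conf, i)
            for label, (_, i) in best.items():
                picked.setdefault(i, label)  # on disagreement, the first view wins
        if not picked:
            break
        labeled += [(pool[i], label) for i, label in picked.items()]
        pool = [x for i, x in enumerate(pool) if i not in picked]
    return train_views(labeled), labeled


def predict(clfs, block):
    """Combine both views by multiplying their class probabilities."""
    p0, p1 = clfs[0].predict_proba(block[0]), clfs[1].predict_proba(block[1])
    return max(p0, key=lambda c: p0[c] * p1[c])


# Toy data: each block is (style_tokens, text_tokens). All values are made up.
seed = [
    ((["font-small", "link-dense"], ["home", "about", "contact"]), "nav"),
    ((["font-normal", "text-dense"], ["the", "study", "shows", "results"]), "content"),
]
unlabeled = [
    (["font-small", "link-dense"], ["products", "contact", "login"]),
    (["font-normal", "text-dense"], ["results", "were", "reported", "the"]),
    (["link-dense"], ["login", "help"]),
    (["text-dense"], ["reported", "study"]),
]
clfs, labeled = co_train(seed, unlabeled)
print(predict(clfs, (["link-dense", "font-small"], ["help", "contact"])))
```

In this toy run, the block `(["link-dense"], ["login", "help"])` shares no words with the seed examples, so the lexical view alone cannot label it at first. It becomes classifiable only after the stylistic view's confident pseudo-labels have added `login` to the training data. This is the kind of cross-view benefit that co-training aims for, though a real system would need far richer features and a held-out evaluation.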