A Learning Approach to Discovering Web Page Semantic Structures

Authors:
Junlan Feng; Patrick Haffner; Mazin Gilbert
Affiliations:
AT&T LABS RESEARCH; AT&T LABS RESEARCH; AT&T LABS RESEARCH
Venue:
ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition
Year:
2005

Citing 6

Support-Vector Networks
Machine Learning

BoosTexter: A Boosting-based System for Text Categorization
Machine Learning - Special issue on information retrieval

Improving pseudo-relevance feedback in web information retrieval using web page segmentation
WWW '03 Proceedings of the 12th international conference on World Wide Web

Detecting web page structure for adaptive viewing on small form factor devices
WWW '03 Proceedings of the 12th international conference on World Wide Web

Automatic Discovery of Semantic Structures in HTML Documents
ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 1

Datarover: a taxonomy based crawler for automated data extraction from data-intensive websites
WIDM '03 Proceedings of the 5th ACM international workshop on Web information and data management

Cited 1

Identifying Semantic Constructs in Web Documents to Improve Web Site Accessibility
WISE '08 Proceedings of the 2008 international workshops on Web Information Systems Engineering

Abstract

This paper proposes a learning approach for discovering the semantic structure of web pages. The task consists of partitioning the text on a web page into information blocks and identifying the semantic category of each block. We employ two machine learning techniques, AdaBoost and SVMs, to learn from a labeled corpus of web pages. We evaluated our approach on general web pages from the World Wide Web and obtained encouraging results. This work can benefit a number of web-driven applications, such as search engines, web-based question answering, web-based data mining, and voice-enabled web navigation.
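
The two steps named in the abstract, partitioning page text into blocks and assigning each block a category, can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration, not the authors' method: the block-level tag heuristic, the two features (word count and link-word ratio), the toy page and its hand-assigned labels, and all function names are hypothetical. The classifier is a small discrete AdaBoost over threshold stumps written with the standard library only; the SVM variant is not shown.

```python
from html.parser import HTMLParser
import math

# Assumed heuristic: these tags start or end an information block.
BLOCK_TAGS = {"div", "p", "td", "li", "ul", "table", "h1", "h2", "h3", "form"}

class BlockSplitter(HTMLParser):
    """Collect page text into blocks split at block-level tag boundaries."""
    def __init__(self):
        super().__init__()
        self.blocks = []          # finished blocks: (text, words_inside_links)
        self._words = []
        self._link_words = 0
        self._in_link = False

    def flush(self):
        if self._words:
            self.blocks.append((" ".join(self._words), self._link_words))
        self._words, self._link_words = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link = True

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)

def partition(html):
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()                # emit any trailing text as a last block
    return parser.blocks

def features(block):
    """Hypothetical features: [word count, fraction of words inside links]."""
    text, link_words = block
    n = len(text.split())         # n >= 1 because empty blocks are never stored
    return [n, link_words / n]

def train_adaboost(X, y, rounds=5):
    """Discrete AdaBoost over threshold stumps; labels in y are +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        best = None
        for f in range(len(X[0])):
            for thr in sorted({x[f] for x in X}):
                for sign in (1, -1):
                    pred = [sign if x[f] > thr else -sign for x in X]
                    err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                    if best is None or err < best[0]:
                        best = (err, f, thr, sign, pred)
        err, f, thr, sign, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)   # avoid log(0) on perfect stumps
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, f, thr, sign))
        # Re-weight: misclassified examples gain weight, then normalize.
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, pred)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def predict(model, x):
    score = sum(a * (s if x[f] > t else -s) for a, f, t, s in model)
    return 1 if score >= 0 else -1

# Toy page and hand-assigned labels (+1 = navigation, -1 = body text).
page = """<div><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About</a></div>
<p>The storm reached the coast on Monday and caused flooding in several towns.</p>
<div><a href="/terms">Terms</a> <a href="/privacy">Privacy</a></div>
<p>Officials said that power should be restored by the end of the week.</p>"""
blocks = partition(page)
X = [features(b) for b in blocks]
y = [1, -1, 1, -1]
model = train_adaboost(X, y)
```

Calling `predict(model, features(block))` on a new block returns +1 or -1 under the toy labeling above. A realistic system would need a labeled corpus, richer features, and more than two categories, none of which this sketch attempts.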