A Learning Approach to Discovering Web Page Semantic Structures

  • Authors:
  • Junlan Feng; Patrick Haffner; Mazin Gilbert

  • Affiliations:
  • AT&T Labs Research; AT&T Labs Research; AT&T Labs Research

  • Venue:
  • ICDAR '05 Proceedings of the Eighth International Conference on Document Analysis and Recognition
  • Year:
  • 2005

Abstract

This paper proposes a learning approach for discovering the semantic structure of web pages. The task involves partitioning the text on a web page into information blocks and identifying the semantic category of each block. We employed two machine learning techniques, AdaBoost and SVMs, to learn from a labeled web page corpus. We evaluated our approach on general web pages from the World Wide Web and obtained encouraging results. This work can benefit a number of web-driven applications, such as search engines, web-based question answering, web-based data mining, and voice-enabled web navigation.
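
To make the block-classification step concrete, the following is a minimal sketch of AdaBoost over decision stumps, applied to text blocks that have already been segmented. It is not the authors' implementation: the abstract does not state which features or categories the paper uses, so the two features (word count and link ratio), the two categories ("main content" and "navigation"), and the toy data are all assumptions chosen for illustration. The SVM variant and the partitioning step itself are not shown.

```python
import math

# Hypothetical toy data: each block is ((word_count, link_ratio), label).
# Labels are illustrative: +1 = "main content", -1 = "navigation".
BLOCKS = [
    ((120, 0.05), 1),
    ((200, 0.10), 1),
    ((80, 0.00), 1),
    ((6, 0.90), -1),
    ((10, 0.80), -1),
    ((4, 1.00), -1),
]

def stump_predict(stump, x):
    """A stump votes `polarity` when the chosen feature exceeds the threshold."""
    feature, threshold, polarity = stump
    return polarity if x[feature] > threshold else -polarity

def best_stump(data, weights):
    """Exhaustively pick the stump with the lowest weighted training error."""
    best, best_err = None, float("inf")
    for feature in range(len(data[0][0])):
        for threshold in sorted({x[feature] for x, _ in data}):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                err = sum(w for (x, y), w in zip(data, weights)
                          if stump_predict(stump, x) != y)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err

def train_adaboost(data, rounds=5):
    """Return a list of (alpha, stump) pairs learned by discrete AdaBoost."""
    weights = [1.0 / len(data)] * len(data)
    ensemble = []
    for _ in range(rounds):
        stump, err = best_stump(data, weights)
        # Clamp the error so the log below stays finite on separable data.
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Up-weight misclassified blocks, down-weight correct ones, renormalize.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for (x, y), w in zip(data, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def classify(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

if __name__ == "__main__":
    model = train_adaboost(BLOCKS)
    print(classify(model, (150, 0.02)))  # a long, link-poor block
    print(classify(model, (5, 0.95)))    # a short, link-heavy block
```

Decision stumps are used here only because they keep the sketch dependency-free. A realistic system would draw on richer layout and lexical features and would be evaluated on a labeled corpus, as the abstract describes.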