Extracting semantic structure of web documents using content and visual information

Authors:
Rupesh R. Mehta;Pabitra Mitra;Harish Karnick
Affiliations:
Indian Institute of Technology, Kanpur, India;Indian Institute of Technology, Kanpur, India;Indian Institute of Technology, Kanpur, India
Venue:
WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Year:
2005

Citing 3

Statistical Models for Text Segmentation
Machine Learning - Special issue on natural language learning

Machine Learning
Machine Learning

Extracting content structure for web pages based on visual representation
APWeb'03 Proceedings of the 5th Asia-Pacific web conference on Web technologies and applications

Cited 5

Enhancing web page classification through image-block importance analysis
Information Processing and Management: an International Journal

Automatic metadata extraction from museum specimen labels
DCMI '08 Proceedings of the 2008 International Conference on Dublin Core and Metadata Applications

Extracting the Latent Hierarchical Structure of Web Documents
Advanced Internet Based Systems and Applications

Extracting general lists from web documents: a hybrid approach
IEA/AIE'11 Proceedings of the 24th international conference on Industrial engineering and other applications of applied intelligent systems conference on Modern approaches in applied intelligence - Volume Part I

Towards a spatial instance learning method for deep web pages
ICDM'11 Proceedings of the 11th international conference on Advances in data mining: applications and theoretical aspects

Abstract

This work presents a page segmentation algorithm that uses both visual and content information to extract the semantic structure of a web page. Visual information is obtained with the VIPS algorithm, and content information with a pre-trained Naive Bayes classifier. The output of the algorithm is a semantic structure tree whose leaves represent segments that each have a unique topic. The contents of a leaf segment may, however, be physically distributed across the web page. This structure can be useful in many web applications, such as information retrieval, information extraction, and automatic web page adaptation. Because it uses both content and visual information, the algorithm is expected to outperform existing page segmentation algorithms.
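
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: the visual blocks are taken as given (the VIPS step is not implemented), a tiny hand-written multinomial Naive Bayes with add-one smoothing stands in for the pre-trained classifier, the training sentences and block texts are invented, and the tree is flattened to a single level. The names `TinyNaiveBayes` and `build_semantic_tree` are hypothetical.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing.

    Hypothetical stand-in for the pre-trained classifier in the abstract.
    """

    def __init__(self, labelled_docs):
        # labelled_docs: list of (text, topic) pairs (invented training data).
        self.word_counts = defaultdict(Counter)
        self.doc_counts = Counter()
        for text, topic in labelled_docs:
            self.doc_counts[topic] += 1
            self.word_counts[topic].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}

    def predict(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_topic, best_score = None, -math.inf
        for topic in sorted(self.doc_counts):
            counts = self.word_counts[topic]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each known word.
            score = math.log(self.doc_counts[topic] / total_docs)
            for w in words:
                if w in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_topic, best_score = topic, score
        return best_topic


def build_semantic_tree(visual_blocks, classifier):
    """Group visual blocks by predicted topic.

    visual_blocks: list of (block_id, text) pairs, assumed to come from a
    visual segmenter such as VIPS (not implemented here).
    Returns {"root": {topic: [block_id, ...]}}; each topic is one leaf.
    """
    leaves = defaultdict(list)
    for block_id, text in visual_blocks:
        leaves[classifier.predict(text)].append(block_id)
    return {"root": dict(leaves)}


training = [
    ("football match score goal team", "sports"),
    ("league player season win", "sports"),
    ("stock market shares price bank", "finance"),
    ("interest rate economy bank loan", "finance"),
]
# Blocks in page order; b1 and b3 are not adjacent on the page.
blocks = [
    ("b1", "team wins the league match"),
    ("b2", "bank raises interest rate"),
    ("b3", "player scores a late goal"),
]
tree = build_semantic_tree(blocks, TinyNaiveBayes(training))
print(tree)
```

In this toy run the blocks `b1` and `b3` end up in the same leaf although `b2` lies between them on the page, which mirrors the remark that the contents of a leaf segment may be physically distributed.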