Towards understanding the functions of web element

Authors:
Xinyi Yin;Wee Sun Lee
Affiliations:
Department of Computer Science, National University of Singapore, Singapore;Department of Computer Science and Singapore-MIT Alliance, National University of Singapore, Singapore
Venue:
AIRS'04 Proceedings of the 2004 international conference on Asian Information Retrieval Technology
Year:
2004

Abstract

A web page is a collection of basic elements, and each element plays a different role in the page. For example, an image element can be part of the main content, an advertisement, or the banner of the site. This paper describes ongoing work using a machine learning approach to classify each element in a web page into six functional categories: Content (C), Related Link (R), Navigation (N), Advertisement (A), Form (F) and Other (O). This allows only certain categories of content in a web page to be extracted and delivered to a mobile device to fit the user's specific needs, or to facilitate web information processes such as web mining or mobile search. We manually labeled 18,864 elements from 150 websites. For each element we extracted both local features (such as the text length, URL, tag name, etc.) and global features (such as the text match with the other elements) to construct a feature vector. We trained the J48 decision tree learning algorithm on a training set of 10,650 elements; it achieved 82% accuracy under stratified cross-validation and an average F value of 0.78 across the six categories. Testing on 3,043 elements from pages not included in the training set gives a 58% accuracy rate. Although this is not satisfactory overall, the F value for the Content category reaches 0.795, indicating that the method could be useful for less demanding applications. We are working on improving the results in order to make automatic functional classification of web elements feasible and to provide new opportunities to push the state of the art in the mobile internet and mobile search.
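The feature construction and the per-category F value mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Element` type, the feature names, the use of Jaccard word overlap as the "text match" feature, and the host comparison are all assumptions made for illustration. The F value is taken here as the usual harmonic mean of precision and recall.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# The six functional categories named in the abstract.
CATEGORIES = ("C", "R", "N", "A", "F", "O")


@dataclass
class Element:
    """Hypothetical stand-in for one web page element."""
    tag: str
    text: str
    url: str = ""


def local_features(el, page_host):
    """Features computed from the element alone (assumed feature set)."""
    host = urlparse(el.url).netloc if el.url else ""
    return {
        "text_length": len(el.text),
        "tag": el.tag.lower(),
        "has_url": bool(el.url),
        "external_link": bool(host) and host != page_host,
    }


def global_features(el, others):
    """Features relating the element to the rest of the page.

    Assumption: "text match" is approximated by the highest Jaccard
    word overlap with any other element on the page.
    """
    words = set(el.text.lower().split())
    best = 0.0
    for other in others:
        other_words = set(other.text.lower().split())
        union = words | other_words
        if union:
            best = max(best, len(words & other_words) / len(union))
    return {"max_text_overlap": best}


def feature_vector(el, page_elements, page_host):
    """Combine local and global features into one dictionary."""
    others = [o for o in page_elements if o is not el]
    return {**local_features(el, page_host), **global_features(el, others)}


def f_value(gold, predicted, category):
    """Harmonic mean of precision and recall for a single category."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == category and p == category)
    fp = sum(1 for g, p in pairs if g != category and p == category)
    fn = sum(1 for g, p in pairs if g == category and p != category)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Feature dictionaries of this kind could then be passed to any decision tree learner; the abstract reports using J48, and averaging `f_value` over `CATEGORIES` would give a figure comparable in kind to the reported average F value.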