Identifying auxiliary web images using combination of analyses

Authors:
Tewson Seeoun;Choochart Haruechaiyasak;Toshiaki Kondo
Affiliations:
Sirindhorn International Institute of Technology, Pathumthani, Thailand;National Electronics and Computer Technology Center, Pathumthani, Thailand;Sirindhorn International Institute of Technology, Pathumthani, Thailand
Venue:
MM '09 Proceedings of the 17th ACM international conference on Multimedia
Year:
2009

Abstract

As the Web grows in popularity, Web sites are becoming richer in media. Besides text, the most common form of media is the image. A Web page can use images in various ways, such as to illustrate stories, to summarize data, and to decorate the page. As a result, a large number of images are embedded in Web pages. However, not all Web images are informative, i.e., included in the page to deliver useful information. Uninformative, or auxiliary, images include logos and banner advertisements. Classifying Web images as "informative" or "auxiliary" allows available resources to be used more efficiently: auxiliary images are insignificant and can be ignored in many tasks, including search engine indexing, to keep search results concise, and Web page printing, to reduce ink usage. This paper proposes a solution for the HP Multimedia Grand Challenge to identify informative multimedia content in Web pages. Our approach is based on a supervised machine learning model trained on a set of 32 features gathered from content analysis of images, Web page layout, and domain name. We use the Support Vector Machine (SVM) algorithm to train the classifier, and we optimize the model with a grid search that selects an appropriate set of kernel parameters. Evaluation with 10-fold cross-validation yielded a classification accuracy of 94.08%. The classification results are used to annotate each image; in the prototype implementation, each image is highlighted with a border whose color indicates its class.
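The training procedure described in the abstract (an SVM tuned by grid search and scored with 10-fold cross-validation) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes scikit-learn, an RBF kernel, feature standardization, and an illustrative parameter grid, none of which the abstract specifies. The data are synthetic stand-ins with 32 columns, not the paper's image, layout, and domain-name features, and the helper name `tune_svm` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def tune_svm(X, y, seed=0):
    """Grid-search SVM kernel parameters with 10-fold cross-validation."""
    # Assumption: an RBF kernel on standardized features; the abstract
    # only says "kernel parameters" without naming the kernel.
    pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

    # Assumption: a small exponential grid over C and gamma, chosen
    # purely for illustration.
    param_grid = {
        "svc__C": [2.0 ** k for k in range(-1, 6, 2)],
        "svc__gamma": [2.0 ** k for k in range(-7, 0, 2)],
    }

    # 10 stratified folds, so each fold keeps the class balance.
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
    search.fit(X, y)
    return search


# Synthetic stand-in data: 32 features per image, binary label
# (e.g. 1 = informative, 0 = auxiliary). Not the paper's dataset.
X, y = make_classification(
    n_samples=300, n_features=32, n_informative=10,
    class_sep=2.0, random_state=0,
)

search = tune_svm(X, y)
print("best parameters:", search.best_params_)
print("mean 10-fold accuracy:", round(search.best_score_, 4))
```

The accuracy printed here reflects only the synthetic data and says nothing about the 94.08% reported in the paper. In a real pipeline, `X` would hold the 32 extracted features per image and `y` the manually assigned informative/auxiliary labels.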