We present a term weighting approach for improving web page classification, based on the assumption that the images on a web page are the elements that mainly attract the user's attention. This assumption implies that the text contained in the visual block in which an image is located, called an image-block, should carry significant information about the page contents. In this paper we propose a new metric, called the Inverse Term Importance Metric, which assigns higher weights to important terms contained in important image-blocks identified through visual layout analysis. We propose different methods for estimating the importance of visual image-blocks and use these estimates to smooth each term's weight according to the importance of the blocks in which the term occurs. The traditional TFxIDF model is modified accordingly and used in the classification task. The effectiveness of the new metric and of the proposed block evaluation methods has been validated using different classification algorithms.
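The general idea of letting block importance modulate TFxIDF can be illustrated with a minimal sketch. The abstract does not give the formula of the Inverse Term Importance Metric, so the code below is not that metric: it assumes, purely for illustration, that each term occurrence contributes the importance score of its block to the term frequency, and that a standard logarithmic IDF is applied afterwards. The function name `block_weighted_tfidf`, the input layout, and the importance values are all hypothetical.

```python
import math
from collections import defaultdict


def block_weighted_tfidf(docs):
    """Illustrative block-aware TFxIDF (an assumption, not the paper's metric).

    docs: list of documents; each document is a list of
          (block_importance, tokens) pairs.
    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)

    # Document frequency: number of documents containing each term.
    df = defaultdict(int)
    for blocks in docs:
        for term in {t for _, tokens in blocks for t in tokens}:
            df[term] += 1

    weights = []
    for blocks in docs:
        # Importance-weighted term frequency: each occurrence counts
        # as much as the importance of the block it appears in.
        wtf = defaultdict(float)
        for importance, tokens in blocks:
            for term in tokens:
                wtf[term] += importance
        weights.append(
            {t: f * math.log(n_docs / df[t]) for t, f in wtf.items()}
        )
    return weights


# Hypothetical pages: one high-importance image-block and one
# low-importance navigation block each.
docs = [
    [(1.0, ["camera", "lens"]), (0.2, ["login", "camera"])],
    [(1.0, ["recipe", "pasta"]), (0.2, ["login"])],
]
weights = block_weighted_tfidf(docs)
```

In this toy input, "login" appears in every page and only in low-importance blocks, so its weight drops to zero, while "camera" gains weight from appearing in the image-block. The resulting dictionaries could be vectorized and passed to any standard classifier.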