Learning important models for web page blocks based on layout and content analysis

Authors:
Ruihua Song;Haifeng Liu;Ji-Rong Wen;Wei-Ying Ma
Affiliations:
Microsoft Research Asia, Beijing, P.R. China;University of Toronto, Toronto, ON, Canada;Microsoft Research Asia, Beijing, P.R. China;Microsoft Research Asia, Beijing, P.R. China
Venue:
ACM SIGKDD Explorations Newsletter
Year:
2004

Abstract

Previous work shows that a web page can be partitioned into multiple segments, or blocks, and that these blocks are often not equally important. It has also been shown that separating noisy and unimportant blocks from the rest of a page can facilitate web mining, search, and accessibility. However, no uniform approach or model has been presented for measuring the importance of the different blocks in a web page. Through a user study, we found that people do have a consistent view of the importance of blocks in a web page. We therefore investigate how to find a model that automatically assigns importance values to blocks. We formulate block importance estimation as a learning problem. First, we use a vision-based page segmentation technique to partition a web page into semantic blocks with a hierarchical structure. Next, spatial features (such as position and size) and content features (such as the number of images and links) are extracted to construct a feature vector for each block. Finally, learning algorithms are used to train a model that assigns an importance value to each block. In our experiments, the best model achieves a Micro-F1 of 80.2% and a Micro-Accuracy of 86.8%.
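
The feature construction and the two reported metrics can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Block` fields, the choice and normalization of features, and the function names `block_features` and `micro_scores` are assumptions made for illustration, and the metrics follow the common definition in which true/false positives and negatives are pooled over all importance classes before the scores are computed.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Block:
    """A rectangular page block. All field names are illustrative assumptions."""
    x: float          # left edge, in pixels
    y: float          # top edge, in pixels
    width: float
    height: float
    num_images: int
    num_links: int


def block_features(block: Block, page_width: float, page_height: float) -> List[float]:
    """Build one feature vector from spatial and content features of a block."""
    # Spatial features: block centre and area, normalized by the page size.
    center_x = (block.x + block.width / 2) / page_width
    center_y = (block.y + block.height / 2) / page_height
    rel_area = (block.width * block.height) / (page_width * page_height)
    # Content features: raw counts of images and links in the block.
    return [center_x, center_y, rel_area,
            float(block.num_images), float(block.num_links)]


def micro_scores(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (micro-F1, micro-accuracy) from counts pooled over all classes."""
    classes = sorted(set(y_true) | set(y_pred))
    tp = fp = fn = tn = 0
    # Treat each class as a one-vs-rest problem and pool the four counts.
    for c in classes:
        for t, p in zip(y_true, y_pred):
            if t == c and p == c:
                tp += 1
            elif t != c and p == c:
                fp += 1
            elif t == c and p != c:
                fn += 1
            else:
                tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return f1, accuracy
```

A learned model would take vectors such as those returned by `block_features` as input and predict an importance level per block; `micro_scores` then compares predicted levels against human labels. Because true negatives are pooled as well, micro-accuracy under this definition is never lower than micro-F1 when each block receives exactly one label.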