Identifying primary content from web pages and its application to web search ranking

Authors:
Srinivas Vadrevu;Emre Velipasaoglu
Affiliations:
Yahoo! Labs, Sunnyvale, CA, USA;Yahoo! Labs, Sunnyvale, CA, USA
Venue:
Proceedings of the 20th international conference companion on World wide web
Year:
2011

Citing 5

Pivoted document length normalization. SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
Learning block importance models for web pages. Proceedings of the 13th international conference on World Wide Web
Simple BM25 extension to multiple weighted fields. Proceedings of the thirteenth ACM international conference on Information and knowledge management
Extracting content structure for web pages based on visual representation. APWeb'03 Proceedings of the 5th Asia-Pacific web conference on Web technologies and applications
Semantic partitioning of web pages. WISE'05 Proceedings of the 6th international conference on Web Information Systems Engineering

Cited 2

Lightweight automatic face annotation in media pages. Proceedings of the 21st international conference on World Wide Web
Ranking Tagged Resources Using Social Semantic Relevance. International Journal of Information Retrieval Research

Abstract

Web pages are usually highly structured documents. In some documents, content with different functionality is laid out in blocks, some merely supporting the main discourse. In other documents, there may be several blocks of unrelated main content. Indexing a web page as if it were a linear document can cause problems because of the diverse nature of its content. If the retrieval function treats all blocks of the web page equally without attention to structure, it may lead to irrelevant query matches. In this paper, we describe how content quality of different blocks of a web page can be utilized to improve a retrieval function. Our method is based on segmenting a web page into semantically coherent blocks and learning a predictor of segment content quality. We also describe how to use segment content quality estimates as weights in the BM25F formulation. Experimental results show our method improves relevance of retrieved results by as much as 4.5% compared to BM25F that treats the body of a web page as a single section, and by a larger margin of over 9% for difficult queries.
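
The idea of using segment quality estimates as weights in BM25F can be illustrated with a minimal sketch. The abstract does not give the exact formula used in the paper, so the code below assumes the simple BM25F form in which each segment is treated as a field: term frequencies and lengths are combined as weighted sums before the usual BM25 saturation is applied. The function name, its parameters, the IDF variant, and the default values of k1 and b are all illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter


def bm25f_segment_score(query_terms, segments, weights, doc_freq,
                        num_docs, avg_weighted_len, k1=1.2, b=0.75):
    """Score one page whose body is split into segments (hypothetical helper).

    segments: list of token lists, one per page segment.
    weights: one quality estimate per segment (assumed to lie in [0, 1]).
    doc_freq: mapping term -> number of documents containing the term.
    avg_weighted_len: average weighted page length over the collection.
    """
    weighted_tf = Counter()
    weighted_len = 0.0
    for tokens, w in zip(segments, weights):
        # Each segment contributes its length and term counts scaled by
        # its quality weight, as a field does in simple BM25F.
        weighted_len += w * len(tokens)
        for term, count in Counter(tokens).items():
            weighted_tf[term] += w * count

    # Length normalization shared by all query terms.
    norm = k1 * (1.0 - b + b * weighted_len / avg_weighted_len)

    score = 0.0
    for term in query_terms:
        tf = weighted_tf.get(term, 0.0)
        if tf == 0.0:
            continue
        df = doc_freq.get(term, 0)
        # A common non-negative IDF variant (an assumption of this sketch).
        idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
        score += idf * tf * (k1 + 1.0) / (tf + norm)
    return score
```

With all weights equal to 1.0 this reduces to scoring the body as a single section, which corresponds to the baseline described in the abstract. Giving a navigation or boilerplate segment a low weight reduces the contribution of query matches that occur only in that segment.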