Web pages are usually highly structured documents. In some pages, content with different functions is laid out in blocks, some of which merely support the main discourse. In others, there may be several blocks of unrelated main content. Indexing a web page as if it were a linear document can cause problems because of the diverse nature of its content: if the retrieval function treats all blocks equally, without attention to structure, it may produce irrelevant query matches. In this paper, we describe how the content quality of different blocks of a web page can be used to improve a retrieval function. Our method segments a web page into semantically coherent blocks and learns a predictor of segment content quality. We also describe how to use the segment content quality estimates as weights in the BM25F formulation. Experimental results show that our method improves the relevance of retrieved results by as much as 4.5% compared to a BM25F baseline that treats the body of a web page as a single section, and by a larger margin of over 9% for difficult queries.
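The idea of using segment quality estimates as weights can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a simplified BM25F-style scorer in which each segment's term counts are scaled by a quality weight before the usual BM25 saturation is applied. The function names, the parameter defaults (`k1`, `b`), the toy IDF values, and the quality weights are all hypothetical, and in practice the weights would come from a learned predictor rather than being set by hand.

```python
from collections import Counter


def weighted_tf(segments, weights):
    """Combine per-segment term counts into one weighted pseudo-frequency table."""
    tf = Counter()
    for tokens, w in zip(segments, weights):
        for term, count in Counter(tokens).items():
            tf[term] += w * count
    return tf


def bm25f_score(query, segments, weights, idf, avg_len, k1=1.2, b=0.75):
    """Score one page, treating each segment as a weighted field (assumed form)."""
    tf = weighted_tf(segments, weights)
    # Weighted page length, so low-quality segments also count less here.
    doc_len = sum(w * len(tokens) for tokens, w in zip(segments, weights))
    norm = 1.0 - b + b * doc_len / avg_len
    score = 0.0
    for term in query:
        f = tf.get(term, 0.0)
        score += idf.get(term, 0.0) * f / (k1 * norm + f)
    return score


# Toy page: a main-content segment and a navigation/ad segment.
segments = [
    ["camera", "review", "camera", "lens"],
    ["cheap", "flights", "login"],
]
idf = {"camera": 1.5, "flights": 2.0}  # hypothetical IDF values
avg_len = 7.0                          # hypothetical average page length

flat = bm25f_score(["flights"], segments, [1.0, 1.0], idf, avg_len)
weighted = bm25f_score(["flights"], segments, [1.0, 0.1], idf, avg_len)
print(flat > weighted)
```

With uniform weights, the query "flights" matches the page through its navigation segment alone. Down-weighting that segment reduces the score for the spurious match while leaving matches in the main content largely intact, which is the effect the abstract attributes to segment-aware weighting.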