A content based approach for discovering missing anchor text for web search

Authors:
Xing Yi;James Allan
Affiliations:
Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts, Amherst, Amherst, MA, USA;Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts, Amherst, Amherst, MA, USA
Venue:
Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
Year:
2010

Citing 17
Cited 2

A language modeling approach to information retrieval

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Graph structure in the Web

Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications netowrking
Relevance based language models

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Cumulated gain-based evaluation of IR techniques

ACM Transactions on Information Systems (TOIS)
Combining document representations for known-item search

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
An Overview of the INQUERY System as Used for the TIPSTER Project

An Overview of the INQUERY System as Used for the TIPSTER Project
Relevant query feedback in statistical language modeling

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Cluster-based retrieval using language models

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Corpus structure, language models, and ad hoc information retrieval

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Learning to rank using gradient descent

ICML '05 Proceedings of the 22nd international conference on Machine learning
Respect my authority!: HITS without hyperlinks, utilizing cluster-based language models

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Language model information retrieval with document expansion

HLT-NAACL '06 Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics
Modeling anchor text and classifying queries to enhance web document retrieval

Proceedings of the 17th international conference on World Wide Web
A general optimization framework for smoothing language models on graph structures

Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Introduction to Information Retrieval

Introduction to Information Retrieval
Mining term association patterns from search logs for effective query reformulation

Proceedings of the 17th ACM conference on Information and knowledge management
Building enriched document representations using aggregated anchor text

Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval

Discovering missing click-through query language information for web search

Proceedings of the 20th ACM international conference on Information and knowledge management
Incorporating social anchors for ad hoc retrieval

Proceedings of the 10th Conference on Open Research Areas in Information Retrieval

Quantified Score

Hi-index	0.00

Visualization

Abstract

Although anchor text provides very useful information for web search, a large portion of web pages have few or no incoming hyperlinks (anchors), which is known as the anchor text sparsity problem. In this paper, we propose a language modeling based technique for overcoming anchor text sparsity by discovering a web page's plausible missing anchor text from its similar web pages' in-link anchor text. We design experiments with two publicly available TREC web corpora (GOV2 and ClueWeb09) to evaluate different approaches for discovering missing anchor text. Experimental results show that our approach can effectively discover plausible missing anchor terms. We then use the web named page finding task in the TREC Terabyte track to explore the utility of missing anchor text information discovered by our approach for helping retrieval. Experimental results show that our approach can statistically significantly improve retrieval performance, compared with several approaches that only use anchor text aggregated over the web graph.