Refinement of TF-IDF schemes for web pages using their hyperlinked neighboring pages

Authors:
Kazunari Sugiyama;Kenji Hatano;Masatoshi Yoshikawa;Shunsuke Uemura
Affiliations:
Nara Institute of Science and Technology, Nara, Japan;Nara Institute of Science and Technology, Nara, Japan;Nagoya University, Aichi, Japan;Nara Institute of Science and Technology, Nara, Japan
Venue:
Proceedings of the fourteenth ACM conference on Hypertext and hypermedia
Year:
2003

Abstract

In information retrieval (IR) systems based on the vector space model, the TF-IDF scheme is widely used to characterize documents. However, for documents with hyperlink structure, such as Web pages, a technique is needed that represents the contents of a page more accurately by exploiting the contents of its hyperlinked neighboring pages. In this paper, we first propose several approaches to refining the TF-IDF scheme for a target Web page using the contents of its hyperlinked neighboring pages, and then compare the retrieval accuracy of these approaches. Experimental results show that, in general, more accurate feature vectors for a target Web page are obtained when using the contents of its hyperlinked neighboring pages up to the second level in the backward direction from the target page.
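
The general idea of refining a target page's TF-IDF vector with its backward neighbors can be illustrated with a minimal sketch. The abstract does not give the weighting formulas, so everything specific here is an assumption made for illustration and not the authors' method: "backward direction" is read as following in-links, neighbor vectors are simply added to the target vector, and each level is damped by a hypothetical `decay` factor. The function names (`tf_idf_vectors`, `backward_neighbors`, `refine`) and the toy pages are likewise hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Plain TF-IDF: docs maps page id -> token list; returns page id -> {term: weight}."""
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # document frequency: count each term once per page
    vectors = {}
    for page, tokens in docs.items():
        tf = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty pages
        vectors[page] = {t: (c / total) * math.log(n / df[t]) for t, c in tf.items()}
    return vectors


def backward_neighbors(in_links, target, max_level):
    """Breadth-first walk over in-links; returns page id -> level (1..max_level)."""
    levels = {}
    seen = {target}
    frontier = [target]
    for level in range(1, max_level + 1):
        next_frontier = []
        for page in frontier:
            for source in in_links.get(page, ()):
                if source not in seen:
                    seen.add(source)
                    levels[source] = level
                    next_frontier.append(source)
        frontier = next_frontier
    return levels


def refine(vectors, in_links, target, max_level=2, decay=0.5):
    """Add damped neighbor vectors to the target vector (assumed combination rule)."""
    refined = dict(vectors[target])
    for page, level in backward_neighbors(in_links, target, max_level).items():
        weight = decay ** level  # hypothetical damping: farther pages count less
        for term, value in vectors.get(page, {}).items():
            refined[term] = refined.get(term, 0.0) + weight * value
    return refined


# Toy collection: "a" links to "target", "b" links to "a", "c" is unconnected.
docs = {
    "target": ["tfidf", "retrieval"],
    "a": ["retrieval", "web"],
    "b": ["web", "hyperlink"],
    "c": ["cooking"],
}
in_links = {"target": ["a"], "a": ["b"]}

vectors = tf_idf_vectors(docs)
refined = refine(vectors, in_links, "target", max_level=2)
print(sorted(refined))
```

With `max_level=2`, the refined vector for `target` picks up terms from both `a` (level 1) and `b` (level 2), while the unconnected page `c` contributes nothing. Setting `max_level=1` restricts the refinement to direct in-linking pages, which makes it easy to compare neighborhood depths in the spirit of the experiments the abstract describes.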