Applying the Site Information to the Information Retrieval from the Web

  • Authors:
  • Yasuhito Asano; Hiroshi Imai; Masashi Toyoda; Masaru Kitsuregawa

  • Venue:
  • WISE '02 Proceedings of the 3rd International Conference on Web Information Systems Engineering
  • Year:
  • 2002

Abstract

In recent years, several information retrieval methods that use information about Web-links, such as HITS and Trawling, have been developed. To analyze Web-links for information retrieval by dividing them into links inside each Web site (local-links) and links between Web sites (global-links), a proper model of the Web site is required, because "Web site" is a phrase used ambiguously in daily life. In existing research, a Web server is used as the model of a Web site. This idea works relatively well when a Web site corresponds to a server, as with public Web sites, but works poorly when multiple Web sites correspond to one server, as with private Web sites on rental Web servers. In this paper, we propose a new model of the Web site, the "directory-based site", to handle typical private sites, and a method to identify such sites using information about URLs and Web-links. Through computational experiments, we verify that the method can approximately determine, for about 66% of over 110 thousand servers, whether each server has multiple directory-based sites, and that it extracts over 500 thousand directory-based sites and 4 million global-links. The experiments use jp-domain URL and Web-link data containing over 23 million URLs and 100 million Web-links, collected from July to August 2000 by Toyoda and Kitsuregawa. We also propose a new framework for Web-link-based information retrieval that uses directory-based sites and global-links instead of Web pages and the whole set of Web-links, respectively, and we examine the effectiveness of our framework by comparing the result of Trawling on our framework with the result on the existing framework.
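
The difference between the two site models can be illustrated with a minimal Python sketch. It is not the identification method of the paper, which the abstract does not spell out. It assumes that the set of servers hosting multiple sites is already known and that each site on such a server corresponds to the first directory of the URL path. The names server_site, directory_site, classify_link and the host rental.example.jp are hypothetical.

```python
from urllib.parse import urlsplit


def server_site(url):
    """Existing model: the site of a page is the Web server (host) in its URL."""
    return urlsplit(url).netloc.lower()


def directory_site(url, multi_site_hosts):
    """Illustrative directory-based model (an assumption, not the paper's method).

    On hosts assumed to hold several sites, the site is the host plus the
    first directory of the path; on every other host it is the host itself.
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host not in multi_site_hosts:
        return host
    segments = [s for s in parts.path.split("/") if s]
    # Unless the path ends with "/", its last segment is a file name, not a directory.
    directories = segments if parts.path.endswith("/") else segments[:-1]
    return host + "/" + directories[0] if directories else host


def classify_link(source, target, site_of):
    """Label a link local if both ends fall in the same site, else global."""
    return "local" if site_of(source) == site_of(target) else "global"


# Hypothetical rental server hosting two private sites.
multi = {"rental.example.jp"}
a = "http://rental.example.jp/~alice/index.html"
b = "http://rental.example.jp/~bob/links.html"

by_server = classify_link(a, b, server_site)
by_directory = classify_link(a, b, lambda u: directory_site(u, multi))
print(by_server, by_directory)
```

Under the server model the link from a to b is treated as a local-link, while under the directory-based sketch it becomes a global-link between two private sites, which is the kind of link the proposed framework keeps for retrieval.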