Applying the Site Information to the Information Retrieval from the Web

Authors:
Yasuhito Asano;Hiroshi Imai;Masashi Toyoda;Masaru Kitsuregawa
Venue:
WISE '02 Proceedings of the 3rd International Conference on Web Information Systems Engineering
Year:
2002

Citing 0
Cited 3

Exploiting the hierarchical structure for link analysis
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval

Web-site boundary detection
ICDM'10 Proceedings of the 10th industrial conference on Advances in data mining: applications and theoretical aspects

Mining communities on the web using a max-flow and a site-oriented framework
WISE'05 Proceedings of the 6th international conference on Web Information Systems Engineering

Abstract

In recent years, several information retrieval methods that use information about Web links have been developed, such as HITS and Trawling. To analyze Web links by dividing them into links inside each Web site (local-links) and links between Web sites (global-links), a proper model of the Web site, a phrase used ambiguously in daily life, is required. Existing research uses a Web server as the model of a Web site. This idea works relatively well when a Web site corresponds to a server, as with public Web sites, but works poorly when multiple Web sites correspond to one server, as with private Web sites on rental Web servers. In this paper, we propose a new model of the Web site, the "directory-based site", to handle typical private sites, and a method to identify such sites using information about URLs and Web links. Our computational experiments use jp-domain URL and Web-link data containing over 23 million URLs and 100 million Web links, collected from July to August 2000 by Toyoda and Kitsuregawa. They verify that the method can approximately determine, for about 66% of over 110 thousand servers, whether each server has multiple directory-based sites, and that it extracts over 500 thousand directory-based sites and 4 million global-links. We also propose a new framework for Web-link-based information retrieval that uses directory-based sites and global-links instead of Web pages and the whole set of Web links, respectively, and we examine the effectiveness of our framework by comparing the result of Trawling on our framework with the result on the existing framework.
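
The distinction between local-links and global-links under the two site models can be illustrated with a minimal Python sketch. It is not the identification method of the paper, which the abstract does not describe. It assumes that the hosts carrying multiple sites are already known and supplied by hand, and that the first directory of the path identifies a site on such hosts. The function names, the `multi_site_hosts` parameter, and the example URLs are all hypothetical.

```python
from urllib.parse import urlsplit


def server_site(url):
    """Server-based model: the host name alone identifies the site."""
    return urlsplit(url).hostname or ""


def directory_site(url, multi_site_hosts):
    """Simplified directory-based model (an assumption, not the paper's method).

    On hosts listed in multi_site_hosts, the first directory of the path
    identifies the site; on all other hosts the host name is used.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host in multi_site_hosts:
        # Drop the leading empty string and the trailing file name.
        directories = parts.path.split("/")[1:-1]
        if directories:
            return host + "/" + directories[0]
    return host


def classify_link(source, target, site_of):
    """Return 'local' if both URLs map to the same site, else 'global'."""
    return "local" if site_of(source) == site_of(target) else "global"


# Hypothetical rental server hosting two private sites.
rental_hosts = {"rental.example.jp"}
page_a = "http://rental.example.jp/~alice/index.html"
page_b = "http://rental.example.jp/~bob/links.html"

by_server = classify_link(page_a, page_b, server_site)
by_directory = classify_link(
    page_a, page_b, lambda u: directory_site(u, rental_hosts)
)
print(by_server, by_directory)
```

Under the server-based model the link between the two pages counts as local, because both pages share a host. Under the simplified directory-based model it counts as global, which is the kind of link the proposed framework retains.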