What's really new on the web?: identifying new pages from a series of unstable web snapshots

Authors:
Masashi Toyoda;Masaru Kitsuregawa
Affiliations:
University of Tokyo, Tokyo, JAPAN;University of Tokyo, Tokyo, JAPAN
Venue:
Proceedings of the 15th international conference on World Wide Web
Year:
2006

Citing 21
Cited 10

Syntactic clustering of the Web

Selected papers from the sixth international conference on World Wide Web
The anatomy of a large-scale hypertextual Web search engine

WWW7 Proceedings of the seventh international conference on World Wide Web 7
The connectivity server: fast access to linkage information on the Web

WWW7 Proceedings of the seventh international conference on World Wide Web 7
Trawling the Web for emerging cyber-communities

WWW '99 Proceedings of the eighth international conference on World Wide Web
Authoritative sources in a hyperlinked environment

Proceedings of the ninth annual ACM-SIAM symposium on Discrete algorithms
How dynamic is the Web?

Proceedings of the 9th international World Wide Web conference on Computer networks: the international journal of computer and telecommunications networking
Graph structure in the Web

Proceedings of the 9th international World Wide Web conference on Computer networks: the international journal of computer and telecommunications networking
Breadth-first crawling yields high-quality pages

Proceedings of the 10th international conference on World Wide Web
Creating a Web community chart for navigating related communities

Proceedings of the 12th ACM conference on Hypertext and Hypermedia
Extracting Large-Scale Knowledge Bases from the Web

VLDB '99 Proceedings of the 25th International Conference on Very Large Data Bases
The Evolution of the Web and Implications for an Incremental Crawler

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Who Links to Whom: Mining Linkage between Web Sites

ICDM '01 Proceedings of the 2001 IEEE International Conference on Data Mining
On the bursty evolution of blogspace

WWW '03 Proceedings of the 12th international conference on World Wide Web
A large-scale study of the evolution of web pages

WWW '03 Proceedings of the 12th international conference on World Wide Web
Stochastic models for the Web graph

FOCS '00 Proceedings of the 41st Annual Symposium on Foundations of Computer Science
Extracting evolution of web communities from a series of web archives

Proceedings of the fourteenth ACM conference on Hypertext and hypermedia
What's new on the web?: the evolution of the web from a search engine perspective

Proceedings of the 13th international conference on World Wide Web
Ranking the web frontier

Proceedings of the 13th international conference on World Wide Web
Sic transit gloria telae: towards an understanding of the web's decay

Proceedings of the 13th international conference on World Wide Web
Information diffusion through blogspace

Proceedings of the 13th international conference on World Wide Web
Trend detection through temporal link analysis

Journal of the American Society for Information Science and Technology - Special issue: Webometrics

Detecting age of page content

Proceedings of the 9th annual ACM international workshop on Web information and data management
Genealogical trees on the web: a search engine user perspective

Proceedings of the 17th international conference on World Wide Web
Estimating the Change of Web Pages

ICCS '07 Proceedings of the 7th international conference on Computational Science, Part III: ICCS 2007
A three-year study on the freshness of web search engine databases

Journal of Information Science
A method for measuring the evolution of a topic on the Web: The case of “informetrics”

Journal of the American Society for Information Science and Technology
Challenge for info-plosion

DS'07 Proceedings of the 10th international conference on Discovery science
Socio-sense: a system for analysing the societal behavior from long term web archive

APWeb'08 Proceedings of the 10th Asia-Pacific web conference on Progress in WWW research and development
Calculating content recency based on timestamped and non-timestamped sources for supporting page quality estimation

Proceedings of the 2011 ACM Symposium on Applied Computing
Fires on the web: towards efficient exploring historical web graphs

DASFAA'10 Proceedings of the 15th international conference on Database Systems for Advanced Applications - Volume Part I
Noise robust detection of the emergence and spread of topics on the web

Proceedings of the 2nd Temporal Web Analytics Workshop

Abstract

Identifying and tracking new information on the Web is important in sociology, marketing, and survey research, since new trends might be apparent in the new information. Such changes can be observed by crawling the Web periodically. In practice, however, it is impossible to crawl the entire expanding Web repeatedly. This means that the novelty of a page remains unknown, even if that page did not exist in previous snapshots. In this paper, we propose a novelty measure for estimating the certainty that a newly crawled page appeared between the previous and current crawls. Using this novelty measure, new pages can be extracted from a series of unstable snapshots for further analysis and mining to identify new trends on the Web. We evaluated the precision, recall, and miss rate of the novelty measure using our Japanese web archive, and applied it to a Web archive search engine.
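The uncertainty described in the abstract can be illustrated with a naive baseline: treat every URL that is present in the current snapshot but absent from all earlier ones as a candidate new page. The sketch below is not the authors' novelty measure, which the abstract does not specify. The function name `candidate_new_urls` and the toy URLs are hypothetical, chosen only for illustration.

```python
from typing import Iterable, Set


def candidate_new_urls(previous_snapshots: Iterable[Set[str]],
                       current_snapshot: Set[str]) -> Set[str]:
    """Return URLs in the current crawl that no earlier crawl contains.

    This is only a naive set difference (an assumption for illustration),
    not the novelty measure proposed in the paper.
    """
    seen_before: Set[str] = set()
    for snapshot in previous_snapshots:
        seen_before |= snapshot  # accumulate every URL crawled earlier
    return current_snapshot - seen_before


# Toy snapshots with hypothetical URLs.
crawl_previous = {"a.example/", "b.example/"}
crawl_current = {"a.example/", "c.example/", "d.example/"}

candidates = candidate_new_urls([crawl_previous], crawl_current)
print(sorted(candidates))
```

Because no crawl covers the whole Web, a URL in `candidates` may simply have been missed by the earlier crawl rather than created since then. The set difference therefore yields only candidates, and a novelty measure such as the one proposed here is needed to estimate how certain it is that each candidate is actually new.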