Genealogical trees on the web: a search engine user perspective

Authors:
Ricardo Baeza-Yates; Álvaro Pereira; Nivio Ziviani
Affiliations:
Yahoo! Research, Barcelona, Spain; Federal University of Minas Gerais, Belo Horizonte, Brazil; Federal University of Minas Gerais, Belo Horizonte, Brazil
Venue:
Proceedings of the 17th international conference on World Wide Web
Year:
2008

Citing 12 (works this paper cites; first list below)
Cited 9 (works that cite this paper; second list below)

Mirror, mirror on the Web: a study of host pairs with replicated content

WWW '99 Proceedings of the eighth international conference on World Wide Web
Finding replicated Web collections

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Graph structure in the Web

Proceedings of the 9th international World Wide Web conference on Computer networks: the international journal of computer and telecommunications networking
On the Resemblance and Containment of Documents

SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
What's new on the web?: the evolution of the web from a search engine perspective

Proceedings of the 13th international conference on World Wide Web
Impact of search engines on page popularity

Proceedings of the 13th international conference on World Wide Web
Shuffling a stacked deck: the case for partially randomized ranking of search engine results

VLDB '05 Proceedings of the 31st international conference on Very large data bases
What's really new on the web?: identifying new pages from a series of unstable web snapshots

Proceedings of the 15th international conference on World Wide Web
Dynamics of the Chilean web structure

Computer Networks: The International Journal of Computer and Telecommunications Networking - Web dynamics
Characterization of national Web domains

ACM Transactions on Internet Technology (TOIT)
Efficient search in large textual collections with redundancy

Proceedings of the 16th international conference on World Wide Web
Random web crawls

Proceedings of the 16th international conference on World Wide Web

A model for fast web mining prototyping

Proceedings of the Second ACM International Conference on Web Search and Data Mining
Finding text reuse on the web

Proceedings of the Second ACM International Conference on Web Search and Data Mining
Detecting the origin of text segments efficiently

Proceedings of the 18th international conference on World Wide Web
Understanding content reuse on the web: static and dynamic analyses

WebKDD'06 Proceedings of the 8th Knowledge discovery on the web international conference on Advances in web mining and web usage analysis
Is this a good title?

Proceedings of the 21st ACM conference on Hypertext and hypermedia
Evaluating methods to rediscover missing web pages from the web infrastructure

Proceedings of the 10th annual joint conference on Digital libraries
Chapter 2: next generation web search

Search Computing
Detecting quilted web pages at scale

SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Identifying "soft 404" error pages: analyzing the lexical signatures of documents in distributed collections

TPDL'12 Proceedings of the Second international conference on Theory and Practice of Digital Libraries

Abstract

This paper presents an extensive study of the evolution of textual content on the Web, showing how some new pages are created from scratch while others are created from already existing content. We show that a significant fraction of the Web is a byproduct of the latter case. We introduce the concept of a Web genealogical tree, in which every page in a Web snapshot is classified into a component. We study these components in detail, characterizing the copies and identifying the relation between a source of content and a search engine by comparing page relevance measures, documents returned by real queries performed in the past, and click-through data. We observe that sources of copies are returned by queries more frequently, and clicked more often, than other documents.
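
The distinction drawn in the abstract, between pages created from scratch and pages built from existing content, can be illustrated with a small sketch. This is not the authors' method: the abstract does not describe their detection technique or name the components of the genealogical tree. The sketch assumes word-level shingles and a containment score, and the function names, the labels "reuses" and "from_scratch", the shingle size k and the threshold are all hypothetical choices made for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (word k-grams) in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def containment(new_text, old_text, k=3):
    """Fraction of the new text's shingles that also occur in the old text."""
    new_sh = shingles(new_text, k)
    if not new_sh:
        return 0.0
    return len(new_sh & shingles(old_text, k)) / len(new_sh)


def classify_new_pages(old_snapshot, new_snapshot, threshold=0.5, k=3):
    """Label pages that appear only in the newer snapshot.

    Snapshots are dicts mapping URL -> page text (an assumed representation).
    A page is labeled ("reuses", source_url) when its best containment score
    against any older page reaches the threshold, else ("from_scratch", None).
    The labels and the threshold are hypothetical, not taken from the paper.
    """
    labels = {}
    for url, text in new_snapshot.items():
        if url in old_snapshot:
            continue  # page already existed; only new pages are labeled
        best_url, best_score = None, 0.0
        for old_url, old_text in old_snapshot.items():
            score = containment(text, old_text, k)
            if score > best_score:
                best_url, best_score = old_url, score
        if best_score >= threshold:
            labels[url] = ("reuses", best_url)
        else:
            labels[url] = ("from_scratch", None)
    return labels


old = {"a": "the quick brown fox jumps over the lazy dog"}
new = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the fence",
    "c": "completely different words appear here today",
}
labels = classify_new_pages(old, new)
```

The pairwise comparison is quadratic in the number of pages, so a study at Web scale would need an index or fingerprinting scheme in place of the inner loop; the sketch only shows the classification idea.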