Mirror, mirror on the Web: a study of host pairs with replicated content
WWW '99 Proceedings of the eighth international conference on World Wide Web
Finding replicated Web collections
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications netowrking
On the Resemblance and Containment of Documents
SEQUENCES '97 Proceedings of the Compression and Complexity of Sequences 1997
What's new on the web?: the evolution of the web from a search engine perspective
Proceedings of the 13th international conference on World Wide Web
Impact of search engines on page popularity
Proceedings of the 13th international conference on World Wide Web
Shuffling a stacked deck: the case for partially randomized ranking of search engine results
VLDB '05 Proceedings of the 31st international conference on Very large data bases
What's really new on the web?: identifying new pages from a series of unstable web snapshots
Proceedings of the 15th international conference on World Wide Web
Dynamics of the Chilean web structure
Computer Networks: The International Journal of Computer and Telecommunications Networking - Web dynamics
Characterization of national Web domains
ACM Transactions on Internet Technology (TOIT)
Efficient search in large textual collections with redundancy
Proceedings of the 16th international conference on World Wide Web
Proceedings of the 16th international conference on World Wide Web
A model for fast web mining prototyping
Proceedings of the Second ACM International Conference on Web Search and Data Mining
Proceedings of the Second ACM International Conference on Web Search and Data Mining
Detecting the origin of text segments efficiently
Proceedings of the 18th international conference on World wide web
Understanding content reuse on the web: static and dynamic analyses
WebKDD'06 Proceedings of the 8th Knowledge discovery on the web international conference on Advances in web mining and web usage analysis
Proceedings of the 21st ACM conference on Hypertext and hypermedia
Evaluating methods to rediscover missing web pages from the web infrastructure
Proceedings of the 10th annual joint conference on Digital libraries
Chapter 2: next generation web search
Search Computing
Detecting quilted web pages at scale
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
TPDL'12 Proceedings of the Second international conference on Theory and Practice of Digital Libraries
Hi-index | 0.00 |
This paper presents an extensive study about the evolution of textual content on the Web, which shows how some new pages are created from scratch while others are created using already existing content. We show that a significant fraction of the Web is a byproduct of the latter case. We introduce the concept of Web genealogical tree, in which every page in a Web snapshot is classified into a component. We study in detail these components, characterizing the copies and identifying the relation between a source of content and a search engine, by comparing page relevance measures, documents returned by real queries performed in the past, and click-through data. We observe that sources of copies are more frequently returned by queries and more clicked than other documents.