Building a research library for the history of the web

Authors:
William Y. Arms;Selcuk Aya;Pavel Dmitriev;Blazej J. Kot;Ruth Mitchell;Lucia Walle
Affiliations:
Cornell University, Ithaca, NY;Cornell University, Ithaca, NY;Cornell University, Ithaca, NY;Cornell University, Ithaca, NY;Cornell University, Ithaca, NY;Cornell University, Ithaca, NY
Venue:
Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries
Year:
2006

Abstract

This paper describes the building of a research library for studying the Web, especially research on how the structure and content of the Web change over time. The library is particularly aimed at supporting social scientists for whom the Web is both a fascinating social phenomenon and a mirror on society. The library is built on the collections of the Internet Archive, which has been preserving a crawl of the Web every two months since 1996. The technical challenges in organizing this data for research fall into two categories: high-performance computing to transfer and manage the very large amounts of data, and human-computer interfaces that empower research by non-computer specialists.