Creating a billion-scale searchable web archive

  • Authors:
  • Daniel Gomes; Miguel Costa; David Cruz; João Miranda; Simão Fontes

  • Affiliations:
  • Foundation for National Scientific Computing, Lisbon, Portugal; Foundation for National Scientific Computing, Lisbon, Portugal; Foundation for National Scientific Computing, Lisbon, Portugal; Foundation for National Scientific Computing, Lisbon, Portugal; Foundation for National Scientific Computing, Lisbon, Portugal

  • Venue:
  • Proceedings of the 22nd international conference on World Wide Web companion
  • Year:
  • 2013

Abstract

Web information is ephemeral. Several organizations around the world are struggling to archive information from the web before it vanishes. However, users demand efficient and effective search mechanisms to access the already vast collections of historical information held by web archives. The Portuguese Web Archive is the largest full-text searchable web archive publicly available. It supports search over 1.2 billion files archived from the web since 1996. This study contributes an overview of the lessons learned while developing the Portuguese Web Archive, focusing on web data acquisition, search result ranking, and user interface design. The developed software is freely available as an open source project. We believe that sharing the experience we gained while developing and operating a running service will enable other organizations to start or improve their web archives.