Creating a billion-scale searchable web archive

Authors:
Daniel Gomes;Miguel Costa;David Cruz;João Miranda;Simão Fontes
Affiliations:
Foundation for National Scientific Computing, Lisbon, Portugal (all authors)
Venue:
Proceedings of the 22nd international conference on World Wide Web companion
Year:
2013


Abstract

Web information is ephemeral. Several organizations around the world are struggling to archive information from the web before it vanishes. At the same time, users demand efficient and effective search mechanisms to access the already vast collections of historical information held by web archives. The Portuguese Web Archive is the largest full-text searchable web archive publicly available. It supports search over 1.2 billion files archived from the web since 1996. This study presents an overview of the lessons learned while developing the Portuguese Web Archive, focusing on web data acquisition, ranking of search results, and user interface design. The developed software is freely available as an open source project. We believe that sharing the experience gained while developing and operating a running service will enable other organizations to start or improve their web archives.