Identifying "soft 404" error pages: analyzing the lexical signatures of documents in distributed collections

  • Authors:
  • Luis Meneses; Richard Furuta; Frank Shipman

  • Affiliations:
  • Center for the Study of Digital Libraries and Department of Computer Science and Engineering, Texas A&M University, College Station, TX; Center for the Study of Digital Libraries and Department of Computer Science and Engineering, Texas A&M University, College Station, TX; Center for the Study of Digital Libraries and Department of Computer Science and Engineering, Texas A&M University, College Station, TX

  • Venue:
  • TPDL'12 Proceedings of the Second international conference on Theory and Practice of Digital Libraries
  • Year:
  • 2012

Abstract

Collections of Web-based resources are often decentralized, leaving the task of identifying and locating removed resources to collection managers, who must rely on HTTP response codes. When a resource is no longer available, the server is supposed to return a 404 error code. In practice, to be friendlier to human readers, many servers instead respond with a 200 OK code and indicate in the text of the response that the document is no longer available. In the reported study, 3.41% of servers respond in this manner. To help collection managers identify these "friendly" or "soft" 404s, we developed two methods that use a Naïve Bayes classifier based on known valid responses and known 404 responses. The classifier was able to predict soft 404 pages with a precision of 99% and a recall of 92%. We will also elaborate on the results obtained from our study and detail the lessons learned.
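
The abstract does not describe the two methods in detail, so the following is only a minimal sketch of the general idea, not the authors' implementation: a multinomial Naïve Bayes text classifier with add-one smoothing, trained on page text labeled as a known valid response or a known 404 response, plus a helper for the precision and recall measures mentioned above. The class name, function names, label strings, and the tiny training texts are all hypothetical and chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocabulary = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocabulary.update(tokens)

    def log_scores(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, counts in self.word_counts.items():
            # Log prior: share of training documents carrying this label.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(counts.values())
            for token in tokens:
                # Smoothed log likelihood of the token given the label.
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


def precision_recall(predicted, actual, positive):
    """Return (precision, recall) for the label treated as positive."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == positive and a == positive)
    fp = sum(1 for p, a in pairs if p == positive and a != positive)
    fn = sum(1 for p, a in pairs if p != positive and a == positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy training data; a real collection would supply page text
# from responses already known to be valid or known to be 404s.
KNOWN_404_TEXTS = [
    "404 not found the page you requested could not be found",
    "sorry this page does not exist or has been removed",
    "error the requested document is no longer available",
]
KNOWN_VALID_TEXTS = [
    "welcome to the digital library collection of historical maps",
    "research papers and publications from the computer science department",
    "browse our catalog of books journals and archives",
]

classifier = NaiveBayesTextClassifier()
for page_text in KNOWN_404_TEXTS:
    classifier.train(page_text, "soft404")
for page_text in KNOWN_VALID_TEXTS:
    classifier.train(page_text, "valid")
```

In use, a collection manager would pass the body of a response that came back with 200 OK to `classifier.predict(...)` and review any page labeled `"soft404"`. Add-one smoothing is assumed here so that words unseen in one class do not drive its probability to zero; the study itself may have used different features or smoothing.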