Locality sensitive hashing for scalable structural classification and clustering of web documents

  • Authors:
  • Christian Hachenberg;Thomas Gottron

  • Affiliations:
  • University of Koblenz-Landau, Koblenz, Germany;University of Koblenz-Landau, Koblenz, Germany

  • Venue:
  • Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
  • Year:
  • 2013

Quantified Score

Hi-index 0.00

Visualization

Abstract

Web content management systems as well as web front ends to databases usually use mechanisms based on homogeneous templates for generating and populating HTML documents containing structured, semi-structured or plain text data. Wrapper based information extraction techniques leverage such templates as an essential cornerstone of their functionality but rely heavily on the availability of proper training documents based on the specific template. Thus, structural classification and structural clustering of web documents is an important contributing factor to the success of those methods. We introduce a novel technique to support these two tasks: template fingerprints. Template fingerprints are locality sensitive hash values in the form of short sequences of characters which effectively represent the underlying template of a web document. Small changes in the document structure, as they may occur in template based documents, lead to no or only minor variations in the corresponding fingerprint. Based on the fingerprints we introduce a scalable index structure and algorithm for large collections of web documents, which can retrieve structurally similar documents efficiently. The effectiveness of our approach is empirically validated in a classification task on a data set of 13,237 documents based on 50 templates from different domains. The general efficiency and scalability is evaluated in a clustering task on a data set retrieved from the Open Directory Project comprising more than 3.6 million web documents. For both tasks, our template fingerprint approach provides results of high quality and demonstrates a linear runtime of O(n) w.r.t. the number of documents.