A Clustering Framework to Build Focused Web Crawlers for Automatic Extraction of Cultural Information

  • Authors:
  • George E. Tsekouras; Damianos Gavalas; Stefanos Filios; Antonios D. Niros; George Bafaloukas

  • Affiliations:
  • Department of Cultural Technology and Communication, University of the Aegean, Lesvos, Greece 81100 (all authors)

  • Venue:
  • SETN '08 Proceedings of the 5th Hellenic conference on Artificial Intelligence: Theories, Models and Applications
  • Year:
  • 2008

Abstract

We present a novel focused crawling method for extracting and processing cultural data from the web in a fully automated fashion. After downloading the pages, we extract from each document a number of words for each thematic cultural area. We then build multidimensional document vectors from the most frequently occurring words and measure the dissimilarity between these vectors with the Hamming distance. In the last stage, we employ cluster analysis to partition the document vectors into a number of clusters. We illustrate the approach with a proof-of-concept application that examines hundreds of web pages spanning different cultural thematic areas.
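
The pipeline outlined in the abstract (word extraction, document vectors, Hamming distance, clustering) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the vectors are encoded or which clustering algorithm is used, so the binary word-presence encoding, the simple k-medoids loop, and all function names and sample documents here are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(docs, size):
    """Pick the `size` most frequent words across all documents."""
    counts = Counter(w for d in docs for w in tokenize(d))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:size]]


def to_vector(doc, vocab):
    """Assumed encoding: 1 if the vocabulary word occurs in the document, else 0."""
    present = set(tokenize(doc))
    return [1 if w in present else 0 for w in vocab]


def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def k_medoids(vectors, k, max_iter=20):
    """Simple k-medoids under Hamming distance (a stand-in for the paper's clustering step)."""
    medoids = list(range(k))  # naive initialisation: the first k documents
    labels = []
    for _ in range(max_iter):
        # Assign each vector to its nearest medoid.
        labels = [
            min(range(k), key=lambda c: hamming(v, vectors[medoids[c]]))
            for v in vectors
        ]
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster emptied out
                new_medoids.append(medoids[c])
                continue
            # The new medoid minimises total distance to the other members.
            best = min(
                members,
                key=lambda i: sum(hamming(vectors[i], vectors[j]) for j in members),
            )
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels, medoids


# Hypothetical sample documents from two cultural themes.
docs = [
    "ancient temple ruins temple archaeology",
    "folk music dance festival music",
    "archaeology excavation ancient ruins",
    "festival dance music concert",
]
vocab = build_vocabulary(docs, size=6)
vectors = [to_vector(d, vocab) for d in docs]
labels, medoids = k_medoids(vectors, k=2)
print(labels)
```

With these sample documents, the two archaeology pages end up in one cluster and the two music pages in the other. A real crawler would replace the hard-coded strings with downloaded page text and would likely need a more careful initialisation than taking the first k documents.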