Cross-dataset clustering: revealing corresponding themes across multiple corpora

  • Authors: Ido Dagan; Zvika Marx; Eli Shamir

  • Affiliations: Bar-Ilan University, Ramat-Gan, Israel; Bar-Ilan University, Ramat-Gan, Israel; The Hebrew University, Jerusalem, Israel

  • Venue: COLING-02 proceedings of the 6th conference on Natural language learning - Volume 20
  • Year: 2002

Abstract

We present a method for identifying corresponding themes across several corpora that focus on related but distinct domains. We approach this task by simultaneously clustering keyword sets extracted from the analyzed corpora. Our algorithm extends the information-bottleneck soft clustering method to a setting that comprises several datasets. Experiments with topical corpora reveal similar aspects of three distinct religions. We evaluate the results by comparing them with clusters constructed manually by an expert.
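
For readers unfamiliar with the starting point of the method, the following is a minimal sketch of the standard single-dataset information-bottleneck soft clustering iteration. It is not the cross-dataset extension described in the abstract, whose details are not given here. The function names, the parameter `beta`, and the toy keyword/feature distributions are all assumptions made for illustration.

```python
import math
import random


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def ib_soft_clustering(p_x, p_y_given_x, n_clusters, beta, n_iter=100, seed=0):
    """Standard iterative information-bottleneck updates (illustrative sketch).

    p_x          -- prior over elements x (e.g. keywords), list of floats
    p_y_given_x  -- one feature distribution p(y|x) per element, list of lists
    beta         -- trade-off parameter; larger values give harder assignments
    Returns the soft assignments p(c|x) as a list of rows, one per element.
    """
    rng = random.Random(seed)
    n_x = len(p_x)
    n_y = len(p_y_given_x[0])

    # Random soft initialisation of p(c|x); each row is normalised to sum to 1.
    p_c_given_x = []
    for _ in range(n_x):
        row = [rng.random() + 1e-3 for _ in range(n_clusters)]
        total = sum(row)
        p_c_given_x.append([v / total for v in row])

    for _ in range(n_iter):
        # Cluster prior: p(c) = sum_x p(x) p(c|x)
        p_c = [sum(p_x[x] * p_c_given_x[x][c] for x in range(n_x))
               for c in range(n_clusters)]

        # Cluster centroids: p(y|c) = sum_x p(x) p(c|x) p(y|x) / p(c)
        p_y_given_c = []
        for c in range(n_clusters):
            denom = max(p_c[c], 1e-12)
            p_y_given_c.append([
                sum(p_x[x] * p_c_given_x[x][c] * p_y_given_x[x][y]
                    for x in range(n_x)) / denom
                for y in range(n_y)
            ])

        # Assignments: p(c|x) proportional to p(c) * exp(-beta * KL[p(y|x) || p(y|c)])
        updated = []
        for x in range(n_x):
            log_w = [math.log(max(p_c[c], 1e-300))
                     - beta * kl_divergence(p_y_given_x[x], p_y_given_c[c])
                     for c in range(n_clusters)]
            m = max(log_w)  # subtract the max for numerical stability
            w = [math.exp(v - m) for v in log_w]
            total = sum(w)
            updated.append([v / total for v in w])
        p_c_given_x = updated

    return p_c_given_x


# Toy example (made-up data): four keywords, two co-occurrence features.
toy_p_x = [0.25, 0.25, 0.25, 0.25]
toy_p_y_given_x = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]
assignments = ib_soft_clustering(toy_p_x, toy_p_y_given_x, n_clusters=2, beta=10.0)
```

Each row of `assignments` is a probability distribution over clusters for one keyword. In a multi-corpus setting, one would presumably need additional machinery so that clusters capture themes shared across corpora rather than corpus-specific vocabulary; that part is outside the scope of this sketch.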