Topic tracking based on bilingual comparable corpora and semisupervised clustering

  • Authors:
  • Fumiyo Fukumoto;Yoshimi Suzuki

  • Affiliations:
  • University of Yamanashi, Kofu, Japan;University of Yamanashi, Kofu, Japan

  • Venue:
  • ACM Transactions on Asian Language Information Processing (TALIP)
  • Year:
  • 2007

Quantified Score

Hi-index 0.00

Visualization

Abstract

In this paper, we address the problem of skewed data in topic tracking: the small number of stories labeled positive as compared to negative stories and propose a method for estimating effective training stories for the topic-tracking task. For a small number of labeled positive stories, we use bilingual comparable, i.e., English, and Japanese corpora, together with the EDR bilingual dictionary, and extract story pairs consisting of positive and associated stories. To overcome the problem of a large number of labeled negative stories, we classified them into clusters. This is done using a semisupervised clustering algorithm, combining k means with EM. The method was tested on the TDT English corpus and the results showed that the system works well when the topic under tracking is talking about an event originating in the source language country, even for a small number of initial positive training stories.