Using bilingual comparable corpora and semi-supervised clustering for topic tracking

Authors:
Fumiyo Fukumoto;Yoshimi Suzuki
Affiliations:
Univ. of Yamanashi;Univ. of Yamanashi
Venue:
COLING-ACL '06 Proceedings of the COLING/ACL on Main conference poster sessions
Year:
2006

Citing 9
Cited 0

Improving text categorization methods for event tracking
SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval

Unsupervised and supervised clustering for topic tracking
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval

Termight: Coordinating Humans and Machines in Bilingual Terminology Acquisition
Machine Translation

Constrained K-means Clustering with Background Knowledge
ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning

Semi-supervised Clustering by Seeding
ICML '02 Proceedings of the Nineteenth International Conference on Machine Learning

X-means: Extending K-means with Efficient Estimation of the Number of Clusters
ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning

Machine translation vs. dictionary term translation: a comparison for English-Japanese news article alignment
COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 1

An empirical study of smoothing techniques for language modeling
ACL '96 Proceedings of the 34th annual meeting on Association for Computational Linguistics

Language-specific models in multilingual topic tracking
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval

Abstract

We address the problem of skewed data and propose a method for estimating effective training stories for the topic tracking task. Because only a small number of labelled positive stories are available, we extract story pairs, each consisting of a positive story and its associated stories, from bilingual comparable corpora. To handle the large number of labelled negative stories, we group them into clusters using k-means with EM. Results on the TDT corpora show the effectiveness of the method.
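
The clustering step for negative stories can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each story is already a fixed-length term-weight vector, uses squared Euclidean distance, and treats k-means as hard-assignment EM (the E-step assigns each story to its nearest centroid, the M-step recomputes the centroids). The function names, the toy vectors, and the choice of initial centroids are all hypothetical.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_hard_em(
    points: List[Sequence[float]],
    init_centroids: List[Sequence[float]],
    max_iter: int = 100,
) -> Tuple[List[int], List[List[float]]]:
    """Cluster points with k-means, viewed as EM with hard assignments."""
    centroids = [list(c) for c in init_centroids]
    labels: List[int] = []
    for _ in range(max_iter):
        # E-step (hard): assign every story to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # M-step: move each centroid to the mean of its assigned stories.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                dim = len(members[0])
                centroids[j] = [
                    sum(m[d] for m in members) / len(members)
                    for d in range(dim)
                ]
    return labels, centroids


# Toy "negative stories" as 3-term weight vectors (hypothetical data).
negative_stories = [
    [5.0, 0.0, 1.0],
    [4.0, 1.0, 0.0],
    [0.0, 5.0, 4.0],
    [1.0, 4.0, 5.0],
]
labels, centroids = kmeans_hard_em(
    negative_stories,
    init_centroids=[negative_stories[0], negative_stories[2]],
)
print(labels)     # cluster index per story
print(centroids)  # one centroid per cluster
```

In a tracking setting, each resulting cluster could stand in for one group of negative training stories, which is one way to read the abstract's motivation for clustering. How the paper selects the number of clusters and combines k-means with EM is not stated in the abstract.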