Topic tracking based on bilingual comparable corpora and semisupervised clustering

Authors:
Fumiyo Fukumoto; Yoshimi Suzuki
Affiliations:
University of Yamanashi, Kofu, Japan; University of Yamanashi, Kofu, Japan
Venue:
ACM Transactions on Asian Language Information Processing (TALIP)
Year:
2007

References (14):

Combining labeled and unlabeled data with co-training
    COLT '98 Proceedings of the eleventh annual conference on Computational learning theory

On-line new event detection and tracking
    Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval

Improving text categorization methods for event tracking
    SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval

Unsupervised and supervised clustering for topic tracking
    Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval

Topic Detection and Tracking: Event-Based Information Organization
    Topic Detection and Tracking: Event-Based Information Organization

Termight: Coordinating Humans and Machines in Bilingual Terminology Acquisition
    Machine Translation

Constrained K-means Clustering with Background Knowledge
    ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning

Semi-supervised Clustering by Seeding
    ICML '02 Proceedings of the Nineteenth International Conference on Machine Learning

X-means: Extending K-means with Efficient Estimation of the Number of Clusters
    ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning

A month to topic detection and tracking in Hindi
    ACM Transactions on Asian Language Information Processing (TALIP)

Machine translation vs. dictionary term translation: a comparison for English-Japanese news article alignment
    COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 1

An empirical study of smoothing techniques for language modeling
    ACL '96 Proceedings of the 34th annual meeting on Association for Computational Linguistics

Language-specific models in multilingual topic tracking
    Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval

Effect of cross-language IR in bilingual lexicon acquisition from comparable corpora
    EACL '03 Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics - Volume 1

Cited by (2):

Relation Discovery from Thai News Articles Using Association Rule Mining
    PAISI '09 Proceedings of the Pacific Asia Workshop on Intelligence and Security Informatics

A News Analysis and Tracking System
    PReMI '09 Proceedings of the 3rd International Conference on Pattern Recognition and Machine Intelligence

Abstract

In this paper, we address the problem of skewed data in topic tracking, namely that far fewer stories are labeled positive than negative, and propose a method for estimating effective training stories for the topic-tracking task. To compensate for the small number of labeled positive stories, we use bilingual comparable corpora (English and Japanese) together with the EDR bilingual dictionary, and extract story pairs consisting of positive stories and their associated stories. To cope with the large number of labeled negative stories, we classify them into clusters using a semisupervised clustering algorithm that combines k-means with EM. The method was tested on the TDT English corpus, and the results showed that the system works well when the tracked topic concerns an event originating in the source-language country, even with a small number of initial positive training stories.
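
The abstract does not spell out how k-means and EM are combined. As an illustration only, the sketch below shows seeded k-means, a generic semisupervised variant in which labeled examples initialise the centroids and the usual k-means loop is read as hard EM (assign, then re-estimate). The function names, the `seeds` input format, and the use of Euclidean distance are assumptions made for this example and are not taken from the paper.

```python
import math


def nearest(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))


def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def seeded_kmeans(points, seeds, max_iter=100):
    """Cluster points, starting from one centroid per labeled seed group.

    points: list of feature vectors (lists of floats)
    seeds:  dict mapping a cluster id to a list of labeled vectors
            (hypothetical input format chosen for this sketch)
    """
    ids = sorted(seeds)
    centroids = [mean(seeds[c]) for c in ids]
    assignment = None
    for _ in range(max_iter):
        # E-step (hard): assign each point to its nearest centroid.
        new_assignment = [nearest(p, centroids) for p in points]
        if new_assignment == assignment:
            break  # assignments are stable, so the centroids are too
        assignment = new_assignment
        # M-step: recompute each centroid; keep the old one if a cluster is empty.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = mean(members)
    return [ids[a] for a in assignment], centroids


# Toy usage: two well-separated groups, one seed vector per cluster.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
seeds = {"a": [[0.0, 0.5]], "b": [[10.0, 10.5]]}
labels, centroids = seeded_kmeans(points, seeds)
```

In a tracking setting the vectors would presumably be term-weight representations of stories, but that mapping is likewise an assumption here.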