Double-pass clustering technique for multilingual document collections

Authors: Kazuaki Kishida
Affiliations: Keio University, Japan
Venue: Journal of Information Science
Year: 2011



Abstract

It is often necessary to automatically categorize multilingual document sets, which contain documents written in a variety of languages, into topically homogeneous subsets, for example when applying an automatic summarization system to multilingual news articles. However, there have been few studies on multilingual document clustering to date. In particular, it is not known whether clustering techniques are effective for medium- or large-scale multilingual document sets. For scalability, techniques should be based on dictionary-based translation and a single- or double-pass clustering algorithm. This article reports on experiments applying multilingual document clustering to medium-scale sets of English, French, German and Italian documents (Reuters news articles). The results show that the double-pass algorithm has a positive effect when each document is translated. In contrast, the cluster translation strategy, in which each language's document set is clustered separately and the resulting clusters are then translated, has almost no effect. Translation disambiguation techniques improve clustering effectiveness, but only slightly.
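
The abstract does not spell out the clustering procedure, so the following is only an illustrative sketch of the general idea behind single- and double-pass clustering, not the article's actual algorithm. It assumes a common leader-follower formulation: the first pass assigns each document to the most similar existing cluster or opens a new one, and the second pass reassigns every document to the nearest first-pass centroid. The function names (`cosine`, `single_pass`, `double_pass`), the cosine similarity measure, the raw term-frequency weighting, and the `threshold` parameter are all assumptions made for this example. Documents are assumed to be already tokenized and translated into one language.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass(docs, threshold):
    """One scan over the documents (hypothetical leader-follower variant).

    Each document joins the most similar existing cluster if the
    similarity reaches `threshold`; otherwise it starts a new cluster.
    Centroids are kept as summed term vectors, which is equivalent to
    the mean vector for cosine similarity.
    """
    centroids, labels = [], []
    for doc in docs:
        vec = Counter(doc)
        best, best_sim = -1, 0.0
        for i, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0 and best_sim >= threshold:
            centroids[best].update(vec)  # add term counts to the centroid
            labels.append(best)
        else:
            centroids.append(Counter(vec))
            labels.append(len(centroids) - 1)
    return labels, centroids


def double_pass(docs, threshold):
    """Second scan: reassign every document to its nearest first-pass centroid.

    A document with zero similarity to every centroid falls back to
    cluster 0 in this simplified sketch.
    """
    _, centroids = single_pass(docs, threshold)
    labels = []
    for doc in docs:
        vec = Counter(doc)
        sims = [cosine(vec, c) for c in centroids]
        labels.append(max(range(len(sims)), key=sims.__getitem__))
    return labels
```

The second pass matters because first-pass assignments depend on document order: early documents are compared against centroids that are still incomplete, so revisiting them against the final centroids can correct some of those early decisions at the cost of only one extra scan.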