Multilingual news document clustering: two algorithms based on cognate named entities
TSD'06 Proceedings of the 9th international conference on Text, Speech and Dialogue
This paper presents an approach to multilingual document clustering in comparable corpora. The algorithm is heuristic, and its sole evidence for clustering is the identification of cognate named entities across the two sides of the comparable corpus. One of the main advantages of this approach is that it does not depend on bilingual or multilingual resources. It does, however, depend on being able to identify cognate named entities between the languages used in the corpus. A further advantage is that the approach needs no prior information about the correct number of clusters; the algorithm determines it. We have tested this approach on a comparable corpus of news written in English and Spanish. In addition, we have compared the results with those of a system that translates selected document features. The results are encouraging.
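
The abstract does not spell out the heuristic itself, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that named entities have already been extracted per document, that two entities count as cognates when their spellings are sufficiently similar (here measured with Python's `difflib.SequenceMatcher` and an arbitrary threshold of 0.8), and that documents sharing at least one cognate entity belong to the same cluster. The function names `are_cognates`, `share_cognate`, and `cluster_by_cognates` are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def are_cognates(a: str, b: str, threshold: float = 0.8) -> bool:
    """Assumption: two named entities are cognates if their spellings are similar enough."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def share_cognate(ents_a, ents_b, threshold: float = 0.8) -> bool:
    """True if any entity of one document is a cognate of any entity of the other."""
    return any(are_cognates(x, y, threshold) for x in ents_a for y in ents_b)


def cluster_by_cognates(docs, threshold: float = 0.8):
    """Group documents that are linked, directly or transitively, by cognate entities.

    docs maps a document id to its list of named entities.
    The number of clusters is not given in advance; it falls out of the linking.
    """
    parent = {d: d for d in docs}

    def find(d):
        # Union-find lookup with path halving.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for a, b in combinations(docs, 2):
        if share_cognate(docs[a], docs[b], threshold):
            parent[find(a)] = find(b)

    clusters = {}
    for d in docs:
        clusters.setdefault(find(d), set()).add(d)
    return list(clusters.values())


# Illustrative, made-up input: two English and two Spanish news items.
docs = {
    "en1": ["Barcelona", "Ronaldinho"],
    "es1": ["Barcelona", "Ronaldinho"],
    "en2": ["Vladimir Putin", "Moscow"],
    "es2": ["Vladímir Putin", "Moscú"],
}
clusters = cluster_by_cognates(docs)
```

No bilingual dictionary or translation step is involved in this sketch: the only cross-lingual signal is spelling similarity between entity names, which is why such an approach relies on the languages sharing recognizable cognates. The number of clusters is likewise not supplied; it emerges from which documents end up linked.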