Multilingual document clustering: an heuristic approach based on cognate named entities

Authors:
Soto Montalvo;Raquel Martínez;Arantza Casillas;Víctor Fresno
Affiliations:
GAVAB Group, URJC;NLP&IR Group, UNED;UPV-EHU;GAVAB Group, URJC
Venue:
ACL-44 Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics
Year:
2006


Abstract

This paper presents an approach to Multilingual Document Clustering in comparable corpora. The algorithm is heuristic in nature, and the only evidence it uses for clustering is the identification of cognate named entities between the two sides of the comparable corpus. One of the main advantages of this approach is that it does not depend on bilingual or multilingual resources; it does, however, depend on being able to identify cognate named entities between the languages used in the corpus. An additional advantage is that the number of clusters need not be known in advance: the algorithm calculates it. We have tested this approach on a comparable corpus of news written in English and Spanish, and we have compared the results with those of a system that translates selected document features. The results obtained are encouraging.
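
The abstract does not say how cognates are identified or which heuristic groups the documents, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that named entities have already been extracted per document. It treats two entities as cognates when an accent-insensitive string similarity (Python's `difflib` ratio) reaches an assumed threshold, and links two documents in different languages when they share an assumed minimum number of cognates. Clusters are taken to be the connected components of those links, which is why no cluster count has to be supplied. The names `are_cognates`, `shared_cognates`, `cluster`, `threshold`, and `min_shared` are all hypothetical.

```python
from difflib import SequenceMatcher
import unicodedata


def normalize(name):
    """Lowercase and strip accents so 'Bogotá' and 'Bogota' compare equal."""
    decomposed = unicodedata.normalize("NFD", name.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def are_cognates(a, b, threshold=0.8):
    """Assumed cognate test: string similarity at or above a chosen threshold."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def shared_cognates(entities_a, entities_b, threshold=0.8):
    """Count entities of document A that have a cognate in document B."""
    return sum(
        any(are_cognates(a, b, threshold) for b in entities_b) for a in entities_a
    )


def cluster(docs, min_shared=2, threshold=0.8):
    """Group documents linked by at least `min_shared` cognate named entities.

    `docs` maps a document id to (language, set of named entities).
    Only cross-language pairs are compared. Clusters are the connected
    components of the link graph, so their number is not fixed in advance.
    """
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ids = list(docs)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            lang_a, ents_a = docs[a]
            lang_b, ents_b = docs[b]
            if lang_a != lang_b and shared_cognates(ents_a, ents_b, threshold) >= min_shared:
                parent[find(a)] = find(b)

    groups = {}
    for doc_id in ids:
        groups.setdefault(find(doc_id), set()).add(doc_id)
    return sorted(groups.values(), key=lambda g: sorted(g))


if __name__ == "__main__":
    # Toy English/Spanish documents with made-up entity sets.
    example = {
        "en1": ("en", {"Kofi Annan", "Brazil"}),
        "es1": ("es", {"Kofi Annan", "Brasil"}),
        "en2": ("en", {"Microsoft"}),
    }
    print(cluster(example))
```

A plain similarity ratio is a crude stand-in for cognate detection: it misses pairs such as "London" and "Londres" at this threshold, and both `threshold` and `min_shared` would need tuning on real data.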