Cross-Lingual Document Similarity Calculation Using the Multilingual Thesaurus EUROVOC

Authors:
Ralf Steinberger;Bruno Pouliquen;Johan Hagman
Venue:
CICLing '02 Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing
Year:
2002

Abstract

We present an approach to calculating the semantic similarity of documents written in the same language or in different languages. Each document's content is represented in a language-independent way, using the descriptor terms of the multilingual thesaurus EUROVOC, and similarity is then calculated as the distance between these representations. Although EUROVOC is a carefully handcrafted knowledge structure, our procedure relies on statistical techniques. The method was applied to a collection of 5990 English and Spanish parallel texts and evaluated by counting how often the translation of a given document was identified as the document most similar to it. The results indicate that the approach is feasible and useful.
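
The procedure described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say which distance measure is used or how descriptors are assigned and weighted, so several things here are assumptions made only for illustration: cosine similarity stands in for the distance calculation, the descriptor identifiers (`D_FISHERIES`, `D_TRADE`, and so on) and their weights are made up, and the function names `cosine` and `top1_translation_rate` are hypothetical. Each document is a sparse vector of descriptor weights, and the evaluation counts how often a document's translation ranks first.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two sparse vectors (dict: descriptor id -> weight).

    Assumption: the abstract does not name the measure; cosine is a stand-in.
    """
    dot = sum(w * b[d] for d, w in a.items() if d in b)
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top1_translation_rate(source_docs, target_docs):
    """Fraction of source documents whose most similar target is their translation.

    Both arguments map a shared document id to a descriptor vector, so a
    document and its translation carry the same id.
    """
    if not source_docs:
        return 0.0
    hits = 0
    for doc_id, vec in source_docs.items():
        # Pick the target document with the highest similarity to this source.
        best = max(target_docs, key=lambda t: cosine(vec, target_docs[t]))
        if best == doc_id:
            hits += 1
    return hits / len(source_docs)


# Toy data: descriptor ids and weights are invented for this example.
english = {
    "doc1": {"D_FISHERIES": 0.9, "D_QUOTA": 0.4},
    "doc2": {"D_TRADE": 0.8, "D_TARIFF": 0.5},
}
spanish = {
    "doc1": {"D_FISHERIES": 0.7, "D_QUOTA": 0.6},
    "doc2": {"D_TRADE": 0.6, "D_TARIFF": 0.7, "D_QUOTA": 0.1},
}

rate = top1_translation_rate(english, spanish)
print(f"Translation ranked first for {rate:.0%} of documents")
```

Because both language versions are mapped onto the same descriptor identifiers, the comparison needs no translation step: the vectors share one feature space regardless of the language of the original text.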