A language-independent approach to identify the named entities in under-resourced languages and clustering multilingual documents

Authors:
N. Kiran Kumar;G. S. K. Santosh;Vasudeva Varma
Affiliations:
International Institute of Information Technology, Hyderabad, India;International Institute of Information Technology, Hyderabad, India;International Institute of Information Technology, Hyderabad, India
Venue:
CLEF'11 Proceedings of the Second international conference on Multilingual and multimodal information access evaluation
Year:
2011

Abstract

This paper presents a language-independent Multilingual Document Clustering (MDC) approach on comparable corpora. Named entities (NEs) such as persons, locations, and organizations play a major role in measuring document similarity. We propose a method to identify the NEs present in under-resourced Indian languages (Hindi and Marathi) using the NEs present in English, which is a high-resource language. The identified NEs are then used to form multilingual document clusters with the Bisecting k-means clustering algorithm. We do not use any non-English linguistic tools or resources such as WordNet, a Part-Of-Speech tagger, or bilingual dictionaries, which makes the proposed approach completely language-independent. Experiments are conducted on a standard dataset provided by FIRE for its 2010 Ad-hoc Cross-Lingual document retrieval task on Indian languages. We consider the English, Hindi and Marathi news datasets for our experiments. The system is evaluated using the F-score, Purity and Normalized Mutual Information measures, and the results obtained are encouraging.
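As an illustration of two of the evaluation measures named in the abstract, the following is a minimal sketch of Purity and Normalized Mutual Information for a flat clustering. It is not the authors' evaluation code. The function names `purity` and `nmi` are hypothetical, and normalizing the mutual information by the arithmetic mean of the two entropies is an assumption, since the abstract does not say which normalization was used.

```python
import math
from collections import Counter


def purity(clusters, labels):
    """Fraction of documents that belong to the majority class of their cluster."""
    by_cluster = {}
    for c, y in zip(clusters, labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(labels)


def nmi(clusters, labels):
    """Mutual information divided by the mean of the cluster and class entropies.

    The arithmetic-mean normalization is an assumption; other variants exist.
    """
    n = len(labels)
    joint = Counter(zip(clusters, labels))
    c_counts = Counter(clusters)
    y_counts = Counter(labels)

    # I(C; Y) = sum over (c, y) of p(c, y) * log(p(c, y) / (p(c) * p(y)))
    mi = sum(
        (n_cy / n) * math.log(n * n_cy / (c_counts[c] * y_counts[y]))
        for (c, y), n_cy in joint.items()
    )

    def entropy(counts):
        return -sum((v / n) * math.log(v / n) for v in counts.values())

    denom = (entropy(c_counts) + entropy(y_counts)) / 2
    # One cluster and one class: both entropies are zero, treat as perfect agreement.
    return mi / denom if denom > 0 else 1.0
```

Here `clusters` holds the cluster id assigned to each document and `labels` holds its gold-standard class, in the same order. Both measures reach 1 when the clustering reproduces the classes exactly, and this NMI variant is 0 when cluster membership is statistically independent of class.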