Effectively mining wikipedia for clustering multilingual documents

Authors:
N. Kiran Kumar;G. S. K. Santosh;Vasudeva Varma
Affiliations:
International Institute of Information Technology, Hyderabad, India;International Institute of Information Technology, Hyderabad, India;International Institute of Information Technology, Hyderabad, India
Venue:
NLDB'11 Proceedings of the 16th international conference on Natural language processing and information systems
Year:
2011

Citing 4
Cited 1

Dictionary-Based Cross-Language Information Retrieval: Problems, Methods, and Research Findings

Information Retrieval
Cross-Lingual Document Similarity Calculation Using the Multilingual Thesaurus EUROVOC

CICLing '02 Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing
Exploiting Wikipedia as external knowledge for document clustering

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Multilingual document clustering using wikipedia as external knowledge

IRFC'11 Proceedings of the Second international conference on Multidisciplinary information retrieval facility

Editorial: Minimally-supervised learning of domain-specific causal relations using an open-domain corpus as knowledge base

Data & Knowledge Engineering

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper presents Multilingual Document Clustering (MDC) usingWikipedia on comparable corpora. Particularly, we utilized the cross lingual links, category, outlinks, Infobox information present in Wikipedia to enrich the document representation. We have used Bisecting k-means algorithm for clustering multilingual documents based on the document similarities. Experiments are conducted based on the usage of English and Hindi Wikipedia. We have considered English and Hindi Datasets provided by FIRE'10 for Ad-hoc Cross-Lingual document retrieval task on Indian languages. No language specific tools are used, which makes the proposed approach easily extendable for other languages. The system is evaluated using F-score and Purity measures and the results obtained are encouraging.