Language-independent named entity identification using Wikipedia

  • Authors:
  • Mahathi Bhagavatula;Santosh GSK;Vasudeva Varma

  • Affiliations:
  • Search and Information Extraction Lab, IIIT Hyderabad;Search and Information Extraction Lab, IIIT Hyderabad;Search and Information Extraction Lab, IIIT Hyderabad

  • Venue:
  • MM '12 Proceedings of the First Workshop on Multilingual Modeling
  • Year:
  • 2012

Abstract

Recognizing Named Entities (NEs) is difficult in Indian languages such as Hindi and Telugu, for which gazetteers and annotated corpora are scarce compared to English. This paper details a novel clustering and co-occurrence based approach that maps English NEs to their equivalent representations in other languages, which are recognized in a language-independent way. We replace the language-specific resources normally required with the richly structured multilingual content of Wikipedia. The approach first clusters highly similar Wikipedia articles. The NEs in an English article are then mapped to terms in the interlinked articles of other languages based on co-occurrence frequencies. Both the cluster information and the term co-occurrences are used to extract NEs from the non-English languages. In this way, the English Wikipedia is used to bootstrap NEs for other languages. The approach makes extensive use of Wikipedia's structured, semi-structured, and multilingual content. Experimental results suggest that the proposed approach yields promising precision and recall.
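
The co-occurrence mapping step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `map_entities`, the input shape (one pair of term lists per interlinked article pair), and the use of a Dice coefficient to normalize raw co-occurrence counts are all assumptions made for illustration, and the clustering step is omitted. The toy data uses romanized stand-ins rather than real Hindi text.

```python
from collections import Counter, defaultdict


def map_entities(article_pairs):
    """Map each English NE to the other-language term it co-occurs with most.

    article_pairs: iterable of (english_nes, other_terms), one pair per
    English article and its interlinked article (hypothetical input shape).
    """
    cooc = defaultdict(Counter)  # cooc[ne][term] = number of article pairs shared
    en_df = Counter()            # number of article pairs containing each English NE
    other_df = Counter()         # number of article pairs containing each other-language term

    for english_nes, other_terms in article_pairs:
        en_set, other_set = set(english_nes), set(other_terms)
        en_df.update(en_set)
        other_df.update(other_set)
        for ne in en_set:
            cooc[ne].update(other_set)

    mapping = {}
    for ne, counts in cooc.items():
        if not counts:
            continue

        # Assumption of this sketch: a Dice coefficient damps terms that
        # appear in almost every article (for example, function words).
        def dice(term, ne=ne, counts=counts):
            return 2 * counts[term] / (en_df[ne] + other_df[term])

        # Sorting first makes tie-breaking deterministic.
        mapping[ne] = max(sorted(counts), key=dice)
    return mapping


# Toy data: "dilli", "bharat", "mumbai_hi", and "hai" are romanized stand-ins.
pairs = [
    (["Delhi", "India"], ["dilli", "bharat", "hai"]),
    (["Mumbai", "India"], ["mumbai_hi", "bharat", "hai"]),
    (["Delhi"], ["dilli", "hai"]),
]
mapping = map_entities(pairs)
print(mapping)
```

On this toy input, each English entity is paired with its own stand-in term, while the ubiquitous token "hai" is never selected because the normalization penalizes its high document frequency.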