Corpus-based Sinhala lexicon

  • Authors:
  • Ruvan Weerasinghe;Dulip Herath;Viraj Welgama

  • Affiliations:
  • University of Colombo School of Computing, Colombo, Sri Lanka;University of Colombo School of Computing, Colombo, Sri Lanka;University of Colombo School of Computing, Colombo, Sri Lanka

  • Venue:
  • ALR7 Proceedings of the 7th Workshop on Asian Language Resources
  • Year:
  • 2009

Abstract

A lexicon is an important resource in any kind of language processing application. Corpus-based lexica have several advantages over traditional approaches. The lexicon developed for Sinhala is based on text from a corpus of 10 million words drawn from diverse genres. The words extracted from the corpus have been labeled with part-of-speech categories defined according to a novel classification proposed for Sinhala. The lexicon achieves 80% coverage over unrestricted text obtained from online sources. It has been implemented in the Lexical Markup Framework (LMF).
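
The abstract does not say how coverage was measured. As a rough illustration only, the sketch below computes two common variants, token coverage and type coverage, of a lexicon over a tokenized text. The function name `coverage`, the whitespace-style tokenization, and the toy Sinhala words are all assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def coverage(lexicon, tokens):
    """Return (token_coverage, type_coverage) of `lexicon` over `tokens`.

    Hypothetical helper: the paper's actual evaluation procedure is not
    described in the abstract.
    """
    counts = Counter(tokens)
    total_tokens = sum(counts.values())
    if total_tokens == 0:
        # No text to evaluate: define coverage as zero.
        return 0.0, 0.0
    # Token coverage weights each word by how often it occurs.
    covered_tokens = sum(n for word, n in counts.items() if word in lexicon)
    # Type coverage counts each distinct word once.
    covered_types = sum(1 for word in counts if word in lexicon)
    return covered_tokens / total_tokens, covered_types / len(counts)


# Toy data, not from the corpus described above.
lexicon = {"මම", "ගෙදර", "යනවා"}
tokens = ["මම", "ගෙදර", "යනවා", "මම", "අද", "හෙට"]
token_cov, type_cov = coverage(lexicon, tokens)
print(f"token coverage: {token_cov:.2f}, type coverage: {type_cov:.2f}")
```

Token coverage is usually the higher of the two figures on real text, because frequent words are the ones most likely to be in a corpus-derived lexicon.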