A syntactic resource for Thai: CG treebank

  • Authors:
  • Taneth Ruangrajitpakorn; Kanokorn Trakultaweekoon; Thepchai Supnithi

  • Affiliations:
  • National Electronics and Computer Technology Center, Klong, Klong Luang Pathumthani, Thailand (all authors)

  • Venue:
  • ALR7 Proceedings of the 7th Workshop on Asian Language Resources
  • Year:
  • 2009

Abstract

This paper presents a syntactic resource for Thai: the Thai CG treebank, a language resource built on a categorial approach. Because very few Thai syntactic resources exist, we set out to create a treebank based on the categorial grammar (CG) formalism. A Thai corpus was parsed with an existing CG syntactic dictionary and an LALR parser, and the correctly parsed trees were collected as a preliminary CG treebank. It consists of 50,346 trees from 27,239 utterances. The trees fall into three grammatical types: 12,876 sentential trees, 13,728 noun-phrase trees, and 18,342 verb-phrase trees. A total of 17,847 utterances yield exactly one tree, and the average number of trees per utterance is 1.85.
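
The per-utterance figures in the abstract can be checked with simple arithmetic. The sketch below is illustrative only and is not taken from the paper: the function and constant names are hypothetical, and the last function assumes that every utterance counted in the treebank has at least one tree, which the abstract implies but does not state outright.

```python
# Figures quoted in the abstract.
TOTAL_TREES = 50_346
TOTAL_UTTERANCES = 27_239
SINGLE_TREE_UTTERANCES = 17_847


def average_trees_per_utterance(trees: int, utterances: int) -> float:
    """Mean number of trees per utterance."""
    return trees / utterances


def single_tree_share(single: int, utterances: int) -> float:
    """Fraction of utterances that receive exactly one tree."""
    return single / utterances


def average_trees_when_ambiguous(trees: int, utterances: int, single: int) -> float:
    """Mean number of trees among utterances with more than one tree.

    Assumption (not stated in the abstract): every counted utterance has
    at least one tree, so the single-tree utterances account for exactly
    `single` trees and the rest belong to the remaining utterances.
    """
    ambiguous_utterances = utterances - single
    ambiguous_trees = trees - single
    return ambiguous_trees / ambiguous_utterances


if __name__ == "__main__":
    avg = average_trees_per_utterance(TOTAL_TREES, TOTAL_UTTERANCES)
    share = single_tree_share(SINGLE_TREE_UTTERANCES, TOTAL_UTTERANCES)
    amb = average_trees_when_ambiguous(
        TOTAL_TREES, TOTAL_UTTERANCES, SINGLE_TREE_UTTERANCES
    )
    print(f"average trees per utterance: {avg:.2f}")
    print(f"share of single-tree utterances: {share:.1%}")
    print(f"average trees per ambiguous utterance: {amb:.2f}")
```

The first value reproduces the 1.85 average reported above. The other two are derived figures that do not appear in the abstract and hold only under the stated assumption.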