Construction of Vietnamese corpora for named entity recognition

  • Authors:
  • Pham T. X. Thao; T. Q. Tri; Ai Kawazoe; Dien Dinh; Nigel Collier

  • Affiliations:
  • University of Information Technology - VNU of HCMC, Vietnam; University of Information Technology - VNU of HCMC, Vietnam; National Institute of Informatics, Tokyo, Japan; University of Natural Sciences - VNU of HCMC, Vietnam; National Institute of Informatics, Tokyo, Japan

  • Venue:
  • Large Scale Semantic Access to Content (Text, Image, Video, and Sound)
  • Year:
  • 2007

Abstract

To build an automatic named entity recognition (NER) system using a machine learning approach, a large tagged corpus is widely seen as a necessary knowledge resource. However, manual construction is time-consuming, labor-intensive and expensive. Building NER corpora for European languages has been extensively studied, while less-studied languages such as Vietnamese have not yet received much attention. This paper describes the construction of a Vietnamese corpus, Vietnamese guidelines for annotators and a tagging tool that we make publicly available. We report on a comparison with the English named entity (NE) corpus in our multilingual NER system.