Using search engine to construct a scalable corpus for Vietnamese lexical development for word segmentation

  • Authors:
  • Doan Nguyen

  • Affiliations:
  • Hewlett-Packard Company

  • Venue:
  • ALR7 Proceedings of the 7th Workshop on Asian Language Resources
  • Year:
  • 2009

Abstract

As web content becomes more accessible to the Vietnamese community across the globe, there is a need to process Vietnamese query texts properly in order to find relevant information. The recent deployment of a Vietnamese translation tool on a well-known search engine attests to the language's growing popularity on the World Wide Web. Problems remain in the translation and retrieval of Vietnamese text, however, because word recognition for the language has not been fully addressed. In this paper we introduce a semi-supervised approach to building a general, scalable web corpus for Vietnamese, using a search engine to facilitate the word segmentation process. We also propose a segmentation algorithm that effectively recognizes Out-Of-Vocabulary (OOV) words. The results indicate that our solution is scalable and can be applied to real-time translation programs and other linguistic applications. This work is a continuation of Nguyen D. (2008).