Using search engine to construct a scalable corpus for Vietnamese lexical development for word segmentation

Authors:
Doan Nguyen
Affiliations:
Hewlett-Packard Company
Venue:
ALR7 Proceedings of the 7th Workshop on Asian Language Resources
Year:
2009

References:
Mining the web to create minority language corpora. Proceedings of the tenth international conference on Information and knowledge management
Query preprocessing: improving web search through a Vietnamese word tokenization approach. Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval

Cited by:
Text disambiguation using support vector machine: an initial study. PRICAI'10 Proceedings of the 11th Pacific Rim international conference on Trends in artificial intelligence

Abstract

As web content becomes more accessible to the Vietnamese community across the globe, there is a need to process Vietnamese query texts properly in order to find relevant information. The recent deployment of a Vietnamese translation tool on a well-known search engine underscores the importance of this need as Vietnamese gains popularity on the World Wide Web. Problems remain in the translation and retrieval of Vietnamese text because word recognition has not been fully addressed. In this paper we introduce a semi-supervised approach to building a general, scalable web corpus for Vietnamese, using a search engine to facilitate the word segmentation process. We also propose a segmentation algorithm that effectively recognizes Out-Of-Vocabulary (OOV) words. The results indicate that our solution is scalable and can be applied to real-time translation programs and other linguistic applications. This work is a continuation of Nguyen D. (2008).