Mining the web to create minority language corpora
Proceedings of the tenth international conference on Information and knowledge management
Query preprocessing: improving web search through a Vietnamese word tokenization approach
Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
Text disambiguation using support vector machine: an initial study
PRICAI'10 Proceedings of the 11th Pacific Rim international conference on Trends in artificial intelligence
As web content becomes more accessible to the Vietnamese community across the globe, there is a need to process Vietnamese query texts properly in order to find relevant information. The recent deployment of a Vietnamese translation tool on a well-known search engine underlines the growing importance of Vietnamese on the World Wide Web. Problems remain in the translation and retrieval of Vietnamese text, however, because word recognition for the language has not been fully addressed. In this paper we introduce a semi-supervised approach to building a general, scalable web corpus for Vietnamese, using a search engine to facilitate the word segmentation process. We also propose a segmentation algorithm that effectively recognizes Out-Of-Vocabulary (OOV) words. The results indicate that our solution is scalable and can be applied to real-time translation programs and other linguistic applications. This work is a continuation of the work of Nguyen D. (2008).