A collaborative framework for collecting Thai unknown words from the web

Authors:
Choochart Haruechaiyasak;Chatchawal Sangkeettrakarn;Pornpimon Palingoon;Sarawoot Kongyoung;Chaianun Damrongrat
Affiliations:
National Electronics and Computer Technology Center (NECTEC), Klong Luang, Pathumthani, Thailand (all authors)
Venue:
COLING-ACL '06 Proceedings of the COLING/ACL on Main conference poster sessions
Year:
2006



Abstract

We propose a collaborative framework for collecting Thai unknown words found on Web pages. Our main goal is to design and build a Web-based system that lets a group of interested users participate in constructing an open dictionary of Thai unknown words. The framework provides supporting algorithms and tools for automatically identifying and extracting unknown words from the Web pages at given URLs. The system produces a list of unknown-word candidates, which are presented to the users for verification. Approved unknown words can then be combined with the existing lexicon to improve the performance of NLP tasks such as word segmentation, information retrieval, and machine translation. Because written Thai does not mark word boundaries, the framework includes word segmentation and morphological analysis modules. To take advantage of the large amount of text available on the Web, our approach to identifying unknown-word boundaries is based on a statistical string pattern-matching algorithm.
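As a rough illustration of the pipeline the abstract outlines (segment unspaced text against a lexicon, collect the spans the lexicon cannot cover, and keep the ones that recur across pages), the following Python sketch may help. It is not the authors' algorithm: greedy longest-match segmentation and a plain frequency threshold stand in for the paper's segmentation module and statistical string pattern matching, and the names `segment`, `unknown_candidates`, and `min_count` are hypothetical. A toy Latin-alphabet lexicon is used in place of Thai text for readability.

```python
from collections import Counter


def segment(text, lexicon):
    """Greedy longest-match segmentation (an assumed stand-in technique).

    Returns a list of (token, is_known) pairs. Characters that no lexicon
    word covers are merged into a single unknown token.
    """
    max_len = max((len(w) for w in lexicon), default=1)
    tokens = []
    unknown = []  # buffer of consecutive characters with no lexicon match
    i = 0
    while i < len(text):
        match = None
        # Try the longest lexicon word that starts at position i.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                match = text[i:j]
                break
        if match is None:
            unknown.append(text[i])
            i += 1
            continue
        if unknown:
            # A known word ends the current unknown span.
            tokens.append(("".join(unknown), False))
            unknown = []
        tokens.append((match, True))
        i += len(match)
    if unknown:
        tokens.append(("".join(unknown), False))
    return tokens


def unknown_candidates(pages, lexicon, min_count=2):
    """Count unknown spans over all pages and keep the recurring ones.

    The frequency threshold is an illustrative assumption, not the
    statistical criterion used in the paper.
    """
    counts = Counter(
        token
        for page in pages
        for token, known in segment(page, lexicon)
        if not known
    )
    return {word: n for word, n in counts.items() if n >= min_count}


lexicon = {"the", "cat", "sat", "on", "mat"}
pages = ["thecatxyz", "xyzonthemat", "thecatqq"]
candidates = unknown_candidates(pages, lexicon)
```

In a collaborative setting like the one described, `candidates` would be the list shown to users for verification, and approved entries would be added to `lexicon` before the next pass.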