A fast string searching algorithm
Communications of the ACM
Journal of the American Society for Information Science and Technology
Categorizing unknown words: using decision trees to identify names and misspellings
ANLC '00 Proceedings of the sixth conference on Applied natural language processing
POS disambiguation and unknown word guessing with decision trees
EACL '99 Proceedings of the ninth conference on European chapter of the Association for Computational Linguistics
Unknown word extraction for Chinese documents
COLING '02 Proceedings of the 19th international conference on Computational linguistics - Volume 1
Japanese unknown word identification by character-based chunking
COLING '04 Proceedings of the 20th international conference on Computational Linguistics
LearnLexTo: a machine-learning based word segmentation for indexing Thai texts
Proceedings of the 2nd ACM workshop on Improving non english web searching
PAKDD '09 Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
ICCCI'10 Proceedings of the Second international conference on Computational collective intelligence: technologies and applications - Volume Part II
Boosting-based ensemble learning with penalty profiles for automatic Thai unknown word recognition
Computers & Mathematics with Applications
Hi-index | 0.00 |
We propose a collaborative framework for collecting Thai unknown words found on Web pages over the Internet. Our main goal is to design and construct a Web-based system which allows a group of interested users to participate in constructing a Thai unknown-word open dictionary. The proposed framework provides supporting algorithms and tools for automatically identifying and extracting unknown words from Web pages of given URLs. The system yields the result of unknown-word candidates which are presented to the users for verification. The approved unknown words could be combined with the set of existing words in the lexicon to improve the performance of many NLP tasks such as word segmentation, information retrieval and machine translation. Our framework includes word segmentation and morphological analysis modules for handling the non-segmenting characteristic of Thai written language. To take advantage of large available text resource on the Web, our unknown-word boundary identification approach is based on the statistical string pattern-matching algorithm.