Automatic corpus-based Thai word extraction with the C4.5 learning algorithm

Authors:
Virach Sornlertlamvanich;Tanapong Potipiti;Thatsanee Charoenporn
Affiliations:
Ministry of Science, Technology and Environment, Bangkok, Thailand;Ministry of Science, Technology and Environment, Bangkok, Thailand;Ministry of Science, Technology and Environment, Bangkok, Thailand
Venue:
COLING '00 Proceedings of the 18th conference on Computational linguistics - Volume 2
Year:
2000

Abstract

The notion of a "word" is difficult to define in languages that do not mark word boundaries explicitly, such as Thai. Traditional methods of defining words for such languages depend on human judgement, which rests on unclear criteria or procedures, and they have several limitations. This paper proposes an algorithm for extracting words from Thai text without relying on word segmentation. We employ the C4.5 learning algorithm for this task. Several attributes, such as string length, frequency, mutual information and entropy, are used to decide whether a string is a word or a non-word. Our experiments yield high precision, about 85%, on both the training and test corpora.