Character cluster based Thai information retrieval

Authors:
Thanaruk Theeramunkong; Virach Sornlertlamvanich; Thanasan Tanhermhong; Wirat Chinnan
Affiliations:
Information Technology Program, Sirindhorn International Institute of Technology, Thammasat University, P.O. Box 22 Thammasat Rangsit Post Office, Pathumthani 12121, Thailand; Software and Engineering Laboratory, National Electronics and Computer Technology Center (NECTEC), National Science and Technology Development Agency (NSTDA), Gypsum Metropolitan Tower 22nd Floor, ...; Information Technology Program, Sirindhorn International Institute of Technology, Thammasat University, P.O. Box 22 Thammasat Rangsit Post Office, Pathumthani 12121, Thailand; Information Technology Program, Sirindhorn International Institute of Technology, Thammasat University, P.O. Box 22 Thammasat Rangsit Post Office, Pathumthani 12121, Thailand
Venue:
IRAL '00 Proceedings of the fifth international workshop on Information retrieval with Asian languages
Year:
2000

Abstract

Some languages, including Thai, Japanese and Chinese, do not mark word boundaries explicitly. The resulting word boundary ambiguity reduces the accuracy of information retrieval. This paper proposes a new technique, called character clustering, to reduce word boundary ambiguity in Thai documents and hence improve searching efficiency. To investigate its efficiency, a set of experiments on Thai newspapers is conducted with both non-indexing and indexing search approaches. The experimental results show that our method outperforms the traditional methods in both the non-indexing and the indexing approaches on all measures.
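
The abstract does not list the clustering rules, so the following is only a minimal sketch of the general idea for the non-indexing case, under assumed and heavily simplified rules: a leading vowel (U+0E40 to U+0E44) is kept with the consonant that follows it, and an above/below vowel or tone mark is kept with the character that precedes it. The names `cluster_boundaries` and `cluster_search` are hypothetical and do not come from the paper. A substring match is accepted only when it starts and ends on a cluster boundary.

```python
# Assumed, simplified cluster rules; NOT the paper's actual rule set.
# Leading vowels are written before the consonant they belong to.
LEADING_VOWELS = {chr(c) for c in range(0x0E40, 0x0E45)}
# Marks written above or below the preceding character:
# U+0E31, U+0E34..U+0E3A, U+0E47..U+0E4E.
DEPENDENT_MARKS = {
    chr(c) for c in [0x0E31, *range(0x0E34, 0x0E3B), *range(0x0E47, 0x0E4F)]
}


def cluster_boundaries(text):
    """Return the set of offsets at which a character cluster may start or end."""
    bounds = {0, len(text)}
    for i in range(1, len(text)):
        if text[i] in DEPENDENT_MARKS:
            continue  # the mark stays with the preceding character
        if text[i - 1] in LEADING_VOWELS:
            continue  # the leading vowel stays with the following consonant
        bounds.add(i)
    return bounds


def cluster_search(text, pattern):
    """Find occurrences of pattern that begin and end on cluster boundaries."""
    bounds = cluster_boundaries(text)
    hits = []
    start = text.find(pattern)
    while start != -1:
        if start in bounds and start + len(pattern) in bounds:
            hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


# A plain substring search finds "กา" inside "เกา", splitting the cluster "เก".
plain_hit = "เกา".find("กา")
# The cluster-aware search rejects that match but keeps the standalone one.
filtered = cluster_search("กาเกา", "กา")
```

In this sketch the plain search reports a hit at offset 1 of "เกา", while the cluster-aware search returns only the occurrence at offset 0 of "กาเกา". The same boundary set could in principle be used to choose indexing units for the indexing approach, but that is not shown here.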