Character cluster based Thai information retrieval

  • Authors:
  • Thanaruk Theeramunkong;Virach Sornlertlamvanich;Thanasan Tanhermhong;Wirat Chinnan

  • Affiliations:
  • Information Technology Program, Sirindhorn International Institute of Technology, Thammasat University, P.O. Box 22 Thammasat Rangsit Post Office, Pathumthani 12121, Thailand;Software and Engineering Laboratory, National Electronics and Computer Technology Center (NECTEC), National Science and Technology Development Agency (NSTDA), Gypsum Metropolitan Tower 22nd Floor, ...;Information Technology Program, Sirindhorn International Institute of Technology, Thammasat University, P.O. Box 22 Thammasat Rangsit Post Office, Pathumthani 12121, Thailand;Information Technology Program, Sirindhorn International Institute of Technology, Thammasat University, P.O. Box 22 Thammasat Rangsit Post Office, Pathumthani 12121, Thailand

  • Venue:
  • IRAL '00 Proceedings of the fifth international workshop on Information retrieval with Asian languages
  • Year:
  • 2000

Abstract

Some languages, including Thai, Japanese and Chinese, do not mark word boundaries explicitly. The resulting word-boundary ambiguity lowers the accuracy of information retrieval. This paper proposes a new technique, called character clustering, to reduce word-boundary ambiguity in Thai documents and hence improve search efficiency. To evaluate it, a set of experiments on Thai newspapers is conducted with both non-indexing and indexing search approaches. The experimental results show that our method outperforms the traditional methods in both approaches on all measures.
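
The abstract does not spell out the clustering rules, so the following is only a minimal sketch of the general idea, under assumptions that are not taken from the paper. It assumes that a Thai leading vowel binds to the consonant after it, and that above/below vowels, tone marks and sara am bind to the character before them. The function names `char_clusters` and `find_on_boundaries` and the simplified rule set are hypothetical. The search function illustrates how a non-indexing (string-matching) search might reject matches that cut through a cluster.

```python
# Simplified, assumed rule set; not the rules used in the paper.
LEADING_VOWELS = set("\u0e40\u0e41\u0e42\u0e43\u0e44")  # vowels written before a consonant
ATTACH_TO_PREVIOUS = (
    {"\u0e31", "\u0e33"}                          # mai han-akat, sara am
    | {chr(c) for c in range(0x0E34, 0x0E3B)}     # above/below vowels, phinthu
    | {chr(c) for c in range(0x0E47, 0x0E4F)}     # maitaikhu, tone marks, other signs
)


def is_consonant(ch):
    """True for Thai consonants (U+0E01 to U+0E2E)."""
    return "\u0e01" <= ch <= "\u0e2e"


def char_clusters(text):
    """Split text into inseparable character groups under the assumed rules."""
    clusters = []
    after_leading_vowel = False
    for ch in text:
        joins_previous = ch in ATTACH_TO_PREVIOUS or (
            after_leading_vowel and is_consonant(ch)
        )
        if clusters and joins_previous:
            clusters[-1] += ch
        else:
            clusters.append(ch)
        after_leading_vowel = ch in LEADING_VOWELS
    return clusters


def find_on_boundaries(text, query):
    """Return start offsets where query occurs aligned to cluster boundaries."""
    bounds = {0}
    pos = 0
    for cluster in char_clusters(text):
        pos += len(cluster)
        bounds.add(pos)
    hits = []
    start = text.find(query)
    while start != -1:
        if start in bounds and start + len(query) in bounds:
            hits.append(start)
        start = text.find(query, start + 1)
    return hits
```

For the four-character string `"\u0e40\u0e1b\u0e47\u0e19"`, `char_clusters` returns two clusters: the first three characters together, then the final consonant. A query for the lone consonant `"\u0e1b"` occurs as a raw substring at offset 1, but `find_on_boundaries` rejects it because that offset lies inside a cluster.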