Characteristics and retrieval effectiveness of n-gram string similarity matching on Malay documents

  • Authors:
  • Tengku Mohd T. Sembok;Zainab Abu Bakar

  • Affiliations:
  • Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Malaysia;Faculty of Mathematical Sciences and Information Technology, Universiti Teknologi MARA, Shah Alam, Malaysia

  • Venue:
  • ACACOS'11 Proceedings of the 10th WSEAS international conference on Applied computer and applied computational science
  • Year:
  • 2011

Abstract

Compared with English, very few studies have examined the use of conflation algorithms for indexing and retrieving Malay documents. The two main classes of conflation algorithms are string-similarity algorithms and stemming algorithms. There is only one existing Malay stemming algorithm, and it provides a benchmark for the following experiments, which apply n-gram string-similarity algorithms, in particular bigram and trigram matching, to the same Malay queries and documents. Inherent characteristics of n-grams and several variations of the experiments performed on the queries and documents are discussed. The variations are: nonstemmed queries and nonstemmed documents; stemmed queries and nonstemmed documents; and stemmed queries and stemmed documents. Further experiments are then carried out after removing the most frequently occurring n-gram. The Dice coefficient is used both as a threshold and as a weight in ranking the retrieved documents. Besides Dice coefficients, inverse document frequency (idf) weights are also used to rank documents. An interpolation technique and standard recall-precision functions are used to calculate recall-precision values. These values are then compared with the available recall-precision values of the only Malay stemming algorithm.
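
As a rough illustration of the n-gram matching the abstract describes, the sketch below computes a Dice coefficient over character bigrams or trigrams and keeps the vocabulary words that reach a threshold. It is a minimal sketch, not the authors' implementation: the function names (`ngrams`, `dice`, `matches`), the use of distinct n-grams without padding, the threshold value of 0.5, and the sample words are all assumptions made for the example.

```python
def ngrams(word, n=2):
    """Return the set of distinct character n-grams of a word.

    Assumption: no padding characters are added and repeated
    n-grams are counted once; the paper may do this differently.
    """
    word = word.lower()
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice(a, b, n=2):
    """Dice coefficient 2*|A & B| / (|A| + |B|) over distinct n-grams."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        # A word shorter than n has no n-grams; treat it as dissimilar.
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def matches(query_term, vocabulary, n=2, threshold=0.5):
    """Return (score, word) pairs at or above the threshold, best first."""
    scored = [(dice(query_term, word, n), word) for word in vocabulary]
    return sorted((pair for pair in scored if pair[0] >= threshold),
                  reverse=True)


if __name__ == "__main__":
    # Hypothetical index vocabulary: two variants of "makan" and one unrelated word.
    vocabulary = ["makanan", "memakan", "minum"]
    for score, word in matches("makan", vocabulary, n=2, threshold=0.5):
        print(f"{word}: {score:.3f}")
```

With bigrams, "makan" and "makanan" share four of their distinct bigrams (four and five in total), giving 8/9; with trigrams the same pair gives 6/8. The same score serves as the cut-off for conflating a term and as a weight for ranking, which is the dual role the abstract assigns to the Dice coefficient.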