Characteristics and retrieval effectiveness of n-gram string similarity matching on Malay documents

Authors:
Tengku Mohd T. Sembok;Zainab Abu Bakar
Affiliations:
Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Malaysia;Faculty of Mathematical Sciences and Information Technology, Universiti Teknologi MARA, Shah Alam, Malaysia
Venue:
ACACOS'11 Proceedings of the 10th WSEAS international conference on Applied computer and applied computational science
Year:
2011

Abstract

Conflation algorithms for indexing and retrieving Malay documents have received far less study than those for English. The two main classes of conflation algorithms are string-similarity algorithms and stemming algorithms. Only one Malay stemming algorithm exists, and it provides a benchmark for the experiments reported here, which apply n-gram string-similarity algorithms, in particular bigrams and trigrams, to the same Malay queries and documents. The inherent characteristics of n-grams are discussed, along with several variations of the experiments performed on the queries and documents: nonstemmed queries with nonstemmed documents; stemmed queries with nonstemmed documents; and stemmed queries with stemmed documents. Further experiments are then carried out after removing the most frequently occurring n-gram. The Dice coefficient is used both as a threshold and as a weight for ranking the retrieved documents. Besides Dice coefficients, inverse document frequency (idf) weights are also used to rank documents. Recall-precision values are calculated using an interpolation technique and standard recall-precision functions, and are then compared with the available recall-precision values for the only Malay stemming algorithm.
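
As an illustration of the n-gram string-similarity matching the abstract describes, the following is a minimal Python sketch of a Dice coefficient over character bigrams or trigrams, used as a threshold for conflating a query term with index terms. It is not the authors' implementation. The function names, the use of distinct (set-based) n-grams without word-boundary padding, the 0.5 threshold, and the sample Malay words are all assumptions made for this example.

```python
def ngrams(word, n=2):
    """Return the set of distinct character n-grams in a word.

    Assumption: no padding characters are added at word boundaries,
    and repeated n-grams are counted once.
    """
    word = word.lower()
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice(a, b, n=2):
    """Dice coefficient 2|A & B| / (|A| + |B|) over distinct n-grams."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        # A word shorter than n has no n-grams; treat it as dissimilar.
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def similar_terms(query_term, vocabulary, threshold=0.5, n=2):
    """Return (term, score) pairs at or above the threshold, best first.

    The threshold value is a hypothetical choice for illustration.
    """
    scored = [(term, dice(query_term, term, n)) for term in vocabulary]
    kept = [(term, score) for term, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical index terms: "makanan" and "memakan" are morphological
    # variants of the root "makan"; "minum" is an unrelated word.
    vocabulary = ["makanan", "memakan", "minum"]
    for term, score in similar_terms("makan", vocabulary):
        print(f"{term}: {score:.3f}")
```

With bigrams, "makan" shares all four of its distinct bigrams with both "makanan" and "memakan", so both variants pass the threshold, while "minum" shares none and is dropped. Passing `n=3` switches the same functions to trigrams.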