Compared with English, there have been very few studies of conflation algorithms for indexing and retrieving Malay documents. The two main classes of conflation algorithms are string-similarity algorithms and stemming algorithms. Only one Malay stemming algorithm exists, and it provides a benchmark for the experiments reported here, which apply n-gram string-similarity algorithms, specifically bigrams and trigrams, to the same Malay queries and documents. Inherent characteristics of n-grams are discussed, together with several variations of the experiments performed on the queries and documents. The variations are: nonstemmed queries and nonstemmed documents; stemmed queries and nonstemmed documents; and stemmed queries and stemmed documents. Further experiments are then carried out after removing the most frequently occurring n-gram. The Dice coefficient is used both as a threshold and as a weight in ranking the retrieved documents. Besides Dice coefficients, inverse document frequency (idf) weights are also used to rank documents. An interpolation technique and standard recall-precision functions are used to calculate recall-precision values. These values are then compared with the available recall-precision values of the only Malay stemming algorithm.
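The n-gram matching and weighting described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes n-grams are counted as distinct sets per word, uses the common log(N/df) form of idf, and the function names, the example words, and the 0.6 threshold are hypothetical choices made for illustration only.

```python
import math


def ngrams(word, n=2):
    """Return the set of distinct character n-grams in a word (assumed: no padding)."""
    word = word.lower()
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice(a, b, n=2):
    """Dice coefficient: 2 * shared n-grams / (n-grams in a + n-grams in b)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def conflate(query_term, vocabulary, threshold=0.6, n=2):
    """Return (term, score) pairs whose Dice score meets the threshold.

    The threshold value is a hypothetical default, not one taken from the study.
    """
    scored = [(term, dice(query_term, term, n)) for term in vocabulary]
    return sorted(
        [(t, s) for t, s in scored if s >= threshold],
        key=lambda pair: pair[1],
        reverse=True,
    )


def idf(term, documents):
    """Inverse document frequency as log(N / df); documents are sets of terms."""
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / df) if df else 0.0


if __name__ == "__main__":
    # "makan" and "makanan" share a Malay root; "minum" does not.
    for term, score in conflate("makan", ["makanan", "memakan", "minum"]):
        print(f"{term}: {score:.3f}")
```

With bigrams, a short word and its affixed form share most of their n-grams, so they score well above an unrelated word. Trigrams are stricter: the same pair scores lower because each affix disturbs more of the overlapping units.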