Hybrid term indexing for different IR models

Authors:
Ken C. W. Chow;Robert W. P. Luk;K. F. Wong;K. L. Kwok
Affiliations:
Hong Kong Polytechnic University, Dept. Computing, Kowloon, Hong Kong;-;Chinese University of Hong Kong, Dept. Systems Eng. and Eng. Management, Shatin, Hong Kong;Queens College, CUNY, Dept. Computer Science, New York
Venue:
IRAL '00 Proceedings of the fifth international workshop on Information retrieval with Asian languages
Year:
2000

Abstract

Retrieval effectiveness depends on how terms are extracted and indexed. In Chinese text (and in other languages such as Japanese and Korean), there are no spaces to delimit words. Indexing with hybrid terms (i.e. words and bigrams) achieved the best precision relative to homogeneous terms, at a lower storage cost than indexing with bigrams. However, this was tested only with conjunctive queries. Here, we extended the weighted Boolean models using fuzzy and p-norm measures, as well as the vector space model using the cosine measure, to process hybrid terms. Our evaluation shows that all IR models using hybrid terms achieve better average precision than those using words. Across different recall values, the weighted Boolean model using fuzzy measures with hybrid terms consistently achieves precision about 8% higher than the same model using words. The vector space model using the cosine measure with hybrid terms achieved the best improvement in average recall and precision.
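As a rough illustration of the idea of indexing with words and bigrams together, the sketch below builds hybrid terms and scores two term lists with the cosine measure. It is not the authors' method: the abstract does not say how words and bigrams are combined, so the greedy longest-match segmentation, the rule that unmatched character runs become overlapping bigrams, the raw term-frequency weights, and the names hybrid_terms, lexicon, and max_len are all assumptions made for this example.

```python
import math
from collections import Counter


def bigrams(run):
    """Return the overlapping character bigrams of a string."""
    return [run[i:i + 2] for i in range(len(run) - 1)]


def flush(run):
    """Turn an unmatched character run into index terms (assumed rule)."""
    if not run:
        return []
    # A single leftover character cannot form a bigram, so keep it as is.
    return bigrams(run) if len(run) > 1 else [run]


def hybrid_terms(text, lexicon, max_len=4):
    """Hypothetical hybrid indexing: lexicon words where they match,
    bigrams for the character runs that no lexicon word covers."""
    terms, unmatched, i = [], "", 0
    while i < len(text):
        # Greedy longest match, trying words of length max_len down to 2.
        for n in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + n] in lexicon:
                terms.extend(flush(unmatched))
                unmatched = ""
                terms.append(text[i:i + n])
                i += n
                break
        else:
            # No word starts here: add the character to the unmatched run.
            unmatched += text[i]
            i += 1
    terms.extend(flush(unmatched))
    return terms


def cosine(terms_a, terms_b):
    """Cosine measure over raw term-frequency vectors (assumed weighting)."""
    va, vb = Counter(terms_a), Counter(terms_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


lexicon = {"香港", "大学"}  # toy lexicon, for illustration only
doc_terms = hybrid_terms("香港理工大学", lexicon)
query_terms = hybrid_terms("理工大学", lexicon)
print(doc_terms, query_terms, cosine(doc_terms, query_terms))
```

With this toy lexicon, the run "理工" is not a known word, so it is indexed as a bigram alongside the words "香港" and "大学". A query containing that run can therefore still match the document even though the lexicon does not cover it.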