A comparison of Chinese document indexing strategies and retrieval models

Authors:
Robert W. P. Luk; K. L. Kwok
Affiliations:
The Hong Kong Polytechnic University, Hong Kong; Queens College, City University of New York
Venue:
ACM Transactions on Asian Language Information Processing (TALIP)
Year:
2002

Abstract

With the advent of the Internet and intranets, substantial interest is being shown in Asian language information retrieval, especially in Chinese, which is a good example of an Asian ideographic language (other examples include Japanese and Korean). Because spaces do not delimit words in this type of language, an important issue is which index terms should be extracted from documents. This issue also has wider implications for indexing other languages, such as agglutinative languages (e.g., Finnish and Turkish) and archaic ideographic languages like Egyptian hieroglyphs, and for other types of information, such as data stored in genomic databases. Although comparisons of indexing strategies for Chinese documents have been made, almost all of them are based on a single retrieval model. This article compares the performance of various combinations of indexing strategies (i.e., character, word, short-word, bigram, and Pircs indexing) and retrieval models (i.e., vector space, 2-Poisson, logistic regression, and Pircs models). We determine which model (and its parameters) achieves the best or near-best retrieval effectiveness without relevance feedback, and compare it with the open evaluations (i.e., TREC and NTCIR) for both long and title queries. In addition, we describe a more extensive investigation of retrieval efficiency. In particular, the storage cost of word indexing is only slightly higher than that of character indexing, while bigram indexing costs about twice as much storage as the other indexing strategies. Retrieval time typically varies linearly with the number of unique terms in the query, as supported by correlation values above 90%. The Pircs retrieval system achieves robust and good retrieval performance but appears to be the slowest method, whereas the vector space models were not very effective in retrieval but responded quickly. When storage overhead is not a consideration, the 2-Poisson model using bigram indexing offers robust, near-best retrieval effectiveness and appears to be a good compromise between retrieval effectiveness and efficiency for both long and title queries.
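
As a rough illustration of two of the indexing strategies named in the abstract, the sketch below extracts character and overlapping-bigram index terms from an unsegmented Chinese string. It is a minimal sketch under stated assumptions, not the article's implementation: the function names `char_terms` and `bigram_terms` and the sample sentence are hypothetical, punctuation handling is ignored, and word, short-word, and Pircs indexing are not shown because they depend on a dictionary or segmenter.

```python
def char_terms(text):
    """Character indexing: treat every non-whitespace character as a term."""
    return [ch for ch in text if not ch.isspace()]


def bigram_terms(text):
    """Bigram indexing: treat every overlapping pair of adjacent characters
    as a term. Assumption: whitespace is dropped first, so pairs may span
    what was originally a space; punctuation is not treated specially."""
    chars = char_terms(text)
    return [a + b for a, b in zip(chars, chars[1:])]


# Hypothetical sample: "Chinese information retrieval"
sentence = "中文信息检索"
print(char_terms(sentence))    # ['中', '文', '信', '息', '检', '索']
print(bigram_terms(sentence))  # ['中文', '文信', '信息', '息检', '检索']
```

A string of n characters yields n character terms and n - 1 bigram terms, but the set of distinct bigrams in a collection is far larger than the set of distinct characters. That is one plausible contributor to the larger bigram index size reported above.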