Improving English and Chinese ad-hoc retrieval: TIPSTER text phase 3 final report

Authors:
Kui-Lam Kwok
Affiliations:
Queens College, CUNY, Flushing, NY
Venue:
TIPSTER '98 Proceedings of a workshop on held at Baltimore, Maryland: October 13-15, 1998
Year:
1998

Abstract

We investigated both English and Chinese ad-hoc information retrieval (IR). One of our objectives is to study the use of term, phrasal and topical concept level evidence, either individually or in combination, to improve retrieval accuracy. For short queries, we studied five term level techniques that together improve on standard ad-hoc 2-stage retrieval by some 20% to 40% in the TREC-5 and TREC-6 experiments.

For long queries, we studied linguistic phrases as evidence to re-rank the outputs of term level retrieval. This brings small improvements in both the TREC-5 and TREC-6 experiments, but needs further confirmation. We also investigated clustering of the output documents from term level retrieval. Our aim is to separate relevant and irrelevant documents into different clusters, and to re-rank the output list by groups based on query and cluster-profile matching. This investigation is still ongoing.

For Chinese IR, many results were confirmed or discovered. For example, accurate word segmentation is not as important as first thought, but short-word segmentation is preferable to long-word (phrase) segmentation. A simple bigram representation can give very good retrieval. A stopword list is not necessary, and the presence of non-content terms does not hurt evaluation results much. One need only screen out statistical stopwords of high frequency. Character indexing by itself is not competitive, but is useful for augmenting short-words or bigrams. The best results were obtained by combining retrievals of bigram and short-word with character representation. Chinese IR returns better precision than English, and it is not clear whether this is a language-related or collection-related phenomenon.