A Chinese dictionary construction algorithm for information retrieval

  • Authors:
  • Honglan Jin;Kam-Fai Wong

  • Affiliations:
  • The Chinese University of Hong Kong;The Chinese University of Hong Kong

  • Venue:
  • ACM Transactions on Asian Language Information Processing (TALIP)
  • Year:
  • 2002

Quantified Score

Hi-index 0.00

Visualization

Abstract

In this article we propose a method for constructing, from raw Chinese text, a statistics-based automatic dictionary. The method makes use of local statistical information (i.e., data within a document) to identify and discard repeated string patterns, which, at an earlier stage, were substrings of legitimate words. Global statistical information (which exists throughout the entire corpus) and contextual constraints are then used for further filtering. The method can be used to alleviate the out-of-vocabulary (OOV) problem, which is commonly found in dictionary-based natural language information-processing applications, e.g., word segmentation. It can handle text corpora dynamically and, further, it does not impose any strict requirements on the size and quality of the training corpora. Based on our method, we constructed Chinese dictionaries from different Chinese corpora. We then applied the words in the constructed dictionaries to indexing in information retrieval (IR). Retrieval performance using such indexes was compared to the same, but based on indexes produced by static dictionaries. Three Chinese corpora using various character-encoding schemes and language styles were used in the experiments. The results show that retrieval using indexes based on the constructed dictionary is effective. This implies that fully automatic Chinese dictionary construction based on dynamic data sources, e.g., from the Internet, for the purposes of IR is feasible. Drawing on the experiment, we were able to make some interesting observations: (1) using only a portion of a dictionary is enough to produce good retrieval performance, e.g., a dictionary consisting of only the 500 highest-frequency strings extracted from the NTCIR 2 Chinese corpus produced as good a retrieval result as using a more complete dictionary with over 100K entries; and (2) complete word segmentation is not a strict requirement for achieving practical information retrieval.