A Chinese dictionary construction algorithm for information retrieval

Authors:
Honglan Jin;Kam-Fai Wong
Affiliations:
The Chinese University of Hong Kong;The Chinese University of Hong Kong
Venue:
ACM Transactions on Asian Language Information Processing (TALIP)
Year:
2002

Abstract

In this article we propose a statistics-based method for automatically constructing a dictionary from raw Chinese text. The method first uses local statistical information (i.e., data within a single document) to identify and discard, at an early stage, repeated string patterns that are merely substrings of legitimate words. Global statistical information (i.e., data from the entire corpus) and contextual constraints are then used for further filtering. The method can be used to alleviate the out-of-vocabulary (OOV) problem, which is common in dictionary-based natural language processing applications such as word segmentation. It can handle text corpora dynamically, and it imposes no strict requirements on the size or quality of the training corpora. Using our method, we constructed Chinese dictionaries from different Chinese corpora and applied the words in the constructed dictionaries to indexing in information retrieval (IR). Retrieval performance using these indexes was compared with retrieval performance using indexes produced from static dictionaries. Three Chinese corpora with different character-encoding schemes and language styles were used in the experiments. The results show that retrieval using indexes based on the constructed dictionary is effective. This implies that fully automatic Chinese dictionary construction from dynamic data sources, e.g., the Internet, is feasible for the purposes of IR. From the experiments, we made two observations: (1) a portion of a dictionary is enough to produce good retrieval performance; e.g., a dictionary consisting of only the 500 highest-frequency strings extracted from the NTCIR 2 Chinese corpus gave retrieval results as good as those of a more complete dictionary with over 100K entries; and (2) complete word segmentation is not a strict requirement for practical information retrieval.
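
The two-stage filtering described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not give the exact statistics, thresholds, or contextual constraints, so the function names (`repeated_ngrams`, `drop_redundant_substrings`, `build_dictionary`), the n-gram length limits, the "occurs at least as often" substring rule, and the document-frequency threshold are all assumptions chosen for illustration. The contextual constraints are omitted entirely.

```python
from collections import Counter


def repeated_ngrams(doc, min_len=2, max_len=4):
    """Count character n-grams in one document; keep those seen at least twice.
    The length limits are illustrative assumptions."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(doc) - n + 1):
            counts[doc[i:i + n]] += 1
    return {s: c for s, c in counts.items() if c >= 2}


def drop_redundant_substrings(counts):
    """Local filter (assumed rule): discard a string when a longer repeated
    string contains it and occurs at least as often, i.e. the shorter string
    never appears on its own in this document."""
    kept = {}
    for s, c in counts.items():
        absorbed = any(
            len(t) > len(s) and s in t and counts[t] >= c
            for t in counts
        )
        if not absorbed:
            kept[s] = c
    return kept


def build_dictionary(corpus, min_doc_freq=2, top_k=None):
    """Global filter (assumed rule): keep strings that survive the local
    filter in at least min_doc_freq documents, ranked by total frequency.
    top_k optionally keeps only the highest-frequency entries."""
    doc_freq = Counter()
    total_freq = Counter()
    for doc in corpus:
        local = drop_redundant_substrings(repeated_ngrams(doc))
        for s, c in local.items():
            doc_freq[s] += 1
            total_freq[s] += c
    candidates = [s for s in total_freq if doc_freq[s] >= min_doc_freq]
    candidates.sort(key=lambda s: (-total_freq[s], s))
    return candidates[:top_k] if top_k is not None else candidates


corpus = [
    "信息检索系统。信息检索方法。",
    "信息检索很重要，信息检索很有用。",
]
print(build_dictionary(corpus))  # → ['信息检索']
```

In this toy corpus the local filter removes fragments such as 信息 and 检索 because they never occur outside 信息检索, and the global filter removes 息检索很, which is repeated in only one document. The `top_k` parameter mirrors the observation that a small set of high-frequency strings may be enough for indexing.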