In this article we propose a method for automatically constructing a statistics-based dictionary from raw Chinese text. The method uses local statistical information (i.e., data within a single document) to identify and discard repeated string patterns which, at an earlier stage, were substrings of legitimate words. Global statistical information (gathered across the entire corpus) and contextual constraints are then used for further filtering. The method can alleviate the out-of-vocabulary (OOV) problem that is common in dictionary-based natural language information-processing applications such as word segmentation. It can handle text corpora dynamically and imposes no strict requirements on the size or quality of the training corpora. Using our method, we constructed Chinese dictionaries from different Chinese corpora and applied the words in those dictionaries to indexing for information retrieval (IR). Retrieval performance with these indexes was compared against retrieval performance with indexes produced from static dictionaries. Three Chinese corpora with different character-encoding schemes and language styles were used in the experiments. The results show that retrieval using indexes based on the constructed dictionaries is effective, which implies that fully automatic Chinese dictionary construction from dynamic data sources (e.g., the Internet) is feasible for IR purposes. The experiments also yielded two observations: (1) a portion of a dictionary is enough to produce good retrieval performance, e.g., a dictionary consisting of only the 500 highest-frequency strings extracted from the NTCIR 2 Chinese corpus produced retrieval results as good as those of a more complete dictionary with over 100K entries; and (2) complete word segmentation is not a strict requirement for achieving practical information retrieval.
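The two-stage idea described above (local filtering within a document, then global filtering across the corpus) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the function names, the `max_len`, `min_count`, and `min_docs` thresholds, the "same count as a longer string" subsumption rule, and the single-character fallback during indexing are all assumptions chosen for illustration, and the contextual constraints mentioned in the abstract are omitted.

```python
from collections import Counter


def repeated_strings(doc, max_len=4, min_count=2):
    """Local statistics: substrings of length 2..max_len that repeat in one document."""
    counts = Counter(
        doc[i:i + n]
        for n in range(2, max_len + 1)
        for i in range(len(doc) - n + 1)
    )
    return {s: c for s, c in counts.items() if c >= min_count}


def drop_subsumed(local):
    """Discard a string if a longer repeated string contains it with the same count
    (assumed rule: it then only ever appears as part of the longer string)."""
    return {
        s: c
        for s, c in local.items()
        if not any(len(t) > len(s) and s in t and local[t] == c for t in local)
    }


def build_dictionary(docs, min_docs=2, max_len=4):
    """Global statistics: keep candidates that survive local filtering in >= min_docs documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(drop_subsumed(repeated_strings(doc, max_len)).keys())
    return {s for s, df in doc_freq.items() if df >= min_docs}


def index_terms(text, dictionary, max_len=4):
    """Greedy longest match against the dictionary; unmatched characters become
    single-character terms, so complete segmentation is never required."""
    terms, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + n] in dictionary:
                terms.append(text[i:i + n])
                i += n
                break
        else:  # no dictionary entry starts at position i
            terms.append(text[i])
            i += 1
    return terms


docs = ["信息检索系统和信息检索方法", "信息检索很有用信息检索很重要"]
dictionary = build_dictionary(docs)
print(dictionary)
print(index_terms("中文信息检索", dictionary))
```

In this toy corpus, the local step removes fragments such as 信息 and 息检索 because they only occur inside the longer repeated string 信息检索, while the global step removes 息检索很, which repeats in only one document. A real corpus would also need punctuation handling and tuned thresholds.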