Accessor variety criteria for Chinese word extraction

  • Authors:
  • Haodi Feng; Kang Chen; Xiaotie Deng; Weimin Zheng

  • Affiliations:
  • Shandong University, City University of Hong Kong, School of Computer Science and Technology, Jinan, PRC / Department of Computer Science, Tat Chee Avenue, Kowloon, Hong Kong; Tsinghua University, Department of Computer Science and Technology, Peking, PR China; City University of Hong Kong, Department of Computer Science, Tat Chee Avenue, Kowloon, Hong Kong; Tsinghua University, Department of Computer Science and Technology, Peking, PR China

  • Venue:
  • Computational Linguistics
  • Year:
  • 2004

Abstract

We are interested in the problem of word extraction from Chinese text collections. We define a word to be a meaningful string composed of several Chinese characters. For example, the Chinese strings meaning 'percent' and 'more and more' are not recognized as traditional Chinese words by some people. However, in our work, they are words because they are very widely used and have specific meanings. We start with the viewpoint that a word is a distinguished linguistic entity that can be used in many different language environments. We consider the characters that are directly before a string (predecessors) and the characters that are directly after a string (successors) as important factors for determining the independence of the string. We call such characters accessors of the string, count the number of distinct predecessors and successors of a string in a large corpus (TREC 5 and TREC 6 documents), and use these counts to measure the context independence of the string from the rest of the sentences in the document. Our experiments confirm our hypothesis and show that this simple rule gives quite good results for Chinese word extraction and is comparable to, and for long words outperforms, other iterative methods.
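
The counting idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `accessor_counts` and `av_score`, the `max_len` cutoff, the toy sentences, and the choice to combine the two counts by taking their minimum are all assumptions made for illustration; the abstract only states that the numbers of distinct predecessors and successors are used. Sentence boundaries are simply ignored here.

```python
from collections import defaultdict


def accessor_counts(sentences, max_len=4):
    """Map each substring (up to max_len characters, a hypothetical cutoff)
    to a pair: (number of distinct predecessors, number of distinct successors)."""
    preds = defaultdict(set)
    succs = defaultdict(set)
    for sent in sentences:
        n = len(sent)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                s = sent[i:j]
                if i > 0:
                    preds[s].add(sent[i - 1])  # character directly before s
                if j < n:
                    succs[s].add(sent[j])      # character directly after s
    keys = set(preds) | set(succs)
    return {s: (len(preds[s]), len(succs[s])) for s in keys}


def av_score(counts, s):
    """Assumed scoring rule: the smaller of the two accessor counts."""
    left, right = counts.get(s, (0, 0))
    return min(left, right)


# Toy corpus (hypothetical, far smaller than the TREC collections).
sentences = ["我喜欢苹果", "他喜欢香蕉", "她喜欢苹果吗", "我们喜欢唱歌"]
counts = accessor_counts(sentences)

print(counts["喜欢"])          # (4, 3)
print(av_score(counts, "喜欢"))  # 3
print(av_score(counts, "喜"))   # 1
```

In this toy corpus the string 喜欢 appears after four different characters and before three different characters, so it scores higher than its fragment 喜, which is always followed by the same character. That contrast is the intuition behind treating accessor counts as a signal of context independence.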