A bottom-up merging algorithm for Chinese unknown word extraction

  • Authors:
  • Wei-Yun Ma;Keh-Jiann Chen

  • Affiliations:
  • Institute of Information Science, Academia Sinica; Institute of Information Science, Academia Sinica

  • Venue:
  • SIGHAN '03 Proceedings of the second SIGHAN workshop on Chinese language processing - Volume 17
  • Year:
  • 2003

Abstract

Statistical methods for extracting Chinese unknown words usually suffer from the problem that superfluous character strings with strong statistical associations are extracted as well. To solve this problem, this paper proposes using a set of general morphological rules to broaden coverage, while attaching different linguistic and statistical constraints to the rules to increase the precision of the representation. To disambiguate rule applications and reduce the complexity of rule matching, a bottom-up merging algorithm for extraction is proposed; it merges possible morphemes recursively by consulting the general rules above and dynamically decides which rule should be applied first according to the priorities of the rules. The effects of different priority strategies are compared in our experiments, and the results show that the performance of the proposed method is very promising.
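
The abstract gives no implementation details, so the following is only a minimal sketch of what priority-driven bottom-up merging could look like, not the authors' algorithm. The names `Rule` and `bottom_up_merge`, the numeric priorities, and the toy matching predicates are all hypothetical. In the paper, rules would carry linguistic and statistical constraints, whereas here each rule is simply a predicate over two adjacent units.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Rule:
    """Hypothetical merge rule: a name, a priority, and a constraint on a pair."""
    name: str
    priority: float
    matches: Callable[[str, str], bool]


def bottom_up_merge(morphemes: List[str], rules: List[Rule]) -> List[str]:
    """Repeatedly merge the adjacent pair licensed by the highest-priority rule.

    Assumption: ties are broken by leftmost position, then by rule order.
    Each pass shortens the sequence by one unit, so the loop terminates.
    """
    units = list(morphemes)
    while True:
        best: Optional[Tuple[float, int]] = None  # (priority, left index)
        for i in range(len(units) - 1):
            for rule in rules:
                if rule.matches(units[i], units[i + 1]):
                    if best is None or rule.priority > best[0]:
                        best = (rule.priority, i)
        if best is None:
            return units  # no rule applies anywhere: merging is finished
        i = best[1]
        units[i:i + 2] = [units[i] + units[i + 1]]  # merge the chosen pair


# Toy rules over placeholder symbols (not real linguistic constraints).
def make_rules(p_ab: float, p_bc: float) -> List[Rule]:
    return [
        Rule("merge-a-b", p_ab, lambda left, right: (left, right) == ("a", "b")),
        Rule("merge-b-c", p_bc, lambda left, right: (left, right) == ("b", "c")),
    ]


prefer_ab = bottom_up_merge(["a", "b", "c"], make_rules(2.0, 1.0))
prefer_bc = bottom_up_merge(["a", "b", "c"], make_rules(1.0, 2.0))
print(prefer_ab)
print(prefer_bc)
```

Both toy rules compete for the middle unit `b`, so swapping their priorities changes which merge happens first and therefore the final segmentation. This is the kind of effect that a comparison of priority strategies would measure.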