Disyllabic Chinese Word Extraction Based on Character Thesaurus and Semantic Constraints in Word-Formation

  • Authors:
  • Sun Maosong;Xu Dongliang;Benjamin K. T'Sou;Lu Huaming

  • Affiliations:
  • The State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Dept. of Computer Sci. & Tech., Tsinghua University, Beijing, C ...;The State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Dept. of Computer Sci. & Tech., Tsinghua University, Beijing, C ...;Language Information Sciences Research Center, City University of Hong Kong,;Beijing Information Science and Technology University, Beijing, China 100085

  • Venue:
  • TSD '08 Proceedings of the 11th international conference on Text, Speech and Dialogue
  • Year:
  • 2008

Abstract

This paper presents a novel approach to extracting disyllabic Chinese words based on the semantic information of characters. Two thesauri of Chinese characters, one manually crafted and one machine-generated, are constructed. A Chinese wordlist of 63,738 two-character words is then explored, together with the character thesauri, to learn the semantic constraints between characters in Chinese word-formation, resulting in two types of semantic-tag-based HMM. Experiments show that (1) both schemes outperform their character-based counterpart; (2) the machine-generated thesaurus outperforms the hand-crafted one to some extent in word extraction; and (3) a proper combination of semantic-tag-based and character-based methods could benefit word extraction.
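
To make the idea of semantic-tag-based constraints concrete, the following is a minimal sketch and not the authors' HMM. It swaps in a much simpler tag-bigram scorer: each character is mapped to semantic tags through a character thesaurus, tag-pair probabilities are estimated from a wordlist of two-character words, and a candidate character pair is scored by its best tag pair. The thesaurus entries, tag names, wordlist, and function names are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical toy character thesaurus: character -> set of semantic tags.
# The paper's thesauri and tag inventories are not reproduced here.
THESAURUS = {
    "山": {"NATURE"},
    "水": {"NATURE"},
    "河": {"NATURE"},
    "吃": {"ACTION"},
    "喝": {"ACTION"},
    "饭": {"FOOD"},
    "茶": {"FOOD"},
}

# Hypothetical toy wordlist standing in for the 63,738-word list.
WORDLIST = ["山水", "山河", "河水", "吃饭", "喝茶"]


def train_tag_bigrams(wordlist, thesaurus):
    """Estimate P(tag1, tag2) from two-character words.

    A word whose characters have several tags spreads its single
    count evenly over all of its possible tag pairs.
    """
    counts = Counter()
    for word in wordlist:
        if len(word) != 2:
            continue
        tags1 = thesaurus.get(word[0])
        tags2 = thesaurus.get(word[1])
        if not tags1 or not tags2:
            continue  # skip words with a character missing from the thesaurus
        pairs = list(product(tags1, tags2))
        for pair in pairs:
            counts[pair] += 1.0 / len(pairs)
    total = sum(counts.values())
    return {pair: count / total for pair, count in counts.items()}


def score(candidate, model, thesaurus):
    """Score a two-character candidate by its most probable tag pair."""
    tags1 = thesaurus.get(candidate[0], ())
    tags2 = thesaurus.get(candidate[1], ())
    return max(
        (model.get((a, b), 0.0) for a in tags1 for b in tags2),
        default=0.0,
    )


model = train_tag_bigrams(WORDLIST, THESAURUS)
```

In this toy setting, `score("吃茶", model, THESAURUS)` is non-zero even though 吃茶 is absent from the wordlist, because the ACTION + FOOD pattern was observed in other words; a purely character-based count would give it nothing. This generalisation over unseen character pairs is the kind of benefit the abstract attributes to the semantic-tag-based schemes.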