Automatic adaptation of annotation standards: Chinese word segmentation and POS tagging: a case study

  • Authors:
  • Wenbin Jiang; Liang Huang; Qun Liu

  • Affiliations:
  • Key Lab. of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; Google Research, Charleston Rd., Mountain View, CA; Key Lab. of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

  • Venue:
  • ACL '09 Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1
  • Year:
  • 2009

Abstract

Manually annotated corpora are valuable but scarce resources, yet for many annotation tasks such as treebanking and sequence labeling there exist multiple corpora with different and incompatible annotation guidelines or standards. This duplicates considerable human effort, which motivates automatically adapting one annotation standard to another. We present a simple yet effective strategy that transfers knowledge from a differently annotated corpus to the corpus with the desired annotation. We test the efficacy of this method on Chinese word segmentation and part-of-speech tagging, where no segmentation or POS tagging standard is widely accepted due to the lack of morphology in Chinese. Experiments show that adaptation from the much larger People's Daily corpus to the smaller but more popular Penn Chinese Treebank yields significant improvements in both segmentation and tagging accuracy (error reductions of 30.2% and 14%, respectively), which in turn helps improve Chinese parsing accuracy.
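
The error reductions quoted in the abstract are relative figures: the share of the baseline's errors that the adapted model removes. The abstract does not give the underlying scores, so the sketch below only illustrates the arithmetic. It assumes the scores lie in [0, 1]; the function name and the example numbers are hypothetical and are not taken from the paper.

```python
def relative_error_reduction(baseline_score: float, adapted_score: float) -> float:
    """Return the fraction of the baseline's errors removed by the adapted model.

    Both scores are assumed to lie in [0, 1] (e.g. accuracy or F-measure).
    """
    baseline_err = 1.0 - baseline_score
    adapted_err = 1.0 - adapted_score
    if baseline_err <= 0.0:
        raise ValueError("baseline has no errors to reduce")
    return (baseline_err - adapted_err) / baseline_err


# Hypothetical scores for illustration only; NOT results from the paper.
example = relative_error_reduction(0.90, 0.93)
print(f"relative error reduction: {example:.1%}")
```

Under this definition, a small absolute gain on an already strong baseline can correspond to a large relative reduction, which is why such figures are usually reported alongside the raw scores.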