A Linguistically Inspired Statistical Model for Chinese Punctuation Generation

Authors:
Yuqing Guo; Haifeng Wang; Josef van Genabith
Affiliations:
Toshiba (China) Research and Development Center; Toshiba (China) Research and Development Center; Dublin City University
Venue:
ACM Transactions on Asian Language Information Processing (TALIP)
Year:
2010

Abstract

This article investigates a relatively underdeveloped subject in natural language processing: the generation of punctuation marks. From a theoretical perspective, we study the 16 Chinese punctuation marks defined in the Chinese national standard of punctuation usage and categorize them into three types according to their syntactic properties. We implement a three-tier maximum entropy (MaxEnt) model incorporating linguistically motivated features to generate the commonly used Chinese punctuation marks in unpunctuated sentences output by a surface realizer. Furthermore, we present a method for automatically extracting cue words that indicate sentence-final punctuation marks, and use them as a specialized feature to construct a more precise model. Evaluated on the Penn Chinese Treebank data, the MaxEnt model achieves an f-score of 79.83% for punctuation insertion and 74.61% for punctuation restoration using gold data input, and 79.50% for insertion and 73.32% for restoration using imperfect parser-based input. The experiments show that the MaxEnt model significantly outperforms a baseline 5-gram language model, which scores 54.99% for punctuation insertion and 52.01% for restoration. We show that our results are not far from human performance on the same task, with human insertion f-scores in the range of 81-87% and human restoration f-scores in the range of 71-82%. Finally, a manual error analysis of the generation output shows that close to 40% of the mismatched punctuation marks are in fact acceptable choices, a fact obscured by the automatic evaluation scores based on string matching.
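To make the two reported scores concrete, the following is a minimal sketch of how insertion and restoration f-scores could be computed by string matching. It is not the authors' evaluation code. It assumes that "insertion" credits any mark placed at a position where the reference has one, while "restoration" additionally requires the mark itself to match. The `Marks` representation, the function names, and the example data are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical representation: map a gap index (the position after the
# i-th word) to the punctuation mark placed there.
Marks = Dict[int, str]


def f_score(correct: int, predicted: int, reference: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    if correct == 0 or predicted == 0 or reference == 0:
        return 0.0
    precision = correct / predicted
    recall = correct / reference
    return 2 * precision * recall / (precision + recall)


def evaluate(predicted: Marks, reference: Marks) -> Tuple[float, float]:
    """Return (insertion f-score, restoration f-score) for one sentence."""
    shared = predicted.keys() & reference.keys()
    # Insertion (assumed definition): any mark at a reference position counts.
    insertion_hits = len(shared)
    # Restoration (assumed definition): position and mark must both match.
    restoration_hits = sum(1 for i in shared if predicted[i] == reference[i])
    return (
        f_score(insertion_hits, len(predicted), len(reference)),
        f_score(restoration_hits, len(predicted), len(reference)),
    )


# Made-up example: the system puts an enumeration comma where the reference
# has a comma, gets the full stop right, and adds one spurious comma.
reference = {2: "，", 5: "。"}
predicted = {2: "、", 5: "。", 3: "，"}
insertion_f, restoration_f = evaluate(predicted, reference)
print(f"insertion f-score: {insertion_f:.2f}, restoration f-score: {restoration_f:.2f}")
```

Under these assumptions, a mark that differs from the reference but is still acceptable to a human reader is counted as a restoration error, which is the kind of mismatch the manual error analysis in the abstract addresses.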