This article investigates a relatively underdeveloped subject in natural language processing: the generation of punctuation marks. From a theoretical perspective, we study 16 Chinese punctuation marks as defined in the Chinese national standard of punctuation usage, and categorize them into three types according to their syntactic properties. We implement a three-tier maximum entropy (MaxEnt) model incorporating linguistically motivated features for generating the commonly used Chinese punctuation marks in unpunctuated sentences output by a surface realizer. Furthermore, we present a method to automatically extract cue words indicating sentence-final punctuation marks, which serve as a specialized feature for constructing a more precise model. Evaluated on the Penn Chinese Treebank data, the MaxEnt model achieves an f-score of 79.83% for punctuation insertion and 74.61% for punctuation restoration using gold data input, and 79.50% for insertion and 73.32% for restoration using parser-based imperfect input. The experiments show that the MaxEnt model significantly outperforms a baseline 5-gram language model, which scores 54.99% for punctuation insertion and 52.01% for restoration. We show that our results are not far from human performance on the same task, with human insertion f-scores in the range of 81-87% and human restoration f-scores in the range of 71-82%. Finally, a manual error analysis of the generation output shows that close to 40% of the mismatched punctuation marks are in fact acceptable choices, a fact obscured by the automatic string-matching-based evaluation scores.
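The abstract reports separate f-scores for insertion and restoration but does not spell out how each is computed. The sketch below is one plausible reading, not the authors' evaluation script: it assumes that insertion is scored on whether a mark is placed at the right position, and that restoration additionally requires the mark itself to match. The function name `punctuation_f_score`, the `match_type` flag, and the position-to-mark dictionary representation are all hypothetical choices made for illustration.

```python
def punctuation_f_score(gold, predicted, match_type=True):
    """Return (precision, recall, f_score) for predicted punctuation.

    gold, predicted: dicts mapping a token index to the punctuation mark
    placed after that token (a hypothetical representation).
    match_type=False counts a hit when the position matches (assumed
    "insertion" scoring); match_type=True also requires the same mark
    (assumed "restoration" scoring).
    """
    if match_type:
        correct = sum(1 for pos, mark in predicted.items()
                      if gold.get(pos) == mark)
    else:
        correct = sum(1 for pos in predicted if pos in gold)

    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy example: both positions are right, but one mark differs.
gold = {1: ",", 4: "。"}
predicted = {1: "、", 4: "。"}

insertion = punctuation_f_score(gold, predicted, match_type=False)
restoration = punctuation_f_score(gold, predicted, match_type=True)
```

In this toy case the position-only score is perfect while the exact-mark score drops to 0.5, which illustrates why a strict string match can penalize a prediction (here an enumeration comma in place of a regular comma) that a human reader might still judge acceptable.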