Structural features for predicting the linguistic quality of text: applications to machine translation, automatic summarization and human-authored text

Authors:
Ani Nenkova; Jieun Chae; Annie Louis; Emily Pitler
Affiliations:
University of Pennsylvania; University of Pennsylvania; University of Pennsylvania; University of Pennsylvania
Venue:
Empirical methods in natural language generation
Year:
2010

Abstract

Sentence structure is considered an important component of the overall linguistic quality of text, yet few empirical studies have sought to characterize how, and to what extent, structural features determine fluency and linguistic quality. We report the results of experiments on the predictive power of syntactic phrasing statistics and other structural features for these aspects of text. Manual assessments of sentence fluency from machine translation evaluation and of text quality from summarization evaluation serve as the gold standard. We find that many structural features related to phrase length are weakly but significantly correlated with fluency, and that classifiers based on the entire suite of structural features achieve high accuracy both in pairwise comparison of sentence fluency and in distinguishing machine translations from human translations. We also test the hypothesis that the learned models capture general fluency properties applicable to human-authored text; the results of our experiments do not support this hypothesis. At the same time, structural features and the models based on them prove robust for automatic evaluation of the linguistic quality of multidocument summaries.
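
As a rough illustration of what pairwise comparison of sentence fluency over structural features can look like, the sketch below trains a linear model on feature-difference vectors. It is not the authors' implementation: the abstract does not name the classifier, so a simple perceptron is swapped in, and the three features (sentence length, mean noun-phrase length, mean verb-phrase length) and all toy values are hypothetical, invented only to make the example run.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def difference(a: Vector, b: Vector) -> List[float]:
    """Element-wise difference between two feature vectors."""
    return [x - y for x, y in zip(a, b)]


def dot(w: Vector, v: Vector) -> float:
    """Inner product of a weight vector and a feature vector."""
    return sum(wi * vi for wi, vi in zip(w, v))


def train_pairwise_perceptron(
    pairs: List[Tuple[Vector, Vector]], epochs: int = 20
) -> List[float]:
    """Learn weights so that the first item of each pair scores higher.

    Each pair is (more_fluent, less_fluent). A perceptron is an assumed
    stand-in here; any linear classifier over difference vectors would do.
    """
    w = [0.0] * len(pairs[0][0])
    for _ in range(epochs):
        for better, worse in pairs:
            d = difference(better, worse)
            if dot(w, d) <= 0:  # pair is misranked, so update the weights
                w = [wi + di for wi, di in zip(w, d)]
    return w


def prefers_first(w: Vector, a: Vector, b: Vector) -> bool:
    """Return True if the model ranks sentence a as more fluent than b."""
    return dot(w, difference(a, b)) > 0


# Hypothetical features: [sentence length, mean NP length, mean VP length].
# The values are invented toy data, not results from the paper.
training_pairs = [
    ([12.0, 2.0, 4.0], [30.0, 6.0, 11.0]),
    ([15.0, 2.5, 5.0], [28.0, 5.0, 9.0]),
]
weights = train_pairwise_perceptron(training_pairs)
first_is_better = prefers_first(weights, [10.0, 2.0, 3.0], [25.0, 5.0, 8.0])
```

Working on differences turns the ranking problem into binary classification, which is one common way to set up pairwise comparison; the learned preference here reflects only the toy data, not any finding about which structures are more fluent.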