Mining Wikipedia Revision Histories for Improving Sentence Compression

  • Authors:
  • Elif Yamangil; Rani Nelken

  • Affiliations:
  • Harvard University, Cambridge, MA; Harvard University, Cambridge, MA

  • Venue:
  • HLT-Short '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers
  • Year:
  • 2008

Abstract

A well-recognized limitation of research on supervised sentence compression is the dearth of available training data. We propose a new and bountiful resource for such training data, which we obtain by mining the revision history of Wikipedia for sentence compressions and expansions. Using only a fraction of the available Wikipedia data, we have collected a training corpus of over 380,000 sentence pairs, two orders of magnitude larger than the standard Ziff-Davis corpus. Using this newly found data, we propose a novel lexicalized noisy channel model for sentence compression, achieving improved results on grammaticality and compression rate, with a slight decrease in importance.
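The abstract does not spell out the extraction procedure. The following is only an illustrative sketch, with hypothetical function names, of the general idea of mining compression pairs from revision histories: compare an earlier and a later revision of an article and keep sentence pairs in which the later sentence preserves an ordered subset of the earlier sentence's tokens (i.e., looks like a compression of it).

```python
import difflib
import re

def split_sentences(text):
    # Naive sentence splitter; a real pipeline would use a proper tokenizer.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def is_compression(long_sent, short_sent):
    """Heuristic: the short sentence keeps an ordered subset of the long
    sentence's tokens and drops at least one token."""
    long_toks, short_toks = long_sent.split(), short_sent.split()
    if not short_toks or len(short_toks) >= len(long_toks):
        return False
    matcher = difflib.SequenceMatcher(a=long_toks, b=short_toks)
    matched = sum(size for _, _, size in matcher.get_matching_blocks())
    return matched == len(short_toks)  # every short token aligns, in order

def mine_pairs(old_revision, new_revision):
    """Pair sentences across two revisions of an article, keeping
    (long, short) pairs that look like compressions."""
    pairs = []
    for old_s in split_sentences(old_revision):
        for new_s in split_sentences(new_revision):
            if is_compression(old_s, new_s):
                pairs.append((old_s, new_s))
    return pairs

if __name__ == "__main__":
    before = "The committee finally approved the controversial and ambitious plan."
    after = "The committee approved the plan."
    print(mine_pairs(before, after))
```

In practice such a heuristic would be applied across many revision pairs per article, and additional filtering (sentence alignment across edits, length ratios, deduplication) would be needed to reach corpus scale; the sketch only illustrates the pairing criterion.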