Mining Wikipedia Revision Histories for Improving Sentence Compression

  • Authors:
  • Elif Yamangil; Rani Nelken

  • Affiliations:
  • Harvard University, Cambridge, MA; Harvard University, Cambridge, MA

  • Venue:
  • HLT-Short '08 Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers
  • Year:
  • 2008

Abstract

A well-recognized limitation of research on supervised sentence compression is the dearth of available training data. We propose a new and bountiful resource for such training data, which we obtain by mining the revision history of Wikipedia for sentence compressions and expansions. Using only a fraction of the available Wikipedia data, we have collected a training corpus of over 380,000 sentence pairs, two orders of magnitude larger than the standard Ziff-Davis corpus. Using this newly found data, we propose a novel lexicalized noisy channel model for sentence compression, achieving improved results on grammaticality and compression rate, with a slight decrease in importance.
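The abstract does not spell out the extraction procedure. The following is only an illustrative sketch, with hypothetical function names, of the general idea of mining compression pairs from revision histories: compare an earlier and a later revision of an article and keep sentence pairs in which the later sentence preserves an ordered subset of the earlier sentence's tokens (i.e., looks like a compression of it).

```python
import difflib
import re

def split_sentences(text):
    # Naive sentence splitter; a real pipeline would use a proper tokenizer.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def is_compression(long_sent, short_sent):
    """Heuristic: the short sentence keeps an ordered subset of the long
    sentence's tokens and drops at least one token."""
    long_toks, short_toks = long_sent.split(), short_sent.split()
    if not short_toks or len(short_toks) >= len(long_toks):
        return False
    matcher = difflib.SequenceMatcher(a=long_toks, b=short_toks)
    matched = sum(size for _, _, size in matcher.get_matching_blocks())
    return matched == len(short_toks)  # every short token aligns, in order

def mine_pairs(old_revision, new_revision):
    """Pair sentences across two revisions of an article, keeping
    (long, short) pairs that look like compressions."""
    pairs = []
    for old_s in split_sentences(old_revision):
        for new_s in split_sentences(new_revision):
            if is_compression(old_s, new_s):
                pairs.append((old_s, new_s))
    return pairs

if __name__ == "__main__":
    before = "The committee finally approved the controversial and ambitious plan."
    after = "The committee approved the plan."
    print(mine_pairs(before, after))
```

In practice such a heuristic would be applied across many revision pairs per article, and additional filtering (sentence alignment across edits, length ratios, deduplication) would be needed to reach corpus scale; the sketch only illustrates the pairing criterion.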