Revisiting the case for explicit syntactic information in language models

  • Authors:
  • Ariya Rastrow; Sanjeev Khudanpur; Mark Dredze

  • Affiliations:
  • Johns Hopkins University, Baltimore, MD (all authors)

  • Venue:
  • WLM '12 Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT
  • Year:
  • 2012

Abstract

Statistical language models used in deployed systems for speech recognition, machine translation and other human language technologies are almost exclusively n-gram models. They are regarded as linguistically naïve, but estimating them from any amount of text, large or small, is straightforward. Furthermore, they have doggedly matched or outperformed numerous competing proposals for syntactically well-motivated models. This unusual resilience of n-grams, as well as their weaknesses, is examined here. It is demonstrated that n-grams are good word predictors, even linguistically speaking, in a large majority of word positions, and it is suggested that to improve over n-grams, one must explore syntax-aware (or other) language models that focus on positions where n-grams are weak.
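
The diagnostic idea in the abstract, locating word positions where an n-gram model predicts poorly, can be illustrated with a small sketch. This is not the paper's implementation: the trigram counting, the `weak_positions` helper, the top-k criterion, and the toy corpus below are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's method): train a raw trigram
# count model and flag positions where the true word falls outside its
# top-k continuations for the observed two-word history.
from collections import Counter, defaultdict

def train_trigram(sentences):
    """Map each two-word history (w1, w2) to a Counter of observed next words."""
    continuations = defaultdict(Counter)
    for sent in sentences:
        tokens = ["<s>", "<s>"] + sent + ["</s>"]
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            continuations[(w1, w2)][w3] += 1
    return continuations

def weak_positions(sentence, continuations, top_k=5):
    """Return indices of words that are not among the model's top-k guesses."""
    tokens = ["<s>", "<s>"] + sentence
    weak = []
    for i in range(2, len(tokens)):
        history = (tokens[i - 2], tokens[i - 1])
        ranked = [w for w, _ in continuations[history].most_common(top_k)]
        if tokens[i] not in ranked:
            weak.append(i - 2)  # index into the original sentence
    return weak

if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat".split(),
        "the dog sat on the rug".split(),
        "the cat chased the dog".split(),
    ]
    model = train_trigram(corpus)
    test = "the dog chased the cat".split()
    print(weak_positions(test, model))  # positions the trigram model handles poorly
```

On this toy data the model predicts the frequent, locally constrained positions well and misses the rest; the abstract's suggestion is that syntax-aware models should concentrate their effort on exactly those weak positions rather than compete with n-grams everywhere.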