Syntactic annotations for the Google Books Ngram Corpus

  • Authors:
  • Yuri Lin; Jean-Baptiste Michel; Erez Lieberman Aiden; Jon Orwant; Will Brockman; Slav Petrov

  • Affiliations:
  • Google Inc.; Google Inc.; Google Inc.; Google Inc.; Google Inc.; Google Inc.

  • Venue:
  • ACL '12 Proceedings of the ACL 2012 System Demonstrations
  • Year:
  • 2012

Abstract

We present a new edition of the Google Books Ngram Corpus, which describes how often words and phrases were used over a period of five centuries, in eight languages; it reflects 6% of all books ever published. This new edition introduces syntactic annotations: words are tagged with their part-of-speech, and head-modifier relationships are recorded. The annotations are produced automatically with statistical models that are specifically adapted to historical text. The corpus will facilitate the study of linguistic trends, especially those related to the evolution of syntax.
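
To make the two annotation types concrete, here is a minimal Python sketch of how a part-of-speech tagged word and a head-modifier relationship might be represented and parsed. The abstract does not specify any surface notation, so the `word_TAG` form (for example `burnt_VERB`), the `head=>modifier` form, and the names `TaggedToken`, `Dependency`, `parse_tagged`, and `parse_dependency` are all assumptions made for illustration, not the corpus's documented format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaggedToken:
    """A word with an optional part-of-speech tag (hypothetical structure)."""
    word: str
    pos: Optional[str]  # None when the token carries no tag


@dataclass(frozen=True)
class Dependency:
    """A head-modifier pair (hypothetical structure)."""
    head: str
    modifier: str


def parse_tagged(token: str, sep: str = "_") -> TaggedToken:
    """Split an assumed 'word_TAG' string at the last separator.

    The suffix is treated as a tag only if it is upper case, so ordinary
    words that contain the separator (e.g. 'well_known') stay untagged.
    """
    word, found, tag = token.rpartition(sep)
    if found and word and tag.isupper():
        return TaggedToken(word, tag)
    return TaggedToken(token, None)


def parse_dependency(pair: str, arrow: str = "=>") -> Dependency:
    """Split an assumed 'head=>modifier' string into its two parts."""
    head, found, modifier = pair.partition(arrow)
    if not found or not head or not modifier:
        raise ValueError(f"not a head-modifier pair: {pair!r}")
    return Dependency(head, modifier)
```

Under these assumptions, `parse_tagged("burnt_VERB")` yields a token with word `burnt` and tag `VERB`, while `parse_dependency("drink=>water")` yields a pair whose head is `drink` and whose modifier is `water`.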