Syntactic annotations for the Google Books Ngram Corpus

Authors:
Yuri Lin;Jean-Baptiste Michel;Erez Lieberman Aiden;Jon Orwant;Will Brockman;Slav Petrov
Affiliations:
Google Inc.;Google Inc.;Google Inc.;Google Inc.;Google Inc.;Google Inc.
Venue:
ACL '12 Proceedings of the ACL 2012 System Demonstrations
Year:
2012

Abstract

We present a new edition of the Google Books Ngram Corpus, which describes how often words and phrases were used over a period of five centuries, in eight languages; it reflects 6% of all books ever published. This new edition introduces syntactic annotations: words are tagged with their part-of-speech, and head-modifier relationships are recorded. The annotations are produced automatically with statistical models that are specifically adapted to historical text. The corpus will facilitate the study of linguistic trends, especially those related to the evolution of syntax.
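
To make the annotation scheme concrete, the following is a minimal sketch of how part-of-speech-tagged n-gram counts might be read and aggregated. It rests on assumptions not stated in the abstract: that each record is a tab-separated line of the form `ngram<TAB>year<TAB>match_count<TAB>volume_count`, and that a tag is attached to a word with an underscore (for example `burnt_VERB`). The helper names (`split_token`, `parse_line`, `counts_by_year`) and the sample counts are hypothetical and purely illustrative.

```python
from collections import defaultdict
from typing import NamedTuple


class NgramRecord(NamedTuple):
    tokens: tuple        # sequence of (word, tag) pairs; tag is None if untagged
    year: int
    match_count: int     # assumed: occurrences of the n-gram in that year
    volume_count: int    # assumed: number of books containing it in that year


def split_token(token: str):
    """Split 'word_TAG' into (word, tag); return (token, None) if no tag is found.

    Assumption: tags are upper-case alphabetic strings such as NOUN or VERB.
    """
    word, sep, tag = token.rpartition("_")
    if sep and word and tag.isupper():
        return word, tag
    return token, None


def parse_line(line: str) -> NgramRecord:
    """Parse one tab-separated record under the assumed four-column layout."""
    ngram, year, matches, volumes = line.rstrip("\n").split("\t")
    tokens = tuple(split_token(t) for t in ngram.split(" "))
    return NgramRecord(tokens, int(year), int(matches), int(volumes))


def counts_by_year(lines, word: str, tag: str) -> dict:
    """Sum match counts per year for a single tagged word (a 1-gram)."""
    totals = defaultdict(int)
    for line in lines:
        rec = parse_line(line)
        if rec.tokens == ((word, tag),):
            totals[rec.year] += rec.match_count
    return dict(totals)


# Made-up sample records, for illustration only.
sample = [
    "burnt_VERB\t1900\t120\t80",
    "burnt_ADJ\t1900\t45\t30",
    "burnt_VERB\t1950\t60\t50",
]
print(counts_by_year(sample, "burnt", "VERB"))
```

Separating the tag from the word at parse time lets the same word be tracked independently per part of speech, which is the kind of distinction the syntactic annotations are meant to support.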