When was it written? Automatically determining publication dates

  • Authors: Anne Garcia-Fernandez; Anne-Laure Ligozat; Marco Dinarelli; Delphine Bernhard

  • Affiliations: CEA-LIST, DIASI-LVIC lab at Fontenay-Aux-Roses and LIMSI-CNRS, Orsay, France; LIMSI-CNRS, Orsay and ENSIIE, Evry, France; LIMSI-CNRS, Orsay, France; LIMSI-CNRS, Orsay, France

  • Venue: SPIRE'11: Proceedings of the 18th international conference on String processing and information retrieval
  • Year: 2011

Abstract

Automatically determining the publication date of a document is a complex task, since a document may contain only a few intratextual hints about its publication date. Yet it has important applications: the number of digitized historical documents is constantly increasing, but their publication dates are not always properly identified during OCR acquisition. Accurate knowledge of publication dates is crucial for, e.g., studying the evolution of document topics over a given period of time. In this article, we present a method for automatically determining the publication dates of documents, which was evaluated on a French newspaper corpus in the context of the DEFT 2011 evaluation campaign. Our system combines several individual systems, relying on both supervised and unsupervised learning, and uses several external resources, e.g. Wikipedia, Google Books Ngrams, and etymological background knowledge about the French language. Our system detects the correct year of publication in 10% of the cases for 300-word excerpts and in 14% of the cases for 500-word excerpts, which is very promising given the complexity of the task.