When Was It Written? Automatically Determining Publication Dates

Authors:
Anne Garcia-Fernandez;Anne-Laure Ligozat;Marco Dinarelli;Delphine Bernhard
Affiliations:
CEA-LIST, DIASI-LVIC lab at Fontenay-Aux-Roses and LIMSI-CNRS, Orsay, France;LIMSI-CNRS, Orsay and ENSIIE, Evry, France;LIMSI-CNRS, Orsay, France;LIMSI-CNRS, Orsay, France
Venue:
SPIRE'11 Proceedings of the 18th international conference on String processing and information retrieval
Year:
2011

Abstract

Automatically determining the publication date of a document is a complex task, since a document may contain only a few intratextual hints about its publication date. Yet it has many important applications. The number of digitized historical documents is constantly increasing, but their publication dates are not always properly identified during OCR acquisition. Accurate knowledge of publication dates is crucial for many applications, e.g. studying the evolution of document topics over a given period of time. In this article, we present a method for automatically determining the publication dates of documents, which was evaluated on a French newspaper corpus in the context of the DEFT 2011 evaluation campaign. Our system combines several individual systems, relying on both supervised and unsupervised learning, and uses several external resources, e.g. Wikipedia, Google Books Ngrams, and etymological background knowledge about the French language. Our system detects the correct year of publication in 10% of the cases for 300-word excerpts and in 14% of the cases for 500-word excerpts, which is very promising given the complexity of the task.
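
The abstract does not say how the individual systems are combined or how the scores are computed, so the following is only a minimal illustrative sketch, not the authors' method. It assumes that each individual system returns a score per candidate year, that scores are normalised per system and merged by a weighted sum, and that evaluation counts exact year matches. The function names `combine_year_scores` and `exact_year_accuracy`, the dictionary format, and the weights are all hypothetical.

```python
from collections import defaultdict


def combine_year_scores(system_scores, weights=None):
    """Pick one year from the per-year scores of several systems.

    system_scores: list of dicts mapping candidate year -> non-negative score
                   (hypothetical output format, one dict per system).
    weights: optional list of floats, one per system (assumed; equal by default).
    """
    if weights is None:
        weights = [1.0] * len(system_scores)
    combined = defaultdict(float)
    for scores, weight in zip(system_scores, weights):
        total = sum(scores.values())
        if total == 0:
            continue  # this system provides no evidence
        for year, score in scores.items():
            # Normalise per system so that score scales are comparable.
            combined[year] += weight * score / total
    if not combined:
        raise ValueError("no system produced a usable score")
    # Highest combined score wins; ties go to the earliest year.
    return min(combined, key=lambda year: (-combined[year], year))


def exact_year_accuracy(predicted, gold):
    """Fraction of documents whose predicted year equals the gold year."""
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)
```

With equal weights, two systems that disagree are resolved by whichever has the more peaked distribution after normalisation; raising one system's weight lets it dominate. Exact-match accuracy is the kind of measure behind figures such as "the correct year in 10% of the cases".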