Automatically determining the publication date of a document is a complex task, since a document may contain only a few intratextual hints about its publication date. Yet it has many important applications: the number of digitized historical documents is constantly increasing, but their publication dates are not always properly identified during OCR acquisition, and accurate knowledge of publication dates is crucial for tasks such as studying the evolution of document topics over a certain period of time. In this article, we present a method for automatically determining the publication dates of documents, which was evaluated on a French newspaper corpus in the context of the DEFT 2011 evaluation campaign. Our system combines several individual systems, relying on both supervised and unsupervised learning, and uses several external resources, e.g. Wikipedia, Google Books Ngrams, and etymological background knowledge about the French language. It detects the correct year of publication in 10% of the cases for 300-word excerpts and in 14% of the cases for 500-word excerpts, which is very promising given the complexity of the task.