Reassembling multilingual temporal news datasets with incomplete information

  • Authors: Calum S. Robertson
  • Affiliations: The University of New South Wales, Sydney, NSW, Australia
  • Venue: AusDM '11 Proceedings of the Ninth Australasian Data Mining Conference - Volume 121
  • Year: 2011

Abstract

Institutional investors are building increasingly sophisticated algorithmic trading engines that account for textual as well as numerical information. To train these engines, they need large datasets with highly accurate timestamps, covering long periods with differing trading conditions. As a result, demand is growing for temporal news datasets that reach back beyond the point where full archives are available. Rebuilding the actual temporal news dataset that was transmitted to the market relies on merging multiple datasets, each with incomplete information and sometimes questionable quality. Doing so requires near-duplicate detection in a very large dataset that includes news in many languages. This research is novel in that, in our scenario, the language of any given news article is unknown. In this paper we describe a language-independent near-duplicate detection algorithm and demonstrate its performance on a dataset of tens of millions of news messages in over 20 languages, amounting to hundreds of gigabytes of content.
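
The abstract does not say how the near-duplicate detection works, so the following Python sketch is only an illustration of one common language-agnostic approach, not the paper's algorithm: compare texts by the Jaccard similarity of their character n-gram sets, which needs no tokeniser or language identification. The function names, the n-gram length of 5, and the 0.8 threshold are all assumptions chosen for the example.

```python
from __future__ import annotations

import unicodedata


def char_shingles(text: str, n: int = 5) -> set[str]:
    """Return the set of overlapping character n-grams of the normalised text."""
    # Normalise Unicode form and case, and collapse runs of whitespace.
    # Nothing here depends on knowing the language of the text.
    normalised = " ".join(unicodedata.normalize("NFKC", text).casefold().split())
    if len(normalised) < n:
        # Text shorter than one shingle: use the whole string (or nothing if empty).
        return {normalised} if normalised else set()
    return {normalised[i:i + n] for i in range(len(normalised) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(x: str, y: str, threshold: float = 0.8, n: int = 5) -> bool:
    """Hypothetical decision rule: similarity at or above an assumed threshold."""
    return jaccard(char_shingles(x, n), char_shingles(y, n)) >= threshold
```

Comparing every pair of messages this way would not scale to tens of millions of items; at that size, shingle sets are typically summarised with a sketching scheme such as MinHash and candidate pairs are found with locality-sensitive hashing before any exact comparison.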