Reassembling multilingual temporal news datasets with incomplete information

Authors:
Calum S. Robertson
Affiliations:
The University of New South Wales, Sydney, NSW, Australia
Venue:
AusDM '11 Proceedings of the Ninth Australasian Data Mining Conference - Volume 121
Year:
2011

Abstract

Institutional investors are building increasingly sophisticated algorithmic trading engines that account for textual as well as numerical information. To train these engines they need large datasets with highly accurate timestamps that cover long periods with differing trading conditions. Demand is therefore growing for temporal news datasets that extend beyond the point where full archives are available. Rebuilding the temporal news dataset that was actually transmitted to the market relies on merging multiple datasets, each with incomplete information and sometimes questionable quality. Doing so requires near-duplicate detection in a very large dataset that includes news in many languages. This research is novel in that, in our scenario, the language of any given news article is unknown. In this paper we describe a language-independent near-duplicate detection algorithm and demonstrate its performance on a dataset of tens of millions of news messages in over 20 languages, comprising hundreds of gigabytes of content.
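
The abstract does not specify the algorithm, so the following is only an illustrative sketch of one common way to compare texts without knowing their language: character n-gram shingling with Jaccard similarity. It is not the method described in the paper. The function names, the shingle length `n=5`, and the `threshold=0.8` cutoff are all assumptions chosen for illustration.

```python
import unicodedata


def shingles(text, n=5):
    """Return the set of character n-grams of a text.

    Working on characters rather than words avoids any
    language-specific tokenization (an assumption of this sketch).
    """
    # Normalize Unicode forms, fold case, and collapse whitespace runs.
    normalized = " ".join(unicodedata.normalize("NFKC", text).casefold().split())
    if len(normalized) <= n:
        return {normalized}
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, n=5, threshold=0.8):
    """Flag two texts as near duplicates when their shingle sets overlap enough."""
    return jaccard(shingles(text_a, n), shingles(text_b, n)) >= threshold
```

Comparing every pair of messages this way would not scale to tens of millions of items; at that size a pairwise check like this would normally sit behind a candidate-generation step such as hashing or indexing, which this sketch leaves out.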