Institutional investors are building increasingly sophisticated algorithmic trading engines that account for textual as well as numerical information. To train these engines, they need large datasets with highly accurate timestamps that cover long periods with differing trading conditions. Thus, demand is increasing for temporal news datasets that extend beyond the point where full archives are available. Rebuilding the actual temporal news dataset that was transmitted to the market relies on merging multiple datasets, each with incomplete information and sometimes questionable quality. Doing so requires near-duplicate detection in a very large dataset that includes news in many languages. The novelty of this research is that, in our scenario, the language of any given news article is unknown. In this paper we describe a language-independent near-duplicate detection algorithm and demonstrate its performance on a dataset of tens of millions of news messages in over 20 languages, amounting to hundreds of gigabytes of content.
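The abstract does not describe the algorithm itself, so the following is only a generic illustration of what language-independent near-duplicate detection can look like, not the authors' method. It assumes character n-gram shingling compared with Jaccard similarity, a common technique that needs no tokenizer or language identification. The function names, the shingle length of 5, and the 0.8 threshold are all hypothetical choices made for this sketch.

```python
from __future__ import annotations

import unicodedata


def char_shingles(text: str, n: int = 5) -> set[str]:
    """Return the set of overlapping character n-grams of a normalized text.

    Normalization is Unicode NFKC, case folding, and whitespace collapsing,
    none of which depends on knowing the language (assumption of this sketch).
    """
    normalized = unicodedata.normalize("NFKC", text).casefold()
    normalized = " ".join(normalized.split())
    if len(normalized) <= n:
        # Very short texts become a single shingle (or none if empty).
        return {normalized} if normalized else set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(x: str, y: str, threshold: float = 0.8, n: int = 5) -> bool:
    """Flag two messages as near duplicates when their shingle sets overlap enough.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    return jaccard(char_shingles(x, n), char_shingles(y, n)) >= threshold
```

Comparing every pair of messages this way would not scale to tens of millions of items; a system at that scale would typically need some form of hashing or indexing to limit the candidate pairs, which this sketch leaves out.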