First story detection (FSD) is the task of identifying the first story about each event in a continuous stream of documents. A major problem in this task is the high degree of lexical variation across documents, which makes it very difficult to detect stories that discuss the same event but use different words. We propose using paraphrases to alleviate this problem, making this the first work to use paraphrases for FSD. We show a novel way of integrating paraphrases with locality-sensitive hashing (LSH) to obtain an efficient FSD system that scales to very large datasets. Our system achieves state-of-the-art results on the first story detection task, beating both the best supervised and unsupervised systems. To test our approach on large data, we construct a corpus of events for Twitter, consisting of 50 million documents, and show that paraphrasing is also beneficial in this domain.
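
To make the general idea concrete, the following is a minimal Python sketch of paraphrase-aware FSD on top of random-hyperplane LSH. It is not the method of the paper, whose integration details the abstract does not give. Everything here is an illustrative assumption: the toy `PARAPHRASES` table, the `expand` weighting of 0.5, the `LshIndex` class, and the parameter values are all hypothetical. Each document becomes a term-weight vector, paraphrases of its terms are added at reduced weight, and the novelty score is one minus the highest cosine similarity to any earlier document that shares an LSH bucket.

```python
import math
import random
from collections import defaultdict

# Hypothetical paraphrase table; a real system would load a large learned resource.
PARAPHRASES = {
    "quake": ["earthquake"],
    "earthquake": ["quake"],
    "hits": ["strikes"],
    "strikes": ["hits"],
}


def expand(tokens, weight=0.5):
    """Build a term-weight vector, adding each token's paraphrases at a reduced weight."""
    vec = defaultdict(float)
    for tok in tokens:
        vec[tok] += 1.0
        for para in PARAPHRASES.get(tok, []):
            vec[para] += weight
    return dict(vec)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class LshIndex:
    """Random-hyperplane LSH over sparse vectors (assumed design, not the paper's)."""

    def __init__(self, num_bits=8, num_tables=4, seed=0):
        self.num_bits = num_bits
        self.num_tables = num_tables
        self.seed = seed
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self._components = {}  # term -> that term's coordinate in every hyperplane

    def _term_components(self, term):
        # Hyperplane coordinates are generated lazily per term, so the
        # vocabulary does not need to be known before the stream starts.
        if term not in self._components:
            rng = random.Random(f"{self.seed}:{term}")
            count = self.num_bits * self.num_tables
            self._components[term] = [rng.gauss(0.0, 1.0) for _ in range(count)]
        return self._components[term]

    def _signatures(self, vec):
        totals = [0.0] * (self.num_bits * self.num_tables)
        for term, w in vec.items():
            for i, c in enumerate(self._term_components(term)):
                totals[i] += w * c
        bits = [p >= 0.0 for p in totals]
        nb = self.num_bits
        return [tuple(bits[t * nb:(t + 1) * nb]) for t in range(self.num_tables)]

    def score_and_add(self, vec):
        """Return a novelty score in [0, 1] for vec, then index it."""
        sigs = self._signatures(vec)
        best = 0.0
        for table, sig in zip(self.tables, sigs):
            for other in table[sig]:
                best = max(best, cosine(vec, other))
        for table, sig in zip(self.tables, sigs):
            table[sig].append(vec)
        return 1.0 - best


if __name__ == "__main__":
    index = LshIndex()
    for text in ["quake hits city", "earthquake strikes city"]:
        print(text, "->", round(index.score_and_add(expand(text.split())), 3))
```

In this toy setup, the two example headlines share only one word, so their unexpanded cosine similarity is low; after expansion the vectors overlap on every term. Whether the second headline actually meets the first in a bucket depends on the random hyperplanes, which is the usual trade-off LSH makes between speed and recall. A document scoring above some chosen threshold would be flagged as a first story.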