Building a large-scale corpus for evaluating event detection on Twitter

Authors:
Andrew J. McMinn;Yashar Moshfeghi;Joemon M. Jose
Affiliations:
University of Glasgow, Glasgow, United Kingdom;University of Glasgow, Glasgow, United Kingdom;University of Glasgow, Glasgow, United Kingdom
Venue:
Proceedings of the 22nd ACM International Conference on Information & Knowledge Management
Year:
2013

Abstract

Despite the popularity of Twitter for research, there are very few publicly available corpora, and those which are available are either too small or unsuitable for tasks such as event detection. This is partially due to a number of issues associated with the creation of Twitter corpora, including restrictions on the distribution of tweets and the difficulty of creating relevance judgements at such a large scale. Creating relevance judgements for the task of event detection is further hampered by ambiguity in the definition of an event. In this paper, we propose a methodology for the creation of an event detection corpus. Specifically, we first create a new corpus that covers a period of 4 weeks and contains over 120 million tweets, which we make available for research. We then propose a definition of event which fits the characteristics of Twitter and, using this definition, generate a set of relevance judgements aimed specifically at the task of event detection. To do so, we use existing state-of-the-art event detection approaches and Wikipedia to generate a set of candidate events with associated tweets. We then use crowdsourcing to gather relevance judgements, and discuss the quality of the results, including how we ensured integrity and prevented spam. As a result of this process, along with our Twitter corpus, we release relevance judgements containing over 150,000 tweets and covering more than 500 events, which can be used for the evaluation of event detection approaches.
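To illustrate how relevance judgements of this kind might be used to evaluate an event detection approach, the following is a minimal sketch in Python. It is not the evaluation procedure from the paper: the data layout (a mapping from event identifiers to sets of tweet IDs), the function name `evaluate_clusters`, and the matching rule (a detected cluster counts as a hit if it shares at least `min_overlap` tweets with a judged event) are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def evaluate_clusters(
    judgements: Dict[str, Set[int]],
    clusters: List[Set[int]],
    min_overlap: int = 1,
) -> Dict[str, float]:
    """Score detected clusters against judged events.

    Hypothetical matching rule: a cluster matches an event when they
    share at least `min_overlap` tweet IDs.
    """
    detected_events = set()
    matched_clusters = 0
    for cluster in clusters:
        # Judged events that this cluster overlaps with sufficiently.
        hits = [
            event_id
            for event_id, tweet_ids in judgements.items()
            if len(cluster & tweet_ids) >= min_overlap
        ]
        if hits:
            matched_clusters += 1
            detected_events.update(hits)
    # Precision: fraction of detected clusters matching some judged event.
    precision = matched_clusters / len(clusters) if clusters else 0.0
    # Recall: fraction of judged events matched by at least one cluster.
    recall = len(detected_events) / len(judgements) if judgements else 0.0
    return {"precision": precision, "recall": recall}


# Toy data, invented purely for illustration.
judgements = {"event_a": {1, 2, 3}, "event_b": {10, 11}}
clusters = [{1, 2, 99}, {50, 51}]
scores = evaluate_clusters(judgements, clusters)
print(scores)
```

On this toy input, one of the two clusters matches a judged event and one of the two judged events is found, so both precision and recall are 0.5. A stricter `min_overlap` would make the matching rule more conservative.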