On building a reusable Twitter corpus

Authors:
Richard McCreadie;Ian Soboroff;Jimmy Lin;Craig Macdonald;Iadh Ounis;Dean McCullough
Affiliations:
University of Glasgow, Glasgow, United Kingdom;National Institute of Standards and Technology, Gaithersburg, MD, USA;University of Maryland, College Park, MD, USA;University of Glasgow, Glasgow, United Kingdom;University of Glasgow, Glasgow, United Kingdom;National Institute of Standards and Technology, Gaithersburg, MD, USA
Venue:
SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval
Year:
2012

Abstract

The Twitter real-time information network is the subject of research for information retrieval tasks such as real-time search. However, so far, reproducible experimentation on Twitter data has been impeded by restrictions imposed by the Twitter terms of service. In this paper, we detail a new methodology for legally building and distributing Twitter corpora, developed through collaboration between the Text REtrieval Conference (TREC) and Twitter. In particular, we detail how the first publicly available Twitter corpus, referred to as Tweets2011, was distributed via lists of tweet identifiers and specialist tweet crawling software. Furthermore, we analyse whether this distribution approach remains robust over time, as tweets in the corpus are removed either by users or by Twitter itself. Tweets2011 was successfully used by 58 participating groups for the TREC 2011 Microblog track, and our results attest to the robustness of the crawling methodology over time.
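
The identifier-based distribution described in the abstract can be illustrated with a minimal sketch. This is not the TREC crawling software: the names `rehydrate`, `retention_rate`, and the `Fetcher` type are hypothetical, and the in-memory dictionary is an invented stand-in for the live service. It only shows the general idea of rebuilding a corpus from a list of tweet identifiers and measuring how much of it is still retrievable.

```python
from typing import Callable, Dict, Iterable, Optional

# Hypothetical fetcher type (an assumption for illustration): returns the
# tweet text, or None if the tweet can no longer be retrieved.
Fetcher = Callable[[str], Optional[str]]


def rehydrate(tweet_ids: Iterable[str], fetch: Fetcher) -> Dict[str, Optional[str]]:
    """Rebuild a local copy of a corpus from a distributed list of tweet identifiers."""
    return {tweet_id: fetch(tweet_id) for tweet_id in tweet_ids}


def retention_rate(corpus: Dict[str, Optional[str]]) -> float:
    """Fraction of listed identifiers whose tweets could still be fetched."""
    if not corpus:
        return 0.0
    available = sum(1 for text in corpus.values() if text is not None)
    return available / len(corpus)


if __name__ == "__main__":
    # Toy stand-in for the live service: identifiers 102 and 104 have been removed.
    live = {"101": "first tweet", "103": "third tweet"}
    corpus = rehydrate(["101", "102", "103", "104"], live.get)
    print(retention_rate(corpus))  # prints 0.5
```

Repeating the same measurement at later dates, against the same identifier list, would give one simple view of how a corpus distributed this way degrades as tweets are removed.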