Groundhog day: near-duplicate detection on Twitter

Authors:
Ke Tao;Fabian Abel;Claudia Hauff;Geert-Jan Houben;Ujwal Gadiraju
Affiliations:
Delft University of Technology, Delft, Netherlands;XING AG, Hamburg, Germany;Delft University of Technology, Delft, Netherlands;Delft University of Technology, Delft, Netherlands;Delft University of Technology, Delft, Netherlands
Venue:
Proceedings of the 22nd international conference on World Wide Web
Year:
2013

Abstract

With more than 340 million messages posted on Twitter every day, both the amount of duplicate content and the demand for appropriate duplicate detection mechanisms are increasing tremendously. Yet little research aims at detecting near-duplicate content on microblogging platforms. We investigate the problem of near-duplicate detection on Twitter and introduce a framework that analyzes tweets by comparing (i) syntactical characteristics, (ii) semantic similarity, and (iii) contextual information. Our framework provides different duplicate detection strategies that, among others, make use of external Web resources referenced from microposts. Machine learning is used to learn patterns that help identify duplicate content. We put our duplicate detection framework into practice by integrating it into Twinder, a search engine for Twitter streams. An in-depth analysis shows that it allows Twinder to diversify search results and improve the quality of Twitter search. We conduct extensive experiments in which we (1) evaluate the quality of different strategies for detecting duplicates, (2) analyze the impact of various features on duplicate detection, (3) investigate the quality of strategies that classify to what exact level two microposts can be considered duplicates, and (4) optimize the process of identifying duplicate content on Twitter. Our results show that semantic features extracted by our framework can boost the performance of duplicate detection.
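To make the idea of comparing tweets by syntactical characteristics and referenced Web resources more concrete, the following is a minimal sketch of pairwise features that a classifier could consume. It is an illustration only, not the feature set used in the paper: the function names (`tokens`, `jaccard`, `edit_distance`, `pair_features`), the tokenization rules, and the three features chosen are all assumptions made for this example.

```python
import re

# Assumed patterns for this sketch; real tweet tokenization is more involved.
URL_RE = re.compile(r"https?://\S+")
TOKEN_RE = re.compile(r"[#@]?\w+")


def tokens(text):
    """Return the set of lowercased words, hashtags, and mentions (URLs removed)."""
    return set(TOKEN_RE.findall(URL_RE.sub(" ", text.lower())))


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def edit_distance(s, t):
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]


def pair_features(a, b):
    """Hypothetical feature vector for a pair of tweets (illustrative only)."""
    longest = max(len(a), len(b)) or 1
    return {
        "word_overlap": jaccard(tokens(a), tokens(b)),
        "char_similarity": 1.0 - edit_distance(a.lower(), b.lower()) / longest,
        "shared_url": bool(set(URL_RE.findall(a)) & set(URL_RE.findall(b))),
    }
```

Feature dictionaries of this kind could be computed for labeled tweet pairs and passed to any standard classifier. The semantic and contextual signals mentioned in the abstract would need additional resources and are not covered by this sketch.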