With more than 340 million messages posted on Twitter every day, both the amount of duplicate content and the demand for appropriate duplicate detection mechanisms are increasing tremendously. Yet little research aims at detecting near-duplicate content on microblogging platforms. We investigate the problem of near-duplicate detection on Twitter and introduce a framework that analyzes tweets by comparing (i) syntactical characteristics, (ii) semantic similarity, and (iii) contextual information. Our framework provides different duplicate detection strategies that, among others, make use of external Web resources referenced from microposts. Machine learning is used to learn patterns that help identify duplicate content. We put our duplicate detection framework into practice by integrating it into Twinder, a search engine for Twitter streams. An in-depth analysis shows that it allows Twinder to diversify search results and improve the quality of Twitter search. We conduct extensive experiments in which we (1) evaluate the quality of different strategies for detecting duplicates, (2) analyze the impact of various features on duplicate detection, (3) investigate the quality of strategies that classify to what exact level two microposts can be considered duplicates, and (4) optimize the process of identifying duplicate content on Twitter. Our results show that semantic features extracted by our framework can boost the performance of duplicate detection.
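
To make the three feature groups more concrete, the following is a minimal sketch of how pairwise features for two tweets might be computed. It is not the framework's actual implementation: the `Tweet` class, its fields, the function names, and the specific features (word overlap, shared URL, entity overlap, time gap) are all assumptions chosen for illustration. In particular, the `entities` field assumes that some external entity-extraction step has already been run.

```python
import re
from dataclasses import dataclass


@dataclass
class Tweet:
    """Hypothetical tweet record; field names are illustrative assumptions."""
    text: str
    timestamp: float                   # seconds since the epoch
    entities: frozenset = frozenset()  # assumed output of an external entity extractor


URL_RE = re.compile(r"https?://\S+")


def tokens(text: str) -> set:
    """Lowercased word tokens with URLs removed."""
    return set(re.findall(r"[a-z0-9#@']+", URL_RE.sub(" ", text.lower())))


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_features(t1: Tweet, t2: Tweet) -> dict:
    """Illustrative syntactic, semantic, and contextual features for a tweet pair."""
    urls1 = set(URL_RE.findall(t1.text))
    urls2 = set(URL_RE.findall(t2.text))
    return {
        # syntactic: overlap of surface word tokens
        "word_overlap": jaccard(tokens(t1.text), tokens(t2.text)),
        # syntactic: do both tweets reference the same external Web resource?
        "shared_url": 1.0 if urls1 & urls2 else 0.0,
        # semantic: overlap of pre-extracted entities (assumed to be available)
        "entity_overlap": jaccard(set(t1.entities), set(t2.entities)),
        # contextual: how far apart in time the tweets were posted
        "time_gap_hours": abs(t1.timestamp - t2.timestamp) / 3600.0,
    }


if __name__ == "__main__":
    a = Tweet("Earthquake hits Japan http://example.com/a", 0.0, frozenset({"Japan"}))
    b = Tweet("earthquake hits japan! http://example.com/a", 3600.0, frozenset({"Japan"}))
    print(pair_features(a, b))
```

A feature dictionary of this kind could then be passed to a standard classifier trained on labeled tweet pairs, which is one plausible reading of how machine learning is used to learn duplicate patterns.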