Scaling pair-wise similarity-based algorithms in tagging spaces

Authors:
Damir Vandic; Flavius Frasincar; Frederik Hogenboom
Affiliations:
Erasmus University Rotterdam, Rotterdam, The Netherlands (all authors)
Venue:
ICWE '12: Proceedings of the 12th International Conference on Web Engineering
Year:
2012

Abstract

Users of Web tag spaces, e.g., Flickr, find it difficult to get adequate search results due to syntactic and semantic tag variations. In most approaches that address this problem, the cosine similarity between tags plays a major role. However, the use of this similarity introduces a scalability problem as the number of similarities that need to be computed grows quadratically with the number of tags. In this paper, we propose a novel algorithm that filters insignificant cosine similarities in linear time complexity with respect to the number of tags. Our approach shows a significant reduction in the number of calculations, which makes it possible to process larger tag data sets than ever before. To evaluate our approach, we used a data set containing 51 million pictures and 112 million tag annotations from Flickr.
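
To make the scalability problem concrete, the following is a minimal sketch of the brute-force baseline that the abstract describes, not the paper's filtering algorithm. It assumes each tag is represented as a sparse co-occurrence vector stored in a dict; the names `cosine`, `all_pairs`, `pair_count`, the threshold value, and the toy data are all hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def pair_count(n):
    """Number of unordered tag pairs for n tags: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def all_pairs(vectors, threshold):
    """Brute force: compare every pair of tags, keep similarities >= threshold."""
    kept = {}
    comparisons = 0
    for a, b in combinations(sorted(vectors), 2):
        comparisons += 1
        s = cosine(vectors[a], vectors[b])
        if s >= threshold:
            kept[(a, b)] = s
    return kept, comparisons


# Toy co-occurrence vectors (invented data, not taken from the Flickr data set).
tag_vectors = {
    "nyc": {"skyline": 2, "manhattan": 1},
    "newyork": {"skyline": 2, "manhattan": 1},
    "cat": {"pet": 3},
    "kitten": {"pet": 1, "cute": 1},
}

kept, comparisons = all_pairs(tag_vectors, threshold=0.5)
print(comparisons, sorted(kept))
print(pair_count(1_000), pair_count(1_000_000))
```

With four tags the baseline already performs six comparisons, most of which fall below the threshold and are discarded. Because `pair_count` grows quadratically, a million tags would require roughly 5 × 10^11 comparisons, which is the cost that a linear-time filter of insignificant similarities is meant to avoid.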