Computer Networks: The International Journal of Computer and Telecommunications Networking
Peer-to-peer (p2p) networks are increasingly used as a data source for information retrieval (IR) applications, including similarity estimation, content recommendation and trend prediction. However, these networks are usually extremely large and noisy, which raises doubts about whether sufficiently accurate information can actually be extracted from them. This paper quantifies the measurement effort required to collect and refine p2p data for IR applications. We identify and measure inherent difficulties in collecting p2p data, namely partial crawling, user-generated noise, sparseness, and the popularity and localization of content and search queries. These aspects are quantified using music files shared in the Gnutella p2p network. We show that the power-law nature of the network makes it possible to capture an accurate view of the popular content with relatively little effort. However, some applications, such as trend prediction, require data from the "long tail", and hence a much more exhaustive crawl. Furthermore, we show that content and search queries are highly localized, so conclusions that hold across locations require a geographically widespread crawl. Finally, we present techniques for overcoming noise originating from user-generated content and for filtering non-informative data while minimizing information loss.
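The contrast between popular content and the long tail can be illustrated with a small calculation. The sketch below is not the paper's measurement method: it assumes, purely for illustration, that item popularity follows a Zipf distribution with exponent 1 and that a partial crawl behaves like a number of independent draws from that distribution. The function names, the catalogue size of 10,000 items, and the sample counts are all hypothetical.

```python
def zipf_probabilities(num_items, exponent=1.0):
    """Normalized Zipf popularity: p_r is proportional to 1 / r**exponent.

    Assumption for illustration only; the real popularity curve must be
    measured from the network.
    """
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_coverage(probabilities, num_samples):
    """Expected fraction of the given items observed at least once.

    Models a partial crawl as num_samples independent draws, so an item
    with probability p is missed with probability (1 - p) ** num_samples.
    """
    seen = [1.0 - (1.0 - p) ** num_samples for p in probabilities]
    return sum(seen) / len(seen)


# Hypothetical catalogue of 10,000 items, ranked by popularity.
probs = zipf_probabilities(num_items=10_000)
head = probs[:100]       # the 100 most popular items
tail = probs[-5_000:]    # the least popular half of the catalogue

for num_samples in (1_000, 10_000, 100_000):
    print(
        num_samples,
        round(expected_coverage(head, num_samples), 3),
        round(expected_coverage(tail, num_samples), 3),
    )
```

Under these assumptions, a crawl that already observes nearly all of the head still misses most of the tail, which is consistent with the need for a much more exhaustive crawl when an application depends on rare content.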