Measuring the validity of peer-to-peer data for information retrieval applications

Authors:
Noam Koenigstein;Yuval Shavitt;Ela Weinsberg;Udi Weinsberg
Affiliations:
School of Electrical Engineering, Tel-Aviv University, Israel;School of Electrical Engineering, Tel-Aviv University, Israel;Dept. of Industrial Engineering, Tel-Aviv University, Israel;School of Electrical Engineering, Tel-Aviv University, Israel
Venue:
Computer Networks: The International Journal of Computer and Telecommunications Networking
Year:
2012

Abstract

Peer-to-peer (p2p) networks are increasingly adopted as a valuable resource for various information retrieval (IR) applications, including similarity estimation, content recommendation and trend prediction. However, these networks are usually extremely large and noisy, which raises doubts about whether sufficiently accurate information can actually be extracted from them. This paper quantifies the measurement effort required to collect data from p2p networks and refine it for use in IR applications. We identify and measure inherent difficulties in collecting p2p data, namely partial crawling, user-generated noise, sparseness, and the popularity and localization of content and search queries. These aspects are quantified using music files shared in the Gnutella p2p network. We show that the power-law nature of the network makes it possible to capture an accurate view of the popular content with relatively little effort. However, some applications, such as trend prediction, require data from the "long tail", and hence a much more exhaustive crawl. Furthermore, we show that content and search queries are highly localized, indicating that conclusions spanning multiple locations require a geographically widespread crawl. Finally, we present techniques for overcoming noise originating from user-generated content and for filtering non-informative data while minimizing information loss.
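The abstract's claim that a power-law popularity distribution lets a partial crawl capture popular content, while the long tail needs a far more exhaustive crawl, can be illustrated with a toy simulation. The sketch below is not the paper's methodology or data: the function names (`simulate_crawl`, `coverage`), the Zipf-like weights, and every parameter value are hypothetical choices made only for illustration.

```python
import random
from itertools import accumulate


def simulate_crawl(num_songs=2000, num_peers=1000, songs_per_peer=20,
                   crawl_fraction=0.1, zipf_exponent=1.0, seed=0):
    """Return the set of song ids seen when crawling a fraction of peers.

    Assumption (illustrative only): song popularity follows a Zipf-like
    law, where the song at rank r is shared with weight 1 / r**exponent.
    Song id 0 is the most popular song.
    """
    rng = random.Random(seed)
    weights = [1.0 / (rank ** zipf_exponent)
               for rank in range(1, num_songs + 1)]
    cum_weights = list(accumulate(weights))
    song_ids = range(num_songs)

    # Each synthetic peer shares a handful of songs drawn by popularity.
    peers = [set(rng.choices(song_ids, cum_weights=cum_weights,
                             k=songs_per_peer))
             for _ in range(num_peers)]

    # A partial crawl observes only a random subset of the peers.
    crawled = rng.sample(peers, int(crawl_fraction * num_peers))
    return set().union(*crawled)


def coverage(seen, song_ids):
    """Fraction of the given song ids that appear in the crawl."""
    song_ids = list(song_ids)
    return sum(1 for s in song_ids if s in seen) / len(song_ids)


if __name__ == "__main__":
    seen = simulate_crawl()
    print("head coverage (top 50 songs):", coverage(seen, range(50)))
    print("tail coverage (ranks 1000+):", coverage(seen, range(1000, 2000)))
```

Under these assumptions, crawling 10% of the synthetic peers observes nearly all of the most popular songs but only a small share of the tail. Raising `crawl_fraction` shows how much additional effort the tail demands.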