Learning and incentives in user-generated content: multi-armed bandits with endogenous arms

  • Authors:
  • Arpita Ghosh; Patrick Hummel

  • Affiliations:
  • Cornell University, Ithaca, NY, USA; Google Inc., Mountain View, CA, USA

  • Venue:
  • Proceedings of the 4th conference on Innovations in Theoretical Computer Science

  • Year:
  • 2013

Abstract

Motivated by the problem of learning the qualities of user-generated content on the Web, we study a multi-armed bandit problem in which the number and success probabilities of the bandit's arms are endogenously determined by strategic agents responding to the incentives provided by the learning algorithm. We model contributors of user-generated content as attention-motivated agents who derive benefit when their contribution is displayed and incur a cost to produce quality, where a contribution's quality is the probability that it receives a positive viewer vote. Agents strategically choose whether to contribute, and at what quality, in response to the algorithm that decides how contributions are displayed. The algorithm, which would ultimately like to display only the highest-quality contributions, can learn a contribution's quality only from the viewer votes the contribution receives when displayed.

The problem of inferring the relative qualities of contributions from viewer feedback, so as to optimize overall viewer satisfaction over time, can then be modeled as the classic multi-armed bandit problem, except that the arms available to the bandit, and therefore the achievable regret, are endogenously determined by strategic agents: a good algorithm for this setting must not only quickly identify the best contributions, but also incentivize high-quality contributions to choose among in the first place.

We first analyze the well-known UCB algorithm [Auer et al. 2002] as a mechanism in this setting, where the total number of potential contributors (arms), K, can grow with the total number of viewers (available periods), T, and the maximum possible success probability of an arm, γ, may be bounded away from 1 to model malicious or error-prone viewers in the audience. We show that while the UCB mechanism can incentivize high-quality arms and achieve strong sublinear equilibrium regret when K(T) does not grow too quickly with T, it incentivizes very low-quality contributions when K(T) scales proportionally with T. We then show that modifying the UCB mechanism to explore a randomly chosen restricted subset of √T arms yields excellent incentive properties: this modified mechanism achieves strong sublinear regret, i.e., regret measured against the maximum achievable quality γ, in every equilibrium, for all ranges of K(T) ≤ T, and for all possible values of the audience parameter γ.
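To make the modified mechanism concrete, the following is a minimal Python sketch, assuming the simplest reading of the abstract: draw a uniformly random subset of ⌈√T⌉ of the K arms, then run standard UCB1 [Auer et al. 2002] on that subset. Everything here is illustrative rather than the paper's exact mechanism; in particular, the `success_probs` vector stands in for arm qualities that the paper models as strategic choices by contributors, which this sketch does not capture.

```python
import math
import random

def restricted_ucb(success_probs, T, rng=random):
    """Sketch of the restricted-exploration UCB mechanism: run UCB1 on a
    uniformly random subset of ~sqrt(T) of the K arms.

    `success_probs` is a hypothetical stand-in for the (unknown to the
    learner) arm qualities; it is used here only to simulate viewer votes.
    """
    K = len(success_probs)
    m = min(K, math.ceil(math.sqrt(T)))
    subset = rng.sample(range(K), m)       # restricted arm set

    counts = {a: 0 for a in subset}        # times each arm was displayed
    rewards = {a: 0.0 for a in subset}     # positive votes received
    total_reward = 0.0

    for t in range(1, T + 1):
        # Display each arm in the subset once before using UCB indices.
        untried = [a for a in subset if counts[a] == 0]
        if untried:
            arm = untried[0]
        else:
            # UCB1 index: empirical mean plus confidence radius.
            arm = max(subset, key=lambda a: rewards[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        vote = 1.0 if rng.random() < success_probs[arm] else 0.0
        counts[arm] += 1
        rewards[arm] += vote
        total_reward += vote
    return total_reward

if __name__ == "__main__":
    # Hypothetical demo: K = 1000 potential arms, audience cap gamma = 0.9.
    qualities = [random.uniform(0.0, 0.9) for _ in range(1000)]
    print(restricted_ucb(qualities, T=10_000))
```

The design point the sketch illustrates is that restricting exploration to ~√T arms caps each contributor's chance of ever being displayed, which, per the abstract, is what changes contributors' equilibrium incentives relative to unrestricted UCB when K(T) grows proportionally with T.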