References:
- "Finite-time Analysis of the Multiarmed Bandit Problem," Machine Learning.
- "A framework of energy efficient mobile sensing for automatic user state recognition," Proceedings of the 7th International Conference on Mobile Systems, Applications, and Services.
- IEEE/ACM Transactions on Networking (TON).
We consider the problem of automatically learning the optimal thermal control in a room so as to maximize the expected average satisfaction among occupants, who provide stochastic feedback on their comfort through a participatory sensing application. Assuming no prior knowledge or model of user comfort, we first apply the classic UCB1 online learning policy for multi-armed bandits (MAB), which balances exploration (testing certain temperatures to better understand user preferences) with exploitation (spending more time at temperatures that maximize average satisfaction), for the case when the total occupancy is constant. When occupancy is time-varying, the number of possible scenarios (i.e., which particular set of occupants is present in the room) becomes exponentially large, posing a combinatorial challenge. However, we show that LLR, a recently developed combinatorial MAB online learning algorithm that requires recording and computing only a polynomial number of quantities, can be applied to this setting. It yields a regret (the cumulative gap in average satisfaction relative to a distribution-aware genie) that grows only polynomially in the number of users and logarithmically in time, which implies that the difference in per-unit-time satisfaction between the learning policy and the optimal policy tends to zero. We quantify the performance of these online learning algorithms using real data collected from users of a participatory sensing iPhone app in a multi-occupancy room in an office building in Southern California.
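To make the exploration/exploitation trade-off concrete, the following is a minimal sketch of the standard UCB1 policy applied to this setting. It is not the paper's implementation: the arm set, the reward model (each arm returning a stochastic satisfaction score in [0, 1], standing in for the occupants' comfort feedback), and all names are illustrative assumptions.

```python
import math
import random

def ucb1(arms, horizon, rng=random.Random(0)):
    """Standard UCB1: play the arm maximizing empirical mean + sqrt(2 ln t / n_a).

    `arms` is a list of callables; arms[a](rng) returns a stochastic reward
    in [0, 1] (here: an occupant-satisfaction score for temperature setting a).
    Returns the play counts per arm and the total accumulated reward.
    """
    n_arms = len(arms)
    counts = [0] * n_arms    # number of times each temperature was tried
    means = [0.0] * n_arms   # empirical mean satisfaction per temperature
    total_reward = 0.0
    for t in range(horizon):
        if t < n_arms:
            arm = t          # initialization: try every temperature once
        else:
            # Exploitation (means[a]) plus an exploration bonus that shrinks
            # as an arm is sampled more often.
            arm = max(range(n_arms),
                      key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        reward = arms[arm](rng)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
        total_reward += reward
    return counts, total_reward
```

Run against, say, three hypothetical temperature settings with Bernoulli satisfaction probabilities 0.2, 0.5, and 0.9, the policy concentrates its plays on the best arm while sampling the others only logarithmically often, which is the source of the logarithmic-in-time regret cited above.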