References:
- "Finite-time Analysis of the Multiarmed Bandit Problem," Machine Learning.
- "A framework of energy efficient mobile sensing for automatic user state recognition," Proceedings of the 7th International Conference on Mobile Systems, Applications, and Services.
- IEEE/ACM Transactions on Networking (TON).
We consider the problem of automatically learning the optimal thermal control in a room so as to maximize the expected average satisfaction among occupants, who provide stochastic feedback on their comfort through a participatory sensing application. Assuming no prior knowledge or model of user comfort, we first apply the classic UCB1 online learning policy for multi-armed bandits (MAB), which balances exploration (testing certain temperatures to better understand user preferences) with exploitation (spending more time at temperatures that maximize average satisfaction), for the case when the total occupancy is constant. When occupancy is time-varying, the number of possible scenarios (i.e., which particular set of occupants is present in the room) becomes exponentially large, posing a combinatorial challenge. However, we show that LLR, a recently developed combinatorial MAB online learning algorithm that requires recording and computing only a polynomial number of quantities, can be applied to this setting. It yields a regret (the cumulative gap in average satisfaction relative to a distribution-aware genie) that grows only polynomially in the number of users and logarithmically in time, which implies that the difference in per-unit-time satisfaction between the learning policy and the optimal policy tends to zero. We quantify the performance of these online learning algorithms using real data collected from users of a participatory sensing iPhone app in a multi-occupancy room in an office building in Southern California.
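To make the exploration/exploitation trade-off concrete, the following is a minimal sketch of the standard UCB1 policy applied to this setting. It is not the paper's implementation: the arm set, the reward model (each arm returning a stochastic satisfaction score in [0, 1], standing in for the occupants' comfort feedback), and all names are illustrative assumptions.

```python
import math
import random

def ucb1(arms, horizon, rng=random.Random(0)):
    """Standard UCB1: play the arm maximizing empirical mean + sqrt(2 ln t / n_a).

    `arms` is a list of callables; arms[a](rng) returns a stochastic reward
    in [0, 1] (here: an occupant-satisfaction score for temperature setting a).
    Returns the play counts per arm and the total accumulated reward.
    """
    n_arms = len(arms)
    counts = [0] * n_arms    # number of times each temperature was tried
    means = [0.0] * n_arms   # empirical mean satisfaction per temperature
    total_reward = 0.0
    for t in range(horizon):
        if t < n_arms:
            arm = t          # initialization: try every temperature once
        else:
            # Exploitation (means[a]) plus an exploration bonus that shrinks
            # as an arm is sampled more often.
            arm = max(range(n_arms),
                      key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        reward = arms[arm](rng)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
        total_reward += reward
    return counts, total_reward
```

Run against, say, three hypothetical temperature settings with Bernoulli satisfaction probabilities 0.2, 0.5, and 0.9, the policy concentrates its plays on the best arm while sampling the others only logarithmically often, which is the source of the logarithmic-in-time regret cited above.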