Online Learning for Personalized Room-Level Thermal Control: A Multi-Armed Bandit Framework

  • Authors:
  • Parisa Mansourifard, Farrokh Jazizadeh, Bhaskar Krishnamachari, Burcin Becerik-Gerber

  • Affiliations:
  • Ming Hsieh Dept. of Electrical Engineering, University of Southern California, Los Angeles, CA, USA (P. Mansourifard, B. Krishnamachari)
  • Sony Astani Dept. of Civil and Environmental Engineering, University of Southern California, Los Angeles, CA, USA (F. Jazizadeh, B. Becerik-Gerber)

  • Venue:
  • Proceedings of the 5th ACM Workshop on Embedded Systems For Energy-Efficient Buildings
  • Year:
  • 2013

Abstract

We consider the problem of automatically learning the optimal thermal control in a room in order to maximize the expected average satisfaction among occupants, who provide stochastic feedback on their comfort through a participatory sensing application. Assuming no prior knowledge or model of user comfort, we first apply the classic UCB1 online learning policy for multi-armed bandits (MAB), which combines exploration (trying out temperatures to better understand user preferences) with exploitation (spending more time at temperatures that maximize average satisfaction), for the case when total occupancy is constant. When occupancy is time-varying, the number of possible scenarios (i.e., which particular set of occupants is present in the room) becomes exponentially large, posing a combinatorial challenge. However, we show that LLR, a recently developed combinatorial MAB online learning algorithm that requires recording and computing only a polynomial number of quantities, can be applied to this setting, yielding a regret (the cumulative gap in average satisfaction with respect to a distribution-aware genie) that grows only polynomially in the number of users and logarithmically in time. This in turn implies that the per-unit-time satisfaction gap between the learning policy and the optimal policy tends to zero. We quantify the performance of these online learning algorithms using real data collected from users of a participatory sensing iPhone app in a multi-occupancy room in an office building in Southern California.
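As a concrete illustration of the constant-occupancy case described in the abstract, the sketch below implements standard UCB1 over a discrete grid of temperature set-points, treating each set-point as an arm and averaged occupant comfort feedback as the stochastic reward. The class name, set-point grid, and Bernoulli feedback model are illustrative assumptions for this sketch, not the authors' implementation.

```python
import math
import random

class UCB1ThermalController:
    """UCB1 over a discrete set of temperature set-points (arms).

    Reward is assumed to be occupant satisfaction feedback in [0, 1];
    all names here are hypothetical, not the paper's code.
    """

    def __init__(self, setpoints):
        self.setpoints = list(setpoints)          # candidate temperatures (arms)
        self.counts = [0] * len(self.setpoints)   # times each arm was played
        self.means = [0.0] * len(self.setpoints)  # empirical mean satisfaction
        self.t = 0                                # total rounds played

    def select_setpoint(self):
        """Return the index of the temperature to set next."""
        self.t += 1
        # Play each arm once before applying the UCB rule.
        for i, n in enumerate(self.counts):
            if n == 0:
                return i
        # UCB1 index: empirical mean plus exploration bonus sqrt(2 ln t / n_i).
        ucb = [m + math.sqrt(2.0 * math.log(self.t) / n)
               for m, n in zip(self.means, self.counts)]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, arm, reward):
        """Incrementally update the empirical mean reward of the chosen arm."""
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


# Toy usage: hypothetical (unknown to the learner) mean satisfaction per set-point.
true_means = {20: 0.55, 21: 0.70, 22: 0.85, 23: 0.75, 24: 0.50}
ctrl = UCB1ThermalController(sorted(true_means))
for _ in range(2000):
    arm = ctrl.select_setpoint()
    temp = ctrl.setpoints[arm]
    feedback = 1.0 if random.random() < true_means[temp] else 0.0  # Bernoulli comfort vote
    ctrl.update(arm, feedback)
best = max(range(len(ctrl.means)), key=ctrl.means.__getitem__)
print("best learned set-point:", ctrl.setpoints[best])
```

The exploration bonus shrinks as an arm is sampled more often, so the controller concentrates on the set-point with the highest observed average satisfaction while still occasionally revisiting the others; this is what yields the logarithmic-in-time regret cited in the abstract. The time-varying-occupancy case handled by LLR additionally indexes the empirical statistics per user rather than per scenario, which is what keeps the bookkeeping polynomial.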