Regime Switching Bandits
Refereed conference paper presented and published in conference proceedings


Other information
Abstract

We study a multi-armed bandit problem where the rewards exhibit regime switching.
Specifically, the distributions of the random rewards generated from all arms are
modulated by a common underlying state modeled as a finite-state Markov chain.
The agent does not observe the underlying state and has to learn the transition
matrix and the reward distributions. We propose a learning algorithm for this
problem, building on spectral method-of-moments estimation for hidden Markov
models, belief error control in partially observable Markov decision processes,
and upper-confidence-bound methods for online learning. We also establish an
O(T^{2/3} \sqrt{\log T}) upper bound on the regret of the proposed learning
algorithm, where T is the learning horizon. Finally, we conduct proof-of-concept
experiments to illustrate the performance of the learning algorithm.
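For intuition only, the following is a minimal sketch of the problem setting described in the abstract: a hidden two-state Markov chain modulates the mean rewards of two arms, and a standard UCB1 baseline plays the bandit while ignoring the hidden regime. This is not the paper's spectral-estimation-based algorithm; the transition matrix, reward means, and all names here are illustrative assumptions, not values from the paper.

# Illustrative sketch (not the authors' algorithm): a two-state
# regime-switching bandit with Gaussian arm rewards, played by a plain
# UCB1 baseline. All parameters below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Hidden Markov chain over regimes: 2x2 transition matrix (assumed values).
P = np.array([[0.95, 0.05],
              [0.10, 0.90]])
# Mean reward of each arm in each regime (rows: regimes, cols: arms).
means = np.array([[1.0, 0.2],
                  [0.1, 0.8]])
n_arms = means.shape[1]

def run_ucb1(T=10_000):
    """Run UCB1 for T rounds on the regime-switching environment."""
    state = 0
    counts = np.zeros(n_arms)   # number of pulls per arm
    sums = np.zeros(n_arms)     # cumulative reward per arm
    total = 0.0
    for t in range(1, T + 1):
        # Environment: the hidden regime evolves, unobserved by the agent.
        state = rng.choice(2, p=P[state])
        # Agent: pull each arm once, then follow the UCB1 index.
        if t <= n_arms:
            arm = t - 1
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
            arm = int(np.argmax(ucb))
        reward = rng.normal(means[state, arm], 0.5)
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total

print(f"cumulative reward: {run_ucb1():.1f}")

Because UCB1 treats each arm's rewards as i.i.d., it cannot track the regime; the algorithm proposed in the paper additionally estimates the transition matrix and reward distributions via spectral method-of-moments and controls the error of a belief over the hidden state.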
All Author(s) List: Xiang Zhou, Yi Xiong, Ningyuan Chen, Xuefeng Gao
Name of Conference: 35th Conference on Neural Information Processing Systems (NeurIPS 2021)
Start Date of Conference: 07/12/2021
End Date of Conference: 10/12/2021
Place of Conference: Virtual
Country/Region of Conference: United States of America
Proceedings Title: Advances in Neural Information Processing Systems
Year: 2021
Volume Number: 6
Pages: 4542–4554
Languages: English-United States

Last updated on 2024-08-21 at 01:54