Risk-Sensitive and Risk-Neutral Multi-Armed Bandits
Uriel G. Rothblum
Faculty of Industrial Engineering and Management
The Technion
Friday, October 27, 2006
3:15 - 4:15 PM
Terman Engineering Center, Room 453
Abstract:
The classic result for the multi-armed bandit problem is that each
state of each bandit (a Markov chain with rewards) has an "index" (the
Gittins index) that
depends only on the data for its bandit, and expected discounted
income is maximized by playing at each epoch a bandit whose current
state has the largest index. We extend this result to the cases of
risk-averse and risk-seeking exponential utility functions, while
generalizing it somewhat for the risk-neutral case. Our approach is
constructive and uses "domination" in place of indices. A simple
recursion assigns to each state of each bandit a "reward" and an
"amplification" that depend solely on the data for its bandit. These
rewards and amplifications determine a ranking of the states. We show
that it is optimal to play at each epoch any bandit whose state has
maximum ranking. The algorithm we develop is simpler than previous
algorithms that determine indices. (The talk is based on joint work
with Eric V. Denardo and Haechurl Park.)
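
For context, here is a minimal sketch (in Python, using numpy) of the
classic risk-neutral index the abstract starts from, computed via
Whittle's retirement calibration: bisect on a lump-sum retirement
reward M until stopping and continuing from a state are equally
attractive. This is a standard textbook construction for illustration
only; the talk's reward/amplification recursion is not given in the
abstract, and the function name, tolerances, and toy two-state bandit
below are hypothetical.

    import numpy as np

    def gittins_index(P, r, beta, tol=1e-6):
        """Per-period Gittins index of each state of one bandit
        (transition matrix P, reward vector r, discount beta), via
        bisection on Whittle's lump-sum retirement reward M."""
        n = len(r)
        # Every lump-sum indifference point lies in [lo0, hi0].
        lo0 = r.min() / (1.0 - beta)
        hi0 = r.max() / (1.0 - beta)
        index = np.empty(n)
        for s in range(n):
            lo, hi = lo0, hi0
            while hi - lo > tol:
                M = 0.5 * (lo + hi)
                # Value iteration for the optimal-stopping problem in
                # which retiring for lump sum M is always available.
                V = np.full(n, M)
                for _ in range(500):
                    V = np.maximum(M, r + beta * (P @ V))
                # If continuing from s still beats retiring, raise M.
                if r[s] + beta * (P[s] @ V) >= M:
                    lo = M
                else:
                    hi = M
            index[s] = (1.0 - beta) * 0.5 * (lo + hi)
        return index

    # Toy bandit: state 0 pays 1 and tends to persist, state 1 pays 0.
    P = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
    r = np.array([1.0, 0.0])
    print(gittins_index(P, r, beta=0.9))

The printed values rank the states of this bandit; the classic theorem
then says that playing, at every epoch, any bandit whose current state
has the largest index maximizes expected discounted income.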
Operations Research Colloquia: http://or.stanford.edu/oras_seminars.html