Risk-Sensitive and Risk-Neutral Multi-Armed Bandits

Uriel G. Rothblum
Faculty of Industrial Engineering and Management
The Technion


Friday, October 27, 2006
3:15 - 4:15 PM
Terman Engineering Center, Room 453


Abstract:

The classic result for the multi-armed bandit problem is that each state of each bandit (a Markov chain with rewards) has an "index" that depends only on the data for its bandit, and that expected discounted income is maximized by playing, at each epoch, a bandit whose current state has the largest index. We extend this result to the cases of risk-averse and risk-seeking exponential utility functions, while generalizing it somewhat for the risk-neutral case. Our approach is constructive and uses "domination" in place of indices. A simple recursion assigns to each state of each bandit a "reward" and an "amplification" that depend solely on the data for its bandit. These rewards and amplifications determine a ranking of the states, and we show that it is optimal to play, at each epoch, any bandit whose current state has the maximum ranking. The algorithm we develop is simpler than previous algorithms for computing indices. (The talk is based on joint work by Eric V. Denardo, Haechurl Park, and Uriel G. Rothblum.)
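As a rough illustration of the policy structure described above, the following Python sketch plays, at each epoch, a bandit whose current state carries the largest precomputed per-state value. It shows only the classic play-the-largest-index rule; it does not reproduce the talk's reward/amplification recursion, and all class and variable names are hypothetical.

    import random

    # Minimal sketch of a largest-index policy for multi-armed bandits.
    # The per-state index values are assumed to be given; computing them
    # (the talk's contribution, via rewards and amplifications) is omitted.

    class Bandit:
        """A Markov chain with rewards, plus a precomputed index per state."""

        def __init__(self, transitions, rewards, index, start):
            self.transitions = transitions  # state -> list of (next_state, probability)
            self.rewards = rewards          # state -> immediate reward
            self.index = index              # state -> index value (assumed given)
            self.state = start

        def step(self):
            """Collect the current state's reward, then transition randomly."""
            reward = self.rewards[self.state]
            next_states, probs = zip(*self.transitions[self.state])
            self.state = random.choices(next_states, weights=probs)[0]
            return reward

    def play(bandits, horizon, discount=0.95):
        """At each epoch, play a bandit whose current state has the largest index."""
        total, factor = 0.0, 1.0
        for _ in range(horizon):
            chosen = max(bandits, key=lambda b: b.index[b.state])
            total += factor * chosen.step()
            factor *= discount
        return total

In the risk-sensitive setting of the talk, the single scalar index is replaced by a ranking derived from each state's reward and amplification, but the play-the-top-ranked-state structure of the policy is the same.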




Operations Research Colloquia: http://or.stanford.edu/oras_seminars.html