Motivated by applications in e-commerce and revenue management, we consider multiarmed bandit problems involving a large (possibly infinite) collection of arms, in which the expected reward of each arm is a linear function of an unknown univariate or multivariate random variable. Our proposed model also has potential applications in product design and health care. The objective is to choose a sequence of arms to minimize the cumulative regret and Bayes risk. For the univariate case, we find that a greedy policy suffices. For the multivariate case, when the set of arms is strongly convex, we provide a lower bound and a policy that results in a matching upper bound. In the absence of strong convexity, we describe a policy whose performance is within a polylogarithmic factor from the optimal.
This is a joint work with John Tsitsiklis and Adam Mersereau.