Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp014j03d233b
Title: Thompson Sampling for Bandit Problems
Authors: LIU, CHE-YU
Advisors: Bubeck, Sebastien
Contributors: Operations Research and Financial Engineering Department
Keywords: Bandit Problems
Bayesian Algorithms
Bounded Regrets
Exploration-Exploitation Tradeoff
Pure Exploration
Thompson Sampling
Subjects: Artificial intelligence
Computer science
Statistics
Issue Date: 2018
Publisher: Princeton, NJ : Princeton University
Abstract: Bandit problems are among the most basic examples of sequential decision-making problems with limited feedback and an exploration/exploitation trade-off. In these problems, an agent repeatedly selects an action out of a pool of candidates and receives a reward sampled from that action's reward distribution. The agent's goal is to maximize the sum of rewards received over time. The trade-off at each time step is between exploitation of the actions that have already produced high rewards in the past and exploration of poorly understood actions that have the potential to yield even higher rewards in the future. Bandit problems arise naturally in many applications, such as clinical trials, project management, and online news recommendation.

Thompson Sampling is a popular strategy for solving bandit problems. It selects actions using the "probability matching" principle: at each time step, an action is played with the probability that it is optimal under the current posterior. In the first part of this thesis, we analyze Thompson Sampling from several different angles. First, we prove a tight bound on Thompson Sampling's performance when that performance is averaged with respect to the prior distribution that Thompson Sampling uses as input. Next, we look at the more realistic non-averaged performance of Thompson Sampling and quantify its sensitivity to the choice of input prior by providing matching upper and lower bounds. Finally, we illustrate Thompson Sampling's ability to optimally exploit prior knowledge by thoroughly analyzing its behavior in a non-trivial example.

In the second part of this thesis, we switch our focus to the most-correlated-arms identification problem, where the actions' reward distributions are assumed to be jointly Gaussian and the goal is to find the actions with the most mutually correlated rewards. In this problem, unlike in bandit problems, we focus on exploring the actions to acquire as much relevant information as possible and exploit the acquired information only at the end, to return the set of correlated actions. We propose two adaptive action-selection strategies and show that they can have significant advantages over the non-adaptive uniform sampling strategy. Our proposed algorithms rely on a novel correlation estimator, whose accuracy allows us to obtain improved results for a wide range of problem instances.
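For illustration only, the sketch below shows one standard instantiation of the "probability matching" idea described above: Thompson Sampling for Bernoulli-reward arms with Beta(1,1) priors. The Bernoulli/Beta model, the function name, and its inputs are assumptions made for this example and are not taken from the dissertation; each round the agent samples a mean from every arm's posterior and plays the arm with the largest sample, so an arm is chosen with the probability that it currently looks best.

    import numpy as np

    def thompson_sampling(true_means, horizon, rng=None):
        # Illustrative sketch: Thompson Sampling for Bernoulli bandits
        # with independent Beta(1,1) priors on each arm's mean reward.
        rng = np.random.default_rng() if rng is None else rng
        k = len(true_means)
        alpha = np.ones(k)  # posterior Beta "success" counts
        beta = np.ones(k)   # posterior Beta "failure" counts
        total_reward = 0.0
        for _ in range(horizon):
            # Probability matching: draw one sample from each arm's
            # posterior and play the arm whose sample is largest.
            samples = rng.beta(alpha, beta)
            arm = int(np.argmax(samples))
            reward = rng.binomial(1, true_means[arm])
            alpha[arm] += reward
            beta[arm] += 1 - reward
            total_reward += reward
        return total_reward

    # Hypothetical usage: three arms; the agent must discover that the
    # third arm has the highest mean reward.
    print(thompson_sampling([0.3, 0.5, 0.7], horizon=1000))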
URI: http://arks.princeton.edu/ark:/88435/dsp014j03d233b
Alternate format: The Mudd Manuscript Library retains one bound copy of each dissertation. Search for these copies in the library's main catalog: catalog.princeton.edu
Type of Material: Academic dissertations (Ph.D.)
Language: en
Appears in Collections: Operations Research and Financial Engineering

Files in This Item:
File: LIU_princeton_0181D_12387.pdf (868.77 kB, Adobe PDF)


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.