Bandit Problems for Pairwise Feedback Models
A multiarmed bandit problem is a typical example of a dilemma between exploration and exploitation. In this problem, a gambler adaptively chooses the arm to pull based on the past rewards of the pulled arms so that the cumulative reward is maximized or the arm with the maximum reward can be estimated. Whereas this model is useful in many applications such as online advertisement, there are some cases that the reward (or utility) is hard to observe directly and only the feedback from pairwise comparisons is available. We introduce some problem formulations in this model such as dueling bandits and discuss efficient algorithms for them.