Kullback–Leibler Upper Confidence Bound
In multi-armed bandit problems, KL-UCB (for Kullback–Leibler Upper Confidence Bound)[1] is an algorithm of the UCB family that is asymptotically optimal, in the sense that its regret matches the problem-dependent lower bound derived by Lai and Robbins.[2]
Multi-armed bandit problem
The multi-armed bandit problem is a sequential game in which a player has to choose at each turn between $K$ actions (arms). Behind every arm $a \in \{1, \dots, K\}$ there is an unknown distribution $\nu_a$ that lies in a set $D$ known by the player (for example, $D$ can be the set of Gaussian distributions or the set of Bernoulli distributions).
At each turn $t$ the player chooses (pulls) an arm $A_t$ and then gets an observation (a reward) $X_t$ drawn from the distribution $\nu_{A_t}$.
Regret minimization
The goal is to minimize the regret at time $T$, defined as
$$R_T = T\mu^* - \mathbb{E}\left[\sum_{t=1}^{T} X_t\right] = \sum_{a=1}^{K} (\mu^* - \mu_a)\,\mathbb{E}[N_a(T)],$$
where
- $\mu_a$ is the mean of arm $a$
- $\mu^* = \max_a \mu_a$ is the highest mean
- $N_a(t)$ is the number of pulls of arm $a$ up to turn $t$
The player has to find an algorithm that chooses at each turn which arm to pull, based on the previous actions and observations, so as to minimize the regret $R_T$.
This is a trade-off between exploration, to identify the best arm (the arm with the highest mean), and exploitation, to play the arm currently believed to be the best as often as possible.[3]
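For concreteness, the following minimal Python sketch (the two Bernoulli arms and all names are illustrative choices, not taken from the references) estimates the regret of the naive policy that pulls a uniformly random arm at each turn:

    import random

    # Illustrative two-armed Bernoulli bandit; the means are arbitrary.
    means = [0.5, 0.45]          # mu_a for each arm
    mu_star = max(means)         # highest mean
    T = 10_000                   # horizon

    pulls = [0] * len(means)     # N_a(T): number of pulls of each arm
    for _ in range(T):
        a = random.randrange(len(means))   # uniformly random policy
        pulls[a] += 1                      # (reward draws are omitted: this
                                           # form of the regret needs only counts)

    # R_T = sum over arms of (mu* - mu_a) * N_a(T)  (one-run estimate).
    regret = sum((mu_star - means[a]) * pulls[a] for a in range(len(means)))
    print(f"regret after T={T} turns: {regret:.1f}")   # about 0.05 * T/2 = 250

The random policy incurs regret linear in $T$; the point of algorithms such as KL-UCB is to bring this growth down to order $\ln(T)$.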
Applications
Multi-armed bandit algorithms are used in a variety of fields; for example, they have applications in clinical trials, recommender systems, telecommunications,[4] and precision agriculture.[5]
Algorithm KL-UCB
The algorithm is a UCB-type algorithm based on the principle of optimism in the face of uncertainty: at each turn we compute an upper confidence bound (UCB) $U_a(t)$ for the mean of each arm $a$, and we then pull the arm with the highest UCB.
What distinguishes KL-UCB from other UCB algorithms is that its upper confidence bound is built from an empirical version of the quantity appearing in the Lai–Robbins lower bound,[2] which is what makes it asymptotically optimal.[1]
History
The algorithm was first introduced in 2011 for Bernoulli distributions.[1] It was then extended to one-dimensional exponential families and to bounded distributions in 2013.[6] The algorithm was also extended to Lipschitz bandits in 2014.[9] An adaptation called KL-UCB-Switch, which mixes MOSS[7] and KL-UCB, was developed in 2022 to match both the problem-dependent (Lai–Robbins) and problem-independent (minimax) asymptotic lower bounds.[8]
Formal algorithm
At first, the algorithm pulls each arm once. Then, for each turn $t \ge K$, for each arm $a$, we compute the index
$$U_a(t) = \sup\left\{\, \mathbb{E}(\nu) : \nu \in D \ \text{and} \ \mathrm{KL}\big(\hat{\nu}_a(t), \nu\big) \le \frac{f(t)}{N_a(t)} \,\right\},$$
where
- $\mathrm{KL}$ is the Kullback–Leibler divergence
- $\hat{\nu}_a(t)$ is the empirical distribution of arm $a$ at turn $t$
- $f$ is a well-chosen sequence of positive numbers, often equal to $f(t) = \ln(t) + c\ln(\ln(t))$ with $c \ge 3$.[6]
Then we choose the arm with the highest index:
$$A_{t+1} = \underset{a}{\arg\max}\; U_a(t).$$
We note that the algorithm does not require knowledge of the horizon $T$.
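For a one-dimensional exponential family the index depends only on the empirical mean, and the supremum can be computed by bisection, since $d(\hat{\mu}_a(t), \mu)$ is increasing in $\mu$ to the right of $\hat{\mu}_a(t)$. A minimal Python sketch for Bernoulli rewards (the function names are ours, not code from [1] or [6]):

    import math

    def bernoulli_kl(p, q, eps=1e-12):
        # Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q).
        p = min(max(p, eps), 1 - eps)
        q = min(max(q, eps), 1 - eps)
        return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

    def kl_ucb_index(mu_hat, n, t, c=3.0, iters=50):
        # Largest mu in [mu_hat, 1] with n * KL(mu_hat, mu) <= f(t), found by
        # bisection; f(t) = ln(t) + c*ln(ln(t)), guarded for small t.
        f_t = math.log(t) + c * math.log(max(math.log(t), 1.0))
        lo, hi = mu_hat, 1.0
        for _ in range(iters):
            mid = (lo + hi) / 2
            if n * bernoulli_kl(mu_hat, mid) <= f_t:
                lo = mid    # mid is feasible: move the lower end up
            else:
                hi = mid    # mid violates the constraint: move the top down
        return lo

Fifty bisection steps shrink the bracket by a factor of about $2^{-50}$, far below statistical accuracy, so the index is computed essentially exactly.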
Example
In the special case of Gaussian distributions with fixed variance $\sigma^2$, the Kullback–Leibler divergence between two Gaussians with means $x$ and $y$ is $(x-y)^2/(2\sigma^2)$, so the index has the closed form
$$U_a(t) = \hat{\mu}_a(t) + \sigma\sqrt{\frac{2 f(t)}{N_a(t)}},$$
with $\hat{\mu}_a(t)$ being the empirical mean of arm $a$ at turn $t$.
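Since the optimisation has a closed-form solution here, no numerical search is needed; a short Python sketch (our naming, with the simpler choice $f(t) = \ln(t)$ used in the pseudocode below):

    import math

    def gaussian_kl_ucb_index(mu_hat, n, t, sigma):
        # Closed-form KL-UCB index for Gaussian rewards with known variance:
        # KL(N(x, s^2) || N(y, s^2)) = (x - y)^2 / (2 s^2), so the sup is explicit.
        return mu_hat + sigma * math.sqrt(2.0 * math.log(t) / n)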
Pseudocode
The player gets the set D and the number of arms K
d ← ln(K)
for each arm i do:
    n[i] ← 0                      // number of pulls of arm i
    nu[i] ← empty                 // empirical distribution of arm i
for t from 1 to K do:             // initialisation: pull each arm once
    select arm t
    observe reward r
    n[t] ← n[t] + 1
    nu[t] ← update empirical distribution of arm t with r
for t from K+1 to T do:
    for each arm i do:
        index[i] ← compute_index(n[i], nu[i], D, d)
    select arm a with highest index[a]
    observe reward r
    n[a] ← n[a] + 1
    nu[a] ← update empirical distribution of arm a with r
    d ← ln(t+1)
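Putting the pieces together, a runnable Python version of this pseudocode for Bernoulli arms might look as follows (it reuses kl_ucb_index from the sketch above; since the empirical distribution of a Bernoulli arm is summarised by its mean, only pull counts and reward sums are stored):

    import random

    def kl_ucb_bernoulli(means, T):
        # means: true Bernoulli parameters, used only to draw rewards;
        # the algorithm itself never reads them directly.
        K = len(means)
        n = [0] * K                  # n[i]: number of pulls of arm i
        s = [0.0] * K                # s[i]: sum of rewards of arm i
        for t in range(1, T + 1):
            if t <= K:
                a = t - 1            # initialisation: pull each arm once
            else:                    # pull the arm with the highest index
                a = max(range(K), key=lambda i: kl_ucb_index(s[i] / n[i], n[i], t))
            r = 1.0 if random.random() < means[a] else 0.0
            n[a] += 1
            s[a] += r
        return n

    pulls = kl_ucb_bernoulli([0.5, 0.45], T=10_000)
    print(pulls)   # the arm with mean 0.5 should receive most of the pulls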
Theoretical results
In the multi-armed bandit problem we have the Lai–Robbins[2] asymptotic lower bound on the regret. The algorithm KL-UCB matches this lower bound for one-dimensional exponential families and for distributions bounded in $[0,1]$, with suitable choices of $f$ such as the one given above.[6]
Lai–Robbins lower bound
In 1985 Lai and Robbins proved an asymptotic, problem-dependent lower bound on the regret.
It states that every consistent algorithm on the set $D$ — that is, an algorithm for which, on every bandit problem in $D$ and for all $\alpha > 0$, the regret is subpolynomial (i.e. $R_T = o(T^{\alpha})$) — satisfies:
$$\liminf_{T \to \infty} \frac{R_T}{\ln(T)} \;\ge\; \sum_{a \,:\, \mu_a < \mu^*} \frac{\mu^* - \mu_a}{\mathcal{K}_{\inf}(\nu_a, \mu^*)},$$
where $\mathcal{K}_{\inf}(\nu_a, \mu^*) = \inf\{\mathrm{KL}(\nu_a, \nu) : \nu \in D,\ \mathbb{E}(\nu) > \mu^*\}$ is the smallest Kullback–Leibler divergence between the distribution of arm $a$ and a distribution of $D$ whose mean exceeds $\mu^*$.
This bound is asymptotic (as $T \to \infty$) and gives a first-order lower bound of order $\ln(T)$ with the optimal constant in front of it.
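For instance, for two Bernoulli arms with means $\mu^* = 0.5$ and $\mu_2 = 0.45$ (so that $\mathcal{K}_{\inf}$ reduces to the Bernoulli divergence $d(x,y) = x\ln\frac{x}{y} + (1-x)\ln\frac{1-x}{1-y}$), the denominator is
$$d(0.45, 0.5) = 0.45\ln\frac{0.45}{0.5} + 0.55\ln\frac{0.55}{0.5} \approx 0.0050,$$
so any consistent algorithm must asymptotically incur a regret of at least about $(0.05/0.0050)\ln(T) \approx 10\ln(T)$: arms that are close to optimal are the most expensive to tell apart.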
Regret bound for KL-UCB
The algorithm matches the Lai–Robbins[2] lower bound for one-dimensional exponential-family distributions and for distributions bounded in $[0,1]$.[6]
One-dimensional exponential family
For $D$ a one-dimensional exponential family, with $f(t) = \ln(t) + c\ln(\ln(t))$ and $c \ge 3$, we have the following upper bound on the regret of KL-UCB:[6]
$$R_T \le \sum_{a \,:\, \mu_a < \mu^*} \frac{\mu^* - \mu_a}{d(\mu_a, \mu^*)}\,\ln(T) + O\big(\sqrt{\ln(T)}\big),$$
where $d(x, y)$ denotes the Kullback–Leibler divergence between the two distributions of $D$ with means $x$ and $y$.
Bounded distributions in [0,1]
For $D$ the set of distributions supported on $[0,1]$, and for the same choice of $f$, we have the following upper bound on the regret of KL-UCB:[6]
$$\limsup_{T \to \infty} \frac{R_T}{\ln(T)} \le \sum_{a \,:\, \mu_a < \mu^*} \frac{\mu^* - \mu_a}{\mathcal{K}_{\inf}(\nu_a, \mu^*)},$$
which matches the Lai–Robbins lower bound.
Runtime
For $D$ the set of distributions bounded in $[0,1]$, computing the index of an arm with $n$ observations requires numerical optimisation at every step, and the resulting per-step runtime grows with $n$.[10] It is higher than that of other asymptotically optimal algorithms such as NPTS,[11] MED[12] and IMED.[13][10]
The high runtime of KL-UCB is due to a two-level optimisation: for each arm $a$ and each candidate mean $\mu$, the algorithm evaluates $\mathcal{K}_{\inf}(\hat{\nu}_a(t), \mu)$, and then maximises over $\mu$ subject to the constraint $\mathcal{K}_{\inf}(\hat{\nu}_a(t), \mu) \le f(t)/N_a(t)$. For distributions bounded in $[0,1]$ the inner problem has no closed form and must be solved numerically, which increases the per-step cost.[12][6]
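As a rough illustration of this two-level structure, the sketch below (a simplification under our own naming; it assumes the dual representation of $\mathcal{K}_{\inf}$ for distributions on $[0,1]$ due to Honda and Takemura,[12] and is not one of the implementations benchmarked in [10]) nests a one-dimensional maximisation inside a bisection over the candidate mean:

    import math

    def kinf(xs, mu, grid=200):
        # K_inf(nu_hat, mu) for the empirical distribution of xs in [0, 1]:
        # dual form, max over lambda in [0, 1) of the empirical mean of
        # log(1 - lambda * (x - mu) / (1 - mu)), solved here by grid search.
        best = 0.0
        for k in range(grid):
            lam = k / grid
            val = sum(math.log(1 - lam * (x - mu) / (1 - mu)) for x in xs) / len(xs)
            best = max(best, val)
        return best

    def kl_ucb_index_bounded(xs, t, iters=30):
        # Outer bisection over the candidate mean mu; every step solves the
        # inner K_inf problem numerically, hence the two nested loops.
        f_t = math.log(t)
        lo, hi = sum(xs) / len(xs), 1.0 - 1e-9
        for _ in range(iters):
            mid = (lo + hi) / 2
            if kinf(xs, mid) <= f_t / len(xs):
                lo = mid
            else:
                hi = mid
        return lo

Every evaluation of the inner problem touches all $n$ observations and is repeated at each step of the outer bisection, which is exactly the nesting that makes the per-step cost of KL-UCB grow quickly with $n$.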
References
- ^ a b c Maillard, Odalric-Ambrym; Munos, Rémi; Stoltz, Gilles (2011). "A Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences". In Kakade, Sham M.; von Luxburg, Ulrike (eds.). Proceedings of the 24th Annual Conference on Learning Theory. Proceedings of Machine Learning Research. Vol. 19. Budapest, Hungary: PMLR. pp. 497–514.
- ^ a b c d Lai, T.L.; Robbins, Herbert (1985). "Asymptotically Efficient Adaptive Allocation Rules". Advances in Applied Mathematics. 6 (1): 4–22. doi:10.1016/0196-8858(85)90002-8.
- ^ Lattimore, Tor; Szepesvári, Csaba (2020). Bandit Algorithms. Cambridge: Cambridge University Press.
- ^ Bouneffouf, Djallel; Rish, Irina (2019). "A survey on practical applications of multi-armed and contextual bandits". arXiv:1904.10040 [cs.LG].
- ^ Gautron, Romain; Baudry, Dorian; Adam, Myriam; Falconnier, Gatien N; Hoogenboom, Gerrit; King, Brian; Corbeels, Marc (2024). "A new adaptive identification strategy of best crop management with farmers". Field Crops Research. 307: 109249.
- ^ a b c d e f g Cappé, Olivier; Garivier, Aurélien; Maillard, Odalric-Ambrym; Munos, Rémi; Stoltz, Gilles (2013). "Kullback-Leibler Upper Confidence Bounds for Optimal Sequential Allocation". The Annals of Statistics. 41 (3): 1516–1541.
- ^ Audibert, Jean-Yves; Bubeck, Sébastien (2009). "Minimax Policies for Adversarial and Stochastic Bandits". Proceedings of the 22nd Annual Conference on Learning Theory (COLT). pp. 217–226.
- ^ Garivier, Aurélien; Hadiji, Hédi; Ménard, Pierre; Stoltz, Gilles (2022). "KL-UCB-switch: Optimal Regret Bounds for Stochastic Bandits from Both a Distribution-Dependent and a Distribution-Free Viewpoints". Journal of Machine Learning Research. 23 (179): 1–66.
- ^ Magureanu, Stefan; Combes, Richard; Proutière, Alexandre (2014). "Lipschitz Bandits: Regret Lower Bounds and Optimal Algorithms". arXiv:1405.4758 [cs.LG].
- ^ a b c d Baudry, Dorian; Pesquerel, Fabien; Degenne, Rémy; Maillard, Odalric-Ambrym (2023). "Fast Asymptotically Optimal Algorithms for Non-Parametric Stochastic Bandits". Advances in Neural Information Processing Systems. 36: 11469–11514.
- ^ Riou, Charles; Honda, Junya (2020). "Bandit Algorithms Based on Thompson Sampling for Bounded Reward Distributions". In Kontorovich, Aryeh; Neu, Gergely (eds.). Proceedings of the 31st International Conference on Algorithmic Learning Theory. Proceedings of Machine Learning Research. Vol. 117. PMLR. pp. 777–826.
- ^ a b Honda, Junya; Takemura, Akimichi (2010). "An Asymptotically Optimal Bandit Algorithm for Bounded Support Models". COLT. pp. 67–79.
- ^ Honda, Junya; Takemura, Akimichi (2015). "Non-Asymptotic Analysis of a New Bandit Algorithm for Semi-Bounded Rewards". Journal of Machine Learning Research. 16 (113): 3721–3756.