Kullback–Leibler Upper Confidence Bound
In multi-armed bandit problems, KL-UCB (for Kullback–Leibler Upper Confidence Bound)[1] is an algorithm of the UCB family that is asymptotically optimal, in the sense that its regret matches the problem-dependent lower bound derived by Lai and Robbins.[2]
Multi-armed bandit problem
The multi-armed bandit problem is a sequential game in which a player has to choose at each turn between $K$ actions (arms). Behind every arm $a \in \{1, \dots, K\}$ there is an unknown distribution $\nu_a$ that lies in a set $D$ known by the player (for example, $D$ can be the set of Gaussian distributions or the set of Bernoulli distributions).
At each turn $t$ the player chooses (pulls) an arm $A_t$ and then gets an observation (a reward) $X_t$ drawn from the distribution $\nu_{A_t}$.
Regret minimization
The goal is to minimize the regret at time $T$, defined as
$$R_T = T\mu^* - \mathbb{E}\left[\sum_{t=1}^{T} X_t\right] = \sum_{a=1}^{K} (\mu^* - \mu_a)\,\mathbb{E}[N_a(T)],$$
where
- $\mu_a$ is the mean of arm $a$
- $\mu^* = \max_a \mu_a$ is the highest mean
- $N_a(t)$ is the number of pulls of arm $a$ up to turn $t$
The player has to find an algorithm that chooses at each turn which arm to pull, based on the previous actions and observations, so as to minimize the regret $R_T$.
This is a trade-off between exploration, to identify the best arm (the arm with the highest mean), and exploitation, to play the arm currently believed to be the best as often as possible.[3]
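For concreteness, the following minimal Python sketch (the two Bernoulli arms and all names are illustrative choices, not taken from the references) estimates the regret of the naive policy that pulls a uniformly random arm at each turn:

    import random

    # Illustrative two-armed Bernoulli bandit; the means are arbitrary.
    means = [0.5, 0.45]          # mu_a for each arm
    mu_star = max(means)         # highest mean
    T = 10_000                   # horizon

    pulls = [0] * len(means)     # N_a(T): number of pulls of each arm
    for _ in range(T):
        a = random.randrange(len(means))   # uniformly random policy
        pulls[a] += 1                      # (reward draws are omitted: this
                                           # form of the regret needs only counts)

    # R_T = sum over arms of (mu* - mu_a) * N_a(T)  (one-run estimate).
    regret = sum((mu_star - means[a]) * pulls[a] for a in range(len(means)))
    print(f"regret after T={T} turns: {regret:.1f}")   # about 0.05 * T/2 = 250

The random policy incurs regret linear in $T$; the point of algorithms such as KL-UCB is to bring this growth down to order $\ln(T)$.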
Applications
Multi-armed bandit algorithms are used in a variety of fields; for example, they have applications in clinical trials, recommender systems, telecommunications,[4] and precision agriculture.[5]
Algorithm KL-UCB
The algorithm is a UCB-type algorithm based on the principle of optimism in the face of uncertainty: at each turn we compute an upper confidence bound (UCB) $U_a(t)$ for the mean of each arm $a$, and we then pull the arm with the highest UCB.
What distinguishes KL-UCB from other UCB algorithms is that its upper confidence bound is built from an empirical version of the quantity appearing in the Lai–Robbins lower bound,[2] which is what makes it asymptotically optimal.[1]
History
The algorithm was first introduced in 2011 for Bernoulli distributions.[1] It was then extended to one-dimensional exponential families and to bounded distributions in 2013.[6] The algorithm was also extended to Lipschitz bandits in 2014.[9] An adaptation called KL-UCB-Switch, which mixes MOSS[7] and KL-UCB, was developed in 2022 to match both the problem-dependent (Lai–Robbins) and problem-independent (minimax) asymptotic lower bounds.[8]
Formal algorithm
At first, the algorithm pulls each arm once. Then, for each turn $t \ge K$, for each arm $a$, we compute the index
$$U_a(t) = \sup\left\{\, \mathbb{E}(\nu) : \nu \in D \ \text{and} \ \mathrm{KL}\big(\hat{\nu}_a(t), \nu\big) \le \frac{f(t)}{N_a(t)} \,\right\},$$
where
- $\mathrm{KL}$ is the Kullback–Leibler divergence
- $\hat{\nu}_a(t)$ is the empirical distribution of arm $a$ at turn $t$
- $f$ is a well-chosen sequence of positive numbers, often equal to $f(t) = \ln(t) + c\ln(\ln(t))$ with $c \ge 3$.[6]
Then we choose the arm with the highest index:
$$A_{t+1} = \underset{a}{\arg\max}\; U_a(t).$$
We note that the algorithm does not require knowledge of the horizon $T$.
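For a one-dimensional exponential family the index depends only on the empirical mean, and the supremum can be computed by bisection, since $d(\hat{\mu}_a(t), \mu)$ is increasing in $\mu$ to the right of $\hat{\mu}_a(t)$. A minimal Python sketch for Bernoulli rewards (the function names are ours, not code from [1] or [6]):

    import math

    def bernoulli_kl(p, q, eps=1e-12):
        # Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q).
        p = min(max(p, eps), 1 - eps)
        q = min(max(q, eps), 1 - eps)
        return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

    def kl_ucb_index(mu_hat, n, t, c=3.0, iters=50):
        # Largest mu in [mu_hat, 1] with n * KL(mu_hat, mu) <= f(t), found by
        # bisection; f(t) = ln(t) + c*ln(ln(t)), guarded for small t.
        f_t = math.log(t) + c * math.log(max(math.log(t), 1.0))
        lo, hi = mu_hat, 1.0
        for _ in range(iters):
            mid = (lo + hi) / 2
            if n * bernoulli_kl(mu_hat, mid) <= f_t:
                lo = mid    # mid is feasible: move the lower end up
            else:
                hi = mid    # mid violates the constraint: move the top down
        return lo

Fifty bisection steps shrink the bracket by a factor of about $2^{-50}$, far below statistical accuracy, so the index is computed essentially exactly.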
Example
In the special case of Gaussian distributions with fixed variance $\sigma^2$, the Kullback–Leibler divergence between two Gaussians with means $x$ and $y$ is $(x-y)^2/(2\sigma^2)$, so the index has the closed form
$$U_a(t) = \hat{\mu}_a(t) + \sigma\sqrt{\frac{2 f(t)}{N_a(t)}},$$
with $\hat{\mu}_a(t)$ being the empirical mean of arm $a$ at turn $t$.
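Since the optimisation has a closed-form solution here, no numerical search is needed; a short Python sketch (our naming, with the simpler choice $f(t) = \ln(t)$ used in the pseudocode below):

    import math

    def gaussian_kl_ucb_index(mu_hat, n, t, sigma):
        # Closed-form KL-UCB index for Gaussian rewards with known variance:
        # KL(N(x, s^2) || N(y, s^2)) = (x - y)^2 / (2 s^2), so the sup is explicit.
        return mu_hat + sigma * math.sqrt(2.0 * math.log(t) / n)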
Pseudocode
The player gets the set D and the number of arms K
d ← ln(K)
for each arm i do:
    n[i] ← 0                      // number of pulls of arm i
    nu[i] ← empty                 // empirical distribution of arm i
for t from 1 to K do:             // initialisation: pull each arm once
    select arm t
    observe reward r
    n[t] ← n[t] + 1
    nu[t] ← update empirical distribution of arm t with r
for t from K+1 to T do:
    for each arm i do:
        index[i] ← compute_index(n[i], nu[i], D, d)
    select arm a with highest index[a]
    observe reward r
    n[a] ← n[a] + 1
    nu[a] ← update empirical distribution of arm a with r
    d ← ln(t+1)
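Putting the pieces together, a runnable Python version of this pseudocode for Bernoulli arms might look as follows (it reuses kl_ucb_index from the sketch above; since the empirical distribution of a Bernoulli arm is summarised by its mean, only pull counts and reward sums are stored):

    import random

    def kl_ucb_bernoulli(means, T):
        # means: true Bernoulli parameters, used only to draw rewards;
        # the algorithm itself never reads them directly.
        K = len(means)
        n = [0] * K                  # n[i]: number of pulls of arm i
        s = [0.0] * K                # s[i]: sum of rewards of arm i
        for t in range(1, T + 1):
            if t <= K:
                a = t - 1            # initialisation: pull each arm once
            else:                    # pull the arm with the highest index
                a = max(range(K), key=lambda i: kl_ucb_index(s[i] / n[i], n[i], t))
            r = 1.0 if random.random() < means[a] else 0.0
            n[a] += 1
            s[a] += r
        return n

    pulls = kl_ucb_bernoulli([0.5, 0.45], T=10_000)
    print(pulls)   # the arm with mean 0.5 should receive most of the pulls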
Theoretical results
In the multi-armed bandit problem we have the Lai–Robbins[2] asymptotic lower bound on the regret. The algorithm KL-UCB matches this lower bound for one-dimensional exponential families and for distributions bounded in $[0,1]$, with suitable choices of $f$ such as the one given above.[6]
Lai–Robbins lower bound
In 1985 Lai and Robbins proved an asymptotic, problem-dependent lower bound on the regret.
It states that every consistent algorithm on the set $D$ — that is, an algorithm for which, on every bandit problem in $D$ and for all $\alpha > 0$, the regret is subpolynomial (i.e. $R_T = o(T^{\alpha})$) — satisfies:
$$\liminf_{T \to \infty} \frac{R_T}{\ln(T)} \;\ge\; \sum_{a \,:\, \mu_a < \mu^*} \frac{\mu^* - \mu_a}{\mathcal{K}_{\inf}(\nu_a, \mu^*)},$$
where $\mathcal{K}_{\inf}(\nu_a, \mu^*) = \inf\{\mathrm{KL}(\nu_a, \nu) : \nu \in D,\ \mathbb{E}(\nu) > \mu^*\}$ is the smallest Kullback–Leibler divergence between the distribution of arm $a$ and a distribution of $D$ whose mean exceeds $\mu^*$.
This bound is asymptotic (as $T \to \infty$) and gives a first-order lower bound of order $\ln(T)$ with the optimal constant in front of it.
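For instance, for two Bernoulli arms with means $\mu^* = 0.5$ and $\mu_2 = 0.45$ (so that $\mathcal{K}_{\inf}$ reduces to the Bernoulli divergence $d(x,y) = x\ln\frac{x}{y} + (1-x)\ln\frac{1-x}{1-y}$), the denominator is
$$d(0.45, 0.5) = 0.45\ln\frac{0.45}{0.5} + 0.55\ln\frac{0.55}{0.5} \approx 0.0050,$$
so any consistent algorithm must asymptotically incur a regret of at least about $(0.05/0.0050)\ln(T) \approx 10\ln(T)$: arms that are close to optimal are the most expensive to tell apart.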
Regret bound for KL-UCB
The algorithm matches the Lai–Robbins[2] lower bound for one-dimensional exponential-family distributions and for distributions bounded in $[0,1]$.[6]
One-dimensional exponential family
For $D$ a one-dimensional exponential family, with $f(t) = \ln(t) + c\ln(\ln(t))$ and $c \ge 3$, we have the following upper bound on the regret of KL-UCB:[6]
$$R_T \le \sum_{a \,:\, \mu_a < \mu^*} \frac{\mu^* - \mu_a}{d(\mu_a, \mu^*)}\,\ln(T) + O\big(\sqrt{\ln(T)}\big),$$
where $d(x, y)$ denotes the Kullback–Leibler divergence between the two distributions of $D$ with means $x$ and $y$.
Bounded distributions in [0,1]
For $D$ the set of distributions supported on $[0,1]$, and for the same choice of $f$, we have the following upper bound on the regret of KL-UCB:[6]
$$\limsup_{T \to \infty} \frac{R_T}{\ln(T)} \le \sum_{a \,:\, \mu_a < \mu^*} \frac{\mu^* - \mu_a}{\mathcal{K}_{\inf}(\nu_a, \mu^*)},$$
which matches the Lai–Robbins lower bound.
Runtime
For $D$ the set of distributions bounded in $[0,1]$, computing the index of an arm with $n$ observations requires numerical optimisation at every step, and the resulting per-step runtime grows with $n$.[10] It is higher than that of other asymptotically optimal algorithms such as NPTS,[11] MED[12] and IMED.[13][10]
The high runtime of KL-UCB is due to a two-level optimisation: for each arm $a$ and each candidate mean $\mu$, the algorithm evaluates $\mathcal{K}_{\inf}(\hat{\nu}_a(t), \mu)$, and then maximises over $\mu$ subject to the constraint $\mathcal{K}_{\inf}(\hat{\nu}_a(t), \mu) \le f(t)/N_a(t)$. For distributions bounded in $[0,1]$ the inner problem has no closed form and must be solved numerically, which increases the per-step cost.[12][6]
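As a rough illustration of this two-level structure, the sketch below (a simplification under our own naming; it assumes the dual representation of $\mathcal{K}_{\inf}$ for distributions on $[0,1]$ due to Honda and Takemura,[12] and is not one of the implementations benchmarked in [10]) nests a one-dimensional maximisation inside a bisection over the candidate mean:

    import math

    def kinf(xs, mu, grid=200):
        # K_inf(nu_hat, mu) for the empirical distribution of xs in [0, 1]:
        # dual form, max over lambda in [0, 1) of the empirical mean of
        # log(1 - lambda * (x - mu) / (1 - mu)), solved here by grid search.
        best = 0.0
        for k in range(grid):
            lam = k / grid
            val = sum(math.log(1 - lam * (x - mu) / (1 - mu)) for x in xs) / len(xs)
            best = max(best, val)
        return best

    def kl_ucb_index_bounded(xs, t, iters=30):
        # Outer bisection over the candidate mean mu; every step solves the
        # inner K_inf problem numerically, hence the two nested loops.
        f_t = math.log(t)
        lo, hi = sum(xs) / len(xs), 1.0 - 1e-9
        for _ in range(iters):
            mid = (lo + hi) / 2
            if kinf(xs, mid) <= f_t / len(xs):
                lo = mid
            else:
                hi = mid
        return lo

Every evaluation of the inner problem touches all $n$ observations and is repeated at each step of the outer bisection, which is exactly the nesting that makes the per-step cost of KL-UCB grow quickly with $n$.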
References
- ^ a b c Maillard, Odalric-Ambrym; Munos, Rémi; Stoltz, Gilles (2011). "A Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler Divergences". In Kakade, Sham M.; von Luxburg, Ulrike (eds.). Proceedings of the 24th Annual Conference on Learning Theory. Proceedings of Machine Learning Research. Vol. 19. Budapest, Hungary: PMLR. pp. 497–514.
- ^ a b c d Lai, T.L.; Robbins, Herbert (1985). "Asymptotically Efficient Adaptive Allocation Rules". Advances in Applied Mathematics. 6 (1): 4–22. doi:10.1016/0196-8858(85)90002-8.
- ^ Lattimore, Tor; Szepesvári, Csaba (2020). Bandit Algorithms. Cambridge: Cambridge University Press.
- ^ Bouneffouf, Djallel; Rish, Irina (2019). "A survey on practical applications of multi-armed and contextual bandits". arXiv:1904.10040 [cs.LG].
- ^ Gautron, Romain; Baudry, Dorian; Adam, Myriam; Falconnier, Gatien N; Hoogenboom, Gerrit; King, Brian; Corbeels, Marc (2024). "A new adaptive identification strategy of best crop management with farmers". Field Crops Research. 307: 109249.
- ^ a b c d e f g Cappé, Olivier; Garivier, Aurélien; Maillard, Odalric-Ambrym; Munos, Rémi; Stoltz, Gilles (2013). "Kullback-Leibler Upper Confidence Bounds for Optimal Sequential Allocation". The Annals of Statistics. 41 (3): 1516–1541.
- ^ Audibert, Jean-Yves; Bubeck, Sébastien (2009). "Minimax Policies for Adversarial and Stochastic Bandits". Proceedings of the 22nd Annual Conference on Learning Theory (COLT). pp. 217–226.
- ^ Garivier, Aurélien; Hadiji, Hédi; Ménard, Pierre; Stoltz, Gilles (2022). "KL-UCB-switch: Optimal Regret Bounds for Stochastic Bandits from Both a Distribution-Dependent and a Distribution-Free Viewpoints". Journal of Machine Learning Research. 23 (179): 1–66.
- ^ Magureanu, Stefan; Combes, Richard; Proutière, Alexandre (2014). "Lipschitz Bandits: Regret Lower Bounds and Optimal Algorithms". arXiv:1405.4758 [cs.LG].
- ^ a b c d Baudry, Dorian; Pesquerel, Fabien; Degenne, Rémy; Maillard, Odalric-Ambrym (2023). "Fast Asymptotically Optimal Algorithms for Non-Parametric Stochastic Bandits". Advances in Neural Information Processing Systems. 36: 11469–11514.
- ^ Riou, Charles; Honda, Junya (2020). "Bandit Algorithms Based on Thompson Sampling for Bounded Reward Distributions". In Kontorovich, Aryeh; Neu, Gergely (eds.). Proceedings of the 31st International Conference on Algorithmic Learning Theory. Proceedings of Machine Learning Research. Vol. 117. PMLR. pp. 777–826.
- ^ a b Honda, Junya; Takemura, Akimichi (2010). "An Asymptotically Optimal Bandit Algorithm for Bounded Support Models". COLT. pp. 67–79.
- ^ Honda, Junya; Takemura, Akimichi (2015). "Non-Asymptotic Analysis of a New Bandit Algorithm for Semi-Bounded Rewards". Journal of Machine Learning Research. 16 (113): 3721–3756.