The variance of Exp3

In an earlier post we analyzed an algorithm called Exp3 for $k$-armed adversarial bandits for which the expected regret is bounded by
\begin{align*}
R_n = \max_{a \in [k]} \E\left[\sum_{t=1}^n y_{tA_t} – y_{ta}\right] \leq \sqrt{2n k \log(k)}\,.
\end{align*}
The setting of this bound is that $(y_t)_{t\in [n]}$ is a sequence of $[0,1]^k$-valued vectors chosen arbitrarily. Further, in round $t$, Exp3 chooses the random $A_t\in [k]$ and learns its own loss $y_{t A_t}$.

Why the expectation in the above bound? Well, the algorithm is randomizing ($A_t$ is random), hence, if we want to assign some fixed numerical measure to how well the algorithm did, an obvious candidate is to take the expectation of the random regret
\begin{align*}
\hat R_n = \max_{a \in [k]} \sum_{t=1}^n (y_{tA_t} – y_{ta})\,.
\end{align*}
The expectation measures the central tendency of $\hat R_n$, but it bears remembering that it is not the only game in town. In particular, who would not want to know more about the distribution of $\hat R_n$ than its mean?

Now, quickly! Before you read anymore, think and can you guess the magnitude of the variance $\Var[{\hat R}_n]$ for Exp3? If you are like us, you like stable and predictable algorithms and then you must have a strong desire to keep the variance small.

Of course, one can also look at the tails of $\hat R_n$ to get a sense of its distribution. In fact we already did this in the aforementioned post, but for a different algorithm called Exp3-IX. For this algorithm we proved that for any $\delta\in [0,1]$, with probability at least $1 – \delta$, the random regret satisfies
\begin{align*}
\hat R_n = O\left(\sqrt{nk \log(k)} + \sqrt{\frac{nk}{\log(k)}} \log(1/\delta)\right)\,.
\end{align*}
By integrating the above bound you can show that $\Var[\hat R_n] = O(n)$. This seems quite reasonable relative to the worst possible variance, which is $\Omega(n^2)$ (think about why).

You might wonder whether these modifications are necessary. Perhaps Exp3 without modification also enjoys such a bound? Or not? The ugly truth is that Exp3 will have a huge variance. In fact, the situation is as bad as you can possibly imagine: The variance of the regret of Exp3 can be as high as $\Omega(n^2)$. We will show this by demonstrating the existence of adversarial bandits for which the random regret is linear with constant probability. Now you may say: “Wait a second, does not this imply that the expected regret will also be linear?” Not so fast, we would say: the random regret can be negative and since we know that the expected regret is sublinear, we also know that it will be negative and quite negative quite often. What a nice algorithm!

To state our result, we need to discuss how the learning rate of Exp3 should be tuned. For this, recall that Exp3 is the algorithm that samples $A_t$ from $P_t$ given by
\begin{align*}
P_{ta} = \frac{\exp\left(-\eta \sum_{s=1}^{t-1} \hat y_{sa}\right)}{\sum_{b=1}^k \exp\left(-\eta \sum_{s=1}^{t-1} \hat y_{sb}\right)}\,,
\end{align*}
where $\hat y_{sa} = \one{A_s = a} y_{sa} / P_{sa}$ is the importance-weighted loss estimator. Here, recall that $\eta>0$ is the learning rate to be chosen by the user. In particular, the claim that the variance of Exp3 is large will obviously depend on the choice of this parameter. For example, choosing $\eta=0$ (no learning!) will keep the variance of Exp3 nice and small — at most $O(n)$, which is as low as the variance can be in this situation. But no learning is no fun: We demand that the learning should be chosen to guarantee that the expected regret is of order $O(\sqrt{n})$. As it turns out, this demand can only be met if $\eta = \Theta(\sqrt{1/n})$, leading to the main result of the present post:

Theorem (Variance of Exp3):
Suppose that $\eta = a / \sqrt{n}$ for some constant $a$. Then for $n$ large enough there exists a sequence of losses such that $\mathbb{P}(\hat R_n \ge n/4)\ge 1/65$. In particular, this means that $\Var[\hat R_n]=\Omega(n^2)$.

The proof of this statement is ultimately little more than an algebraic calculation. We won’t do this calculation here, but settle for explaining how things go wrong. The bandit that causes issues is the following:
\begin{align*}
y_{t1} = \begin{cases}
0 &\text{if } t < n/2 \\
1 &\text{otherwise}
\end{cases}
y_{t2} = \begin{cases}
\alpha &\text{if } t < n/2 \\
0 &\text{otherwise}\,.
\end{cases}
\end{align*}
where $\alpha \in [1/4,1/2]$ is a carefully chosen constant. In this bandit the second arm is optimal, but for the first $n/2$ rounds looks quite suboptimal. We will argue that with constant probability Exp3 does not play the second arm at all in the last $n/2$ rounds. Obviously, when this happens than its regret will be at least
\begin{align*}
\hat R_n \geq n/2 – \alpha n/2 \geq n/4\,,
\end{align*}
which will give us what we want about the variance of $\hat R_n$.

To see why Exp3 will keep choosing the first arm even in the second half of the episode with constant probability, let’s think about how Exp3 behaves in the first $n/2$ rounds. Since the loss of the first arm is always zero in this period, we have $\hat y_{t1} = 0$ for all $t < n/2$. On the other hand, if $A_t = 2$, then $\hat y_{t2} = \alpha / P_{t2}$. Let $q_0(\alpha) = 1/2$ and \begin{align*} q_{s+1}(\alpha) = \frac{q_s(\alpha) \exp(-\eta \alpha / q_s(\alpha))}{1 – q_s(\alpha) + q_s(\alpha) \exp(-\eta \alpha / q_s(\alpha))}\,. \end{align*} Then by induction it is easy to check that $P_{t2} = q_s(\alpha)$ where $s = T_2(t-1)$ is the number of times action $2$ has been played at the start of round $t$. To see this, recall the incremental update \begin{align*} P_{t+1,a} = \frac{P_{ta} \exp(-\eta \hat y_{ta})}{\sum_{b=1}^k P_{tb} \exp(-\eta \hat y_{tb})}\,. \end{align*} For a fixed $\alpha$ the sequence $q_0(\alpha),q_1(\alpha),\ldots,q_n(\alpha)$ is obviously decreasing. At first it decreases quite slowly. A Taylor series expansion suggests that $q_{s+1}(\alpha) \approx q_s(\alpha)(1 – \alpha \eta)$. This approximation is reasonable until $q_s(\alpha)$ gets small and then suddenly $q_{s+1}(\alpha)$ becomes vanishingly small extremely rapidly. See the figure below with $\eta = 0.01$ and $\alpha = 1/2$. A log scale would emphasize the severity of the drop at the end when $q_s(\alpha)$ goes from $10^{-3}$ to $10^{-5}$ to $10^{-67}$ in just three steps.

Suppose Exp3 has just played action $A_t = 2$ and $T_2(t) = s$. In how many rounds do we expect it to play the second action again? Well, in each round until action $2$ is played we have $P_{t2} = q_s(\alpha)$. So that except for the change that occurs when $t \geq n/2$, we should expect it to take roughly $1/q_s(\alpha)$ rounds before the second action is played again. Now the idea is to show that for sufficiently large $n$ there exists a choice of $\alpha$ such that for some $s$:
\begin{align*}
\end{align*}
This is where $\alpha$ comes in. For poorly chosen $\alpha$ the function $q_s(\alpha)$ can be much larger than $1/(8n)$ for some $s$ and then much much much smaller for $q_{s+1}(\alpha)$. However, by appealing to the intermediate value theorem and a little messing around you can show that there exists an $\alpha$ that satisfies the above display.

Then by the second part and Markov’s inequality it holds with probability at least $1/2$ that $T_2(n/4) \geq s$ and on this event there is a constant probability that $T_2(n/2-1) \geq s+1$ because there are $\Omega(n)$ chances for an $\Omega(1/n)$ probability event to occur. But on this event we have
\begin{align*}
q_{s+1}(\alpha) \leq \exp(-8n \eta \alpha) \leq \exp(-2 \eta n)\,,
\end{align*}
which for our choice of $\eta$ and large enough $n$ is phenomenally small. When this occurs the algorithm never recovers: with high probability, it will keep playing arm $1$ for the remaining $n/2$ rounds.

Take home message

By default, algorithms with good expected regret do not enjoy high probability bounds. The adversarial model is often sold as a robust version of the stochastic model, but if we are not sufficiently careful, the algorithms we propose might not be robust at all. Luckily for us, the required corrections are usually quite straightforward. Nevertheless, this story about the high variance of Exp3 should be a reminder that we should always check the variance of our algorithms.

Notes

Note 1: As we remarked, the constant probability of linear regret is offset in expectation by a constant probability of negative regret. This occurs in the provided example when Exp3 does not play the second arm $s+1$ times in the first $n/2$ rounds. Then it quickly identifies that $a = 2$ is optimal, ultimately playing $A_t = 1$ for most rounds $t < n/2$ and $A_t =2$ for most rounds $t \geq n/2$, which makes the regret very negative.

Note 2: The original algorithm proposed by Peter Auer et al. includes additional exploration, which reduces the variance. The algorithm samples $A_t$ from $\tilde P_t$ given by $\tilde P_{ta} = (1 – \gamma) P_{ta} + \gamma / k$. When $\gamma = O(n^{-p})$ for $p \in (0,1)$ you can show that for sufficiently large $n$ there exists a bandit such that
\begin{align*}
\Prob{\hat R_n \geq c \sqrt{n^{1+p}}} \geq c’\,,
\end{align*}
where $c, c’ > 0$ are universal constants. The standard recommendation that $\gamma = O(n^{-1/2})$ leads to an algorithm for which the variance of the regret is $\Omega(n^{3/2})$. Note that the expected regret is at least $\Omega(n\gamma)$, so choosing a larger $\gamma$ leads to a larger expected regret. In comparison, the variance of the regret for Exp-IX and Exp3.P is $O(n)$, while the expected regret of these algorithms is $O(\sqrt{n})$.

Note 3: Exp3 without exploration is the same as FTRL or mirror descent on the probability simplex with the unnormalized negentropy potential. How cleverly the analysis ‘hides’ the disastrous variance of the resulting algorithm! Perhaps this is because the analysis is hardly probabilistic (only unbiasedness is used, essentially).

Note 4: There are generic ways to prove high probability bounds for bandit algorithms based on mirror descent. We hope this post inspires people to use them a little more. See this paper by Abernethy and Rakhlin.

Note 5: After seeing the first version of this post, fellow bandit aficionado, Sebastian Bubeck pointed out to us that Thompson sampling in the Bayesian setting is subject to the same phenomenon. This is done in the context of Bayesian bandits where there is a prior distribution over the sequences of losses. The relation to the adversarial setting is that the optimal Bayesian regret (when expectation is also taken over the prior) is equal to the minimax regret, a result that we will blog about soon. The claim is that there exist a prior over these sequences which makes the regret of Thompson sampling linear with constant probability. See Theorem 12 in their paper. The construction used is exactly the construction described here. Another one bites the dust. (We will also need to blog about Thompson sampling.)

Note 6: At this stage the reader may also be wondering about the variance of other bandit algorithms in other bandit settings. (We hope you do!) What happens for example with UCB in the stochastic $k$-armed bandit setting? Here, the situation is not so gloomy: stationary stochastic environments cannot mislead that much: The probability of linear regret will decay with time. However, there is no “exponential concentration” of the regret around its $O(\log(n))$ mean: The decay is polynomial, which may be quite disturbing if you are risk averse. These results are from a paper joint with Jean-Yves Audibert and Remi Munos and one of the authors of this blog.

Bibliographic remarks

As far as we know this result is new. It appears in the exercises of Chapter 11 of the book.

That Exp3 does not need additional exploration to enjoy a bound in expectation was noticed by Gille Stoltz. Nowadays this can be seen easily using the FTRL/mirror descent view.

There are several approaches for proving high probability bounds, including implicit exploration (leading to the Exp3-IX algorithm mentioned above) and adding exploration and biasing the loss estimators (leading to what’s called the Exp3.P algorithm).