To derive equation (1), do we need some additional assumptions on the choice of the $a_s$ (e.g. that they form a basis)?

Best,

Zeyad Emam

Could you kindly point out a reference for the general setting, or briefly mention what the corresponding martingale is in the general setting?

Yes, this result holds very generally. Only a martingale structure is being used, and the estimates can be any appropriately measurable functions.
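To sketch what the relevant martingale looks like in the linear setting (this is the construction behind the self-normalised bound of Abbasi-Yadkori, Pál and Szepesvári; the conditional 1-subgaussian noise assumption is mine, added for concreteness): for any fixed $\lambda \in \mathbb{R}^d$,

$$M_t(\lambda) = \exp\left(\sum_{s=1}^{t}\left(\eta_s \langle \lambda, A_s \rangle - \frac{1}{2}\langle \lambda, A_s \rangle^2\right)\right)$$

is a nonnegative supermartingale with $M_0 = 1$, since $A_s$ is measurable with respect to the past and $\mathbb{E}[\exp(\eta_s x) \mid \mathcal{F}_{s-1}] \le \exp(x^2/2)$ for any $\mathcal{F}_{s-1}$-measurable $x$. Integrating over $\lambda$ (the method of mixtures) then yields the usual ellipsoidal confidence set.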

Thanks a lot for your great blog and book!

I was reading Exercise 20.12 in the book on the sequential likelihood ratio confidence set extracted from Lemma 2 of Lai and Robbins (1985). This construction seems to be for the i.i.d. bandit. Can it be generalized to the linear bandit as well?

For the second question, you might start with this paper on generalised linear bandits: https://arxiv.org/pdf/1706.00136.pdf. Wouter Koolen and Remy Degenne also recently presented an algorithm that uses online learning to incrementally update the “policy”, but in the structured setting. I think that paper has not appeared yet.

I have recently been reading the chapters on the stochastic linear bandit and find the material covered here super useful. There are two things I am a little confused about:

1. Unlike the chapters on finite-armed bandits, ETC-type methods (like PEGE in “Linearly Parameterized Bandits” by Paat Rusmevichientong et al.) are not covered here. In that early paper, the authors claim that PEGE can achieve $\sqrt{T}$ regret because, after $c$ rounds of uniform exploration, the per-round regret shrinks as $\frac{1}{c}$. This sounds very counter-intuitive, because the finite-armed case is literally a special case of this setting and we already know such a result is impossible there. What do you think of that?

2. In the proposed methods like LinUCB, a least-squares problem is solved at each step. I wonder if anyone has tried an SGD-style method instead of solving the least-squares problem directly. Of course, in that case the construction of the UCB could be quite different. I am only curious about this possibility.
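On point 1, the back-of-the-envelope calculation behind the $\sqrt{T}$ claim can be sketched numerically. Assuming (as for PEGE on smooth action sets like the unit ball) that $c$ rounds of exploration cost $O(c)$ regret and the subsequent per-round exploitation regret decays like $1/c$, the total over $T$ rounds is roughly $c + T/c$, minimised at $c \approx \sqrt{T}$; the constants below are purely illustrative:

```python
import math

def etc_regret_bound(T: float, c: float) -> float:
    """Heuristic explore-then-commit regret: exploration cost c
    plus exploitation cost T/c (per-round regret ~ 1/c)."""
    return c + T / c

T = 10_000
# Minimising c + T/c over c gives c* = sqrt(T), and a bound of 2*sqrt(T).
c_star = math.sqrt(T)
print(c_star, etc_regret_bound(T, c_star))  # c* = 100.0, bound = 200.0 = 2*sqrt(T)
```

The resolution of the apparent paradox is that the $1/c$ decay relies on the geometry of the action set; it does not hold for a finite set of arms, where the regret of committing to a slightly wrong arm stays constant per round.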
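On point 2, one standard trick (not SGD, but it avoids re-solving the least-squares problem from scratch) is to maintain $V_t^{-1}$ incrementally with the Sherman–Morrison rank-one update, so each round costs $O(d^2)$ and the estimate stays exactly the regularised least-squares solution. A minimal sketch, with hypothetical variable names:

```python
import numpy as np

def sherman_morrison_update(V_inv, b, a, r):
    """Rank-one update after playing action a and observing reward r.

    Maintains V_inv = (V + a a^T)^{-1} via the Sherman-Morrison formula
    and b = sum_s r_s a_s, so that theta_hat = V_inv @ b is exactly the
    regularised least-squares estimate, without re-solving each round.
    """
    Va = V_inv @ a
    V_inv = V_inv - np.outer(Va, Va) / (1.0 + a @ Va)
    b = b + r * a
    return V_inv, b

d = 3
V_inv = np.eye(d)           # inverse of the regulariser V_0 = I
b = np.zeros(d)
theta_true = np.array([1.0, -0.5, 0.2])  # illustrative unknown parameter
rng = np.random.default_rng(0)
for _ in range(100):
    a = rng.normal(size=d)
    r = a @ theta_true + 0.01 * rng.normal()
    V_inv, b = sherman_morrison_update(V_inv, b, a, r)
theta_hat = V_inv @ b       # matches solving the least-squares system directly
```

For a genuinely SGD-style estimator the confidence-set construction would indeed have to change, since the usual ellipsoid is built around the exact least-squares solution.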

Thanks!
