
Upper-Confidence-Bound (UCB) Action Selection



In the ε-greedy method, non-greedy actions are chosen at random for exploration, but indiscriminately: there is no preference for actions that are nearly greedy or whose estimates are particularly uncertain.
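To make the "indiscriminate" exploration concrete, here is a minimal sketch of ε-greedy selection. The function name `epsilon_greedy` and its parameters are hypothetical, chosen only for illustration. Note that the exploratory branch ignores both the value estimates and how often each action has been tried.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick an action index from a list of value estimates (illustrative only)."""
    if random.random() < epsilon:
        # Explore: every action is equally likely, regardless of its
        # estimate or of how many times it has been selected.
        return random.randrange(len(q_values))
    # Exploit: pick the action with the highest current estimate.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=0.0` the function is purely greedy. With any positive `epsilon`, a clearly bad action is explored just as often as a promising but rarely tried one.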


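To take into account both how close each action's estimate is to being maximal and how uncertain that estimate is, one effective approach is to select actions according to

$$A_t \doteq \underset{a}{\operatorname{argmax}}\left[Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}}\right]$$

where $Q_t(a)$ is the current value estimate of action $a$, $N_t(a)$ is the number of times $a$ has been selected prior to time $t$, and $c > 0$ controls the degree of exploration. If $N_t(a) = 0$, then $a$ is considered a maximizing action.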

The idea of UCB action selection is that the square-root term $c\sqrt{\frac{\ln t}{N_t(a)}}$ is a measure of the uncertainty, or variance, in the estimate of $a$'s value. The quantity being maximized over is thus a sort of upper bound on the possible true value of action $a$. Each time $a$ is selected, $N_t(a)$ increases and the uncertainty shrinks. Conversely, each time an action other than $a$ is selected, $t$ increases while $N_t(a)$ does not, so the uncertainty for $a$ grows.
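The selection rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `ucb_select`, the list-based representation of the estimates and counts, and the default `c=2.0` are all assumptions made for the example.

```python
import math

def ucb_select(q_values, counts, t, c=2.0):
    """Return the action index maximizing Q_t(a) + c * sqrt(ln t / N_t(a)).

    q_values[a] is the current estimate for action a, counts[a] is how many
    times a has been selected so far, and t >= 1 is the current time step.
    """
    best_action, best_score = None, -math.inf
    for a, (q, n) in enumerate(zip(q_values, counts)):
        if n == 0:
            # An untried action is treated as a maximizing action.
            return a
        # Estimate plus the uncertainty bonus for this action.
        score = q + c * math.sqrt(math.log(t) / n)
        if score > best_score:
            best_action, best_score = a, score
    return best_action
```

For example, `ucb_select([0.5, 0.4], [10, 1], t=11)` picks action 1: although its estimate is lower, it has been tried only once, so its uncertainty bonus is much larger. Setting `c=0.0` removes the bonus and the same call picks the greedy action 0.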

Pros & Cons

  1. UCB is more difficult than the ε-greedy method to extend beyond bandit problems to more general reinforcement learning settings.
  2. UCB has difficulty dealing with large state spaces and nonstationary problems…

Source: https://blog.csdn.net/lun55423/article/details/111997844