
A Framework for Reinforcement Learning and Planning

Publication year: 2020
Key points: This paper is a survey. Starting from the similarities and differences between RL and planning, it distills the factors common to methods that solve MDPs and assembles them into a framework for reinforcement learning and planning (FRAP). The paper first argues that the main distinction between RL and planning is whether the environment model is known: RL is typically used when the model is unknown, planning when it is known, and combining the two yields today's model-based RL. The survey then answers six questions:

  1. where to put our computational effort,
  2. where to make our next trial,
  3. how to estimate the cumulative return,
  4. how to back-up,
  5. how to represent the solution, and
  6. how to update the solution.

On the environment side, the key distinction is between reversible and irreversible. Reversible means the environment state can be reset and revisited, effectively allowing time to be rewound. This splits into reversible analytic and reversible sample: reversible analytic means the transition probabilities are fully known and can be computed directly, while reversible sample means no closed form is known but the environment can be queried for samples from arbitrary states. Irreversible means the environment cannot be reset to or revisit a given state; you can only move forward, never back. RL is then typically used in the irreversible setting and planning in the reversible one ("RL fundamentally limits itself to irreversible sample environments, planning always assumes a reversible environment (either analytic or sample)").
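A minimal sketch of these three access types as Python interfaces (my own illustration, not code from the paper; all class and method names are hypothetical):

```python
# Illustrative interfaces for the survey's three environment access types.

class ReversibleAnalytic:
    """Transition probabilities are known; expectations can be computed directly."""
    def transition_probs(self, state, action):
        # Returns {next_state: (probability, reward)} for the full model.
        raise NotImplementedError

class ReversibleSample:
    """No closed-form model, but any state can be queried for a sample."""
    def sample(self, state, action):
        # Returns one sampled (next_state, reward) from an arbitrary query state.
        raise NotImplementedError

class IrreversibleSample:
    """The standard RL setting: roll forward from the current state only."""
    def reset(self):
        raise NotImplementedError
    def step(self, action):
        # Returns (next_state, reward, done); no rewinding to earlier states.
        raise NotImplementedError
```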
The paper first covers planning, defined as "any process that takes a model as input and produces or improves a policy for interacting with the modeled environment". It covers dynamic programming (DP), heuristic search, sample-based search, gradient-based planning, and direct optimization. Planning further splits into open-loop and closed-loop: open-loop plans the whole action sequence up front and then executes it step by step, while closed-loop re-plans after every executed action.
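To make the open-loop/closed-loop distinction concrete, here is a minimal random-shooting sketch (my own illustration, not from the paper), reusing the hypothetical `ReversibleSample`/`IrreversibleSample` interfaces above: the inner search scores open-loop action sequences under the model, while the outer loop is closed-loop, executing only the first action before re-planning:

```python
import random

def rollout_return(model, state, plan, gamma=0.99):
    """Score one open-loop action sequence under a reversible sample model."""
    total, discount = 0.0, 1.0
    for action in plan:
        state, reward = model.sample(state, action)
        total += discount * reward
        discount *= gamma
    return total

def closed_loop_control(env, model, actions, horizon=10, n_plans=100):
    """Closed-loop planning: re-plan from scratch after every executed action."""
    state, done = env.reset(), False
    while not done:
        plans = [[random.choice(actions) for _ in range(horizon)]
                 for _ in range(n_plans)]
        best = max(plans, key=lambda p: rollout_return(model, state, p))
        state, _, done = env.step(best[0])   # execute only the first action
```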
Next it covers model-free reinforcement learning, discussing value and policy representations, on- and off-policy bootstrapping, exploration, generalization, and direct policy optimization.
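As a concrete instance of several of these elements together, here is a minimal tabular Q-learning loop (my own sketch, not code from the survey): off-policy bootstrapping via the max over next-state values, epsilon-greedy exploration, and a tabular value representation:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)                       # tabular value representation
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Exploration: epsilon-greedy over current value estimates.
            if random.random() < eps:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Off-policy bootstrapping: back up toward the greedy target.
            target = reward + (0.0 if done else
                               gamma * max(Q[(next_state, a)] for a in actions))
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```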
It then turns to model-based reinforcement learning, covering sampling additional data, multi-step approximate dynamic programming, backward trials, and value gradients.
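"Sampling additional data" is the Dyna idea: transitions gathered in the real environment train a model, and the model then generates extra imagined transitions to update the same value function at no additional environment cost. A minimal sketch of the planning half (my own illustration; `Q` is the defaultdict from the sketch above, and the "model" here is simply a buffer of memorized transitions):

```python
import random

def dyna_planning_updates(Q, model_buffer, actions, n_updates=50,
                          alpha=0.1, gamma=0.99):
    """Refine Q with imagined transitions replayed from a learned model."""
    # model_buffer: {(state, action): (next_state, reward)} seen in the real env.
    for _ in range(n_updates):
        (state, action), (next_state, reward) = random.choice(list(model_buffer.items()))
        target = reward + gamma * max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (target - Q[(state, action)])
```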

After laying out these basics, the paper proceeds to answer the six questions in turn.

Finally, the paper ties the pieces together and points out that "the lines between planning and learning are actually blurry, and frequently based on convention rather than necessity".

Summary: The paper raises many directions worth following up on, e.g. the open-loop vs. closed-loop forms of planning; two-phase exploration, where "The first step (Plan) is of course only feasible when we have a reversible model"; and whether a tree's representation could also be made continuous rather than a set of discrete nodes.
Questions: none.

Source: https://www.cnblogs.com/initial-h/p/16133392.html