Discovering and Achieving Goals via World Models

2022-05-04 22:32:31 Author: Internet

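Published: 2021 (NeurIPS 2021)
Key points: This paper proposes the Latent Explorer Achiever (LEXA) algorithm, which trains an explorer policy and an achiever policy on imagined rollouts from a learned world model. Both policies are learned without supervision, and the result transfers zero-shot to other tasks. The advantage is that earlier exploration methods could only return the agent to states it had already visited, whereas the world model and the explorer can discover states the agent has never been to. Those states provide a diverse set of targets to train on, which is what makes unsupervised learning and zero-shot transfer possible (Unlike prior methods that explore by reaching previously visited states, the explorer plans to discover unseen surprising states through foresight).
Concretely, the explorer and the achiever are trained separately. The explorer first plans inside the model to find novel states. The resulting action sequence is then executed in the real environment to obtain real states. Finally, those states are used as targets for the achiever to learn from. Once training is done, the achiever can be applied directly to other tasks (the achiever solves tasks specified as goal images zero-shot without any additional learning).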

The world model is a Recurrent State Space Model (RSSM), from Learning Latent Dynamics for Planning from Pixels.

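The explorer is trained to maximize an exploration reward, which is obtained by estimating model uncertainty. An ensemble of models is first trained to make 1-step predictions, and the variance of their predictions is then used as the reward.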
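To make the disagreement reward concrete, here is a minimal NumPy sketch. It assumes the ensemble's 1-step predictions have already been stacked into one array; the function name `disagreement_reward`, the array layout, and the averaging over feature dimensions are illustrative choices, not details taken from the paper.

```python
import numpy as np

def disagreement_reward(predictions: np.ndarray) -> np.ndarray:
    """Exploration reward from ensemble disagreement (illustrative sketch).

    predictions: array of shape (K, B, D) holding the 1-step predictions of
                 K ensemble members for B states, each with D features.
    Returns an array of shape (B,): the variance across ensemble members,
    averaged over the feature dimension (the averaging is an assumption).
    """
    if predictions.ndim != 3:
        raise ValueError("expected predictions with shape (K, B, D)")
    return predictions.var(axis=0).mean(axis=-1)

rng = np.random.default_rng(0)
K, B, D = 5, 3, 4

# All ensemble members predict the same thing: a well-known state.
agree = np.tile(rng.normal(size=(1, B, D)), (K, 1, 1))
# Members disagree: a state the models are uncertain about.
disagree = agree + rng.normal(scale=1.0, size=(K, B, D))

r_known = disagreement_reward(agree)
r_novel = disagreement_reward(disagree)
print("reward where the ensemble agrees:   ", r_known)
print("reward where the ensemble disagrees:", r_novel)
```

States on which the ensemble agrees receive a reward close to zero, so the explorer is pushed toward states where the models still disagree.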

With this reward, the policy and the value function are then learned by RL inside the model.

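This is the same procedure as in the Dreamer algorithm (DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION). Afterwards, the policy samples trajectories in the real environment into the buffer, and these are used to update the model.
The achiever's policy is then trained inside the model on sampled goals.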
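Dreamer-style value learning is usually written in terms of λ-returns computed along an imagined rollout. The sketch below shows that recursion in NumPy under stated assumptions: `rewards[t]` is the reward at imagined step t, `next_values[t]` is the value estimate of the state reached after step t, and the default `gamma` and `lam` are placeholder values rather than settings from the paper.

```python
import numpy as np

def lambda_returns(rewards, next_values, gamma=0.99, lam=0.95):
    """Lambda-return targets along one imagined rollout (illustrative sketch).

    rewards:     length-H sequence, reward at each imagined step.
    next_values: length-H sequence, value estimate of the state reached
                 after each step; the last entry bootstraps the return.
    Recursion: R_t = r_t + gamma * ((1 - lam) * v_{t+1} + lam * R_{t+1}),
    with R_H initialised to the last value estimate.
    """
    rewards = np.asarray(rewards, dtype=float)
    next_values = np.asarray(next_values, dtype=float)
    if rewards.shape != next_values.shape or rewards.ndim != 1:
        raise ValueError("rewards and next_values must be 1-D and equal length")
    returns = np.empty_like(rewards)
    carry = next_values[-1]  # bootstrap from the final value estimate
    for t in reversed(range(len(rewards))):
        carry = rewards[t] + gamma * ((1.0 - lam) * next_values[t] + lam * carry)
        returns[t] = carry
    return returns

# lam = 0 reduces to 1-step TD targets; lam = 1 is a bootstrapped
# Monte Carlo return over the imagined horizon.
td_targets = lambda_returns([1.0, 1.0, 1.0], [2.0, 3.0, 4.0], gamma=0.5, lam=0.0)
mc_targets = lambda_returns([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], gamma=0.5, lam=1.0)
print(td_targets, mc_targets)
```

The value function regresses onto these targets and the policy is updated to increase them, using either the exploration reward or the goal-reaching reward depending on which policy is being trained.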

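Here \(x_g\) is a goal from the real environment sampled from the buffer, and the encoder maps it to an embedding \(e_g\). The reward for this task is a measure of distance to the goal. One option is the cosine between the current state and the goal state.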
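As a rough illustration of the cosine option, the following NumPy sketch scores a batch of states against a goal embedding. It assumes both live in the same vector space; how the paper maps model states and \(e_g\) into a common space is not covered in this note, and the names `cosine_reward` and `eps` are hypothetical.

```python
import numpy as np

def cosine_reward(states: np.ndarray, goal_embedding: np.ndarray,
                  eps: float = 1e-8) -> np.ndarray:
    """Cosine similarity between each state and the goal (illustrative sketch).

    states:         array of shape (B, D).
    goal_embedding: array of shape (D,), assumed comparable to the states.
    Returns shape (B,), with values in [-1, 1]; higher means closer to the goal.
    """
    s = states / (np.linalg.norm(states, axis=-1, keepdims=True) + eps)
    g = goal_embedding / (np.linalg.norm(goal_embedding) + eps)
    return s @ g

goal = np.array([1.0, 0.0, 0.0])
states = np.array([
    [2.0, 0.0, 0.0],   # same direction as the goal
    [0.0, 3.0, 0.0],   # orthogonal to the goal
    [-1.0, 0.0, 0.0],  # opposite direction
])
rewards = cosine_reward(states, goal)
print(rewards)
```

Because the vectors are normalized first, only the direction of the state matters, not its magnitude.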

Alternatively, the distance can be a learned quantity related to how many steps it takes to reach the goal state.
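One way such a step-count signal could be trained is by regressing on the number of steps that separate two states in the same rollout. The sketch below only builds the training pairs and targets; the helper name `temporal_distance_pairs`, the uniform pair sampling, and the normalization by rollout length are assumptions made for illustration, and the regression model itself is left out.

```python
import numpy as np

def temporal_distance_pairs(trajectory: np.ndarray, num_pairs: int,
                            rng: np.random.Generator):
    """Sample (state, goal, target) triples from one rollout (illustrative sketch).

    trajectory: array of shape (H, D), states in visiting order.
    Returns start states, goal states, and targets in [0, 1), where the
    target is the step gap between the two states divided by H.
    """
    horizon = len(trajectory)
    i = rng.integers(0, horizon, size=num_pairs)
    j = rng.integers(0, horizon, size=num_pairs)
    start, end = np.minimum(i, j), np.maximum(i, j)  # goal comes later in time
    targets = (end - start) / horizon
    return trajectory[start], trajectory[end], targets

rng = np.random.default_rng(0)
H = 10
# Toy rollout whose single feature is the time index, so the gap is easy to check.
toy_trajectory = np.arange(H, dtype=float)[:, None]
starts, goals, targets = temporal_distance_pairs(toy_trajectory, num_pairs=6, rng=rng)
print(np.column_stack([starts[:, 0], goals[:, 0], targets]))
```

A regressor fitted on these triples predicts roughly how far apart two states are in time, and the negative of its prediction can serve as the achiever's reward.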

Training is again done with Dreamer. The full algorithm is as follows (the algorithm figure is not reproduced here).
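The steps described in this note can be arranged into a control-flow skeleton like the one below. This is only a sketch of the ordering as summarized here, not the paper's algorithm listing: every component is a hypothetical stub passed in by the caller, and the names `training_loop`, `update_world_model`, `train_explorer`, `train_achiever`, and `explore_env` are invented for illustration.

```python
import random

def training_loop(num_iterations, initial_data, update_world_model,
                  train_explorer, train_achiever, explore_env, seed=0):
    """Control flow only; all callables are hypothetical stand-ins."""
    rng = random.Random(seed)
    buffer = list(initial_data)
    for _ in range(num_iterations):
        update_world_model(buffer)      # fit the world model on replay data
        train_explorer()                # imagination + disagreement reward
        goal = rng.choice(buffer)       # goal sampled from the replay buffer
        train_achiever(goal)            # imagination + goal-distance reward
        buffer.extend(explore_env())    # explorer acts in the real environment
    return buffer

# Stub components that only record the order in which they are called.
calls = []

def stub_update_world_model(buffer):
    calls.append("model")

def stub_train_explorer():
    calls.append("explorer")

def stub_train_achiever(goal):
    calls.append("achiever")

def stub_explore_env():
    calls.append("collect")
    return ["new_observation"]

final_buffer = training_loop(
    num_iterations=2,
    initial_data=["initial_observation"],
    update_world_model=stub_update_world_model,
    train_explorer=stub_train_explorer,
    train_achiever=stub_train_achiever,
    explore_env=stub_explore_env,
)
print(calls)
print(final_buffer)
```

At test time only the achiever would be used, conditioned on a goal image, with no further learning.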

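Summary: A very nice idea. The main benefit is that it can explore states that have never been visited, whereas earlier exploration methods first reach a state and then add reward so that it gets reached again. In effect, one is foresight and the other is hindsight. It also achieves unsupervised learning directly: the explorer finds new goals and the achiever learns the optimal policy, which gives zero-shot transfer and is a further step forward. I had also thought of using model uncertainty to find new states, taking actions to explore them in the environment, and then updating in a DQN-like way. Evidently many people arrive at the same idea.
Question: This feels copied from Planning to Explore via Self-Supervised World Models; the two look practically identical. On a closer look at the authors, it turns out to be the same group. So why publish this as two separate top-conference papers? If a difference has to be named, is it that the earlier paper only achieved few-shot while this one achieves zero-shot? Or that the earlier paper did not strictly separate the explorer from the achiever?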

Source: https://www.cnblogs.com/initial-h/p/16222721.html