
David Silver--deep RL--Lesson1


Characteristics of Reinforcement Learning

Unlike other machine learning methods, reinforcement learning has these features:

  1. There is no supervisor, only a reward signal. Unlike supervised and unsupervised learning, RL has no notion of the "correct action"; learning is more like a child's repeated trial and error.
  2. Feedback is delayed rather than immediate. E.g. only after many steps can we tell whether the initial choice was right or wrong.
  3. Time really matters for a reinforcement learning system (the data arrive sequentially).
  4. The agent's actions affect the subsequent data it receives (i.e. its environment).

Concepts

  1. Reward: a reward $R_t$ is a scalar feedback signal. Unlike the win/lose objective of a game, it indicates how well the agent is doing at step $t$. The agent's goal is to maximise cumulative reward.
  2. Reward Hypothesis: All goals can be described by the maximisation of expected cumulative reward.
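
A minimal sketch of the quantity being maximised, with made-up reward values (not from the lecture):

```python
# Minimal sketch: the agent's objective is the cumulative reward it collects.
# The reward values below are invented purely for illustration.
rewards = [0.0, 0.0, 1.0, -0.5, 2.0]  # hypothetical R_1, ..., R_5

cumulative_reward = sum(rewards)
print(cumulative_reward)  # 2.5 -- the agent tries to maximise the expected value of this sum
```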

(Figure: the agent-environment interaction loop.)
At each time step the agent receives two inputs, an observation and a reward, and produces one output, an action.
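
The loop can be sketched roughly as follows; `Environment` and `Agent` are hypothetical placeholder classes, meant only to show the two inputs and one output at each step:

```python
import random

class Environment:
    """Hypothetical environment: emits an observation and a reward for each action."""
    def step(self, action):
        observation = random.random()          # next observation O_{t+1}
        reward = 1.0 if action == 1 else 0.0   # next reward R_{t+1}
        return observation, reward

class Agent:
    """Hypothetical agent: maps what it has received so far to an action."""
    def act(self, observation, reward):
        return random.choice([0, 1])           # action A_t

env, agent = Environment(), Agent()
observation, reward = 0.0, 0.0                 # inputs before the first step
for t in range(5):
    action = agent.act(observation, reward)    # the agent's single output
    observation, reward = env.step(action)     # the agent's two inputs for the next step
```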

  1. History: the history is the sequence of observations, actions, and rewards:
    $H_t = A_1, O_1, R_1, \cdots, A_t, O_t, R_t$
    The algorithm's goal is to build a mapping from the history to the next action.

  2. State: the state is the information used to determine what happens next, i.e. everything we need in order to make the next decision:
    $S_t = f(H_t)$

  3. Environment state: the environment state $S^e_t$ is the environment's private representation, i.e. all the information the environment uses to produce its next observation and reward. It is usually only partly visible to the agent, and even the visible part may be irrelevant to the agent's decision.

  4. Agent state: the agent state $S^a_t$ is the agent's internal representation; this is the information actually used by the reinforcement learning algorithm:
    $S^a_t = f(H_t)$

  5. Information state: an information state (a.k.a. Markov state) contains all useful information from the history.

  6. A state $S_t$ is Markov if and only if: $\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \cdots, S_t]$
    The future is independent of the past given the present:
    $H_{1:t} \rightarrow S_t \rightarrow H_{t+1:\infty}$
    Once the state $S_t$ is known, the Markov property says that it already contains all useful information, so the history can be discarded and only the state observed.
    E.g. the environment state $S^e_t$ is Markov; the entire history $H_t$ is also a Markov state.

  7. Full observability: the agent directly observes the environment state:
    $O_t = S^a_t = S^e_t$
    This is the best case: the agent has access to all of the environment's information.
    agent state = environment state = information state.
    Formally, this is a Markov decision process (MDP).

  8. Partial observability: agent indirectly observes environment.
    eg: a robot with camera vision isn’t told its absolute location.
    agent state ≠ environment state.
    Formally, this is a partially observable Markov decision process (POMDP).
    The agent must construct its own state representation $S^a_t$, e.g.
    the complete history: $S^a_t = H_t$,
    or a belief over environment states (a Bayesian view): $S^a_t = (\mathbb{P}[S^e_t = s^1], \cdots, \mathbb{P}[S^e_t = s^n])$ (see the belief-state sketch after this list).

  9. Exploration vs. exploitation:
    Exploration finds more information about the environment: it tries uncertain actions across the whole space, deliberately giving up some reward to learn about the environment and to avoid getting stuck in a local optimum.
    Exploitation exploits known information to maximise reward: it searches near the current best choice, using what is already known to find a better solution.
    (A simple scheme for balancing the two, ε-greedy, is sketched after this list.)

  10. Prediction vs. control:
    Prediction: evaluate the future given a policy (e.g. estimate its value function).
    Control: optimise the future, i.e. find the best policy.
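
For partial observability (item 8), a rough sketch of the belief-state idea: the agent keeps a probability vector over environment states and updates it from each observation with Bayes' rule. The two-state setup and observation likelihoods below are invented for illustration:

```python
# Belief state S^a_t = (P[S^e_t = s^1], ..., P[S^e_t = s^n]) for a hypothetical
# two-state environment; the observation likelihoods are made up.
belief = [0.5, 0.5]  # prior probability of environment states s^1 and s^2

# P[observation | environment state], chosen arbitrarily for this example
likelihood = {
    "bright": [0.9, 0.2],
    "dark":   [0.1, 0.8],
}

def update_belief(belief, observation):
    """Bayes update: posterior is proportional to likelihood times prior."""
    unnormalised = [likelihood[observation][i] * belief[i] for i in range(len(belief))]
    total = sum(unnormalised)
    return [p / total for p in unnormalised]

belief = update_belief(belief, "dark")
print(belief)  # the agent's state is now this belief vector rather than the raw history
```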
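
For the exploration/exploitation trade-off (item 9), one common and simple scheme is ε-greedy action selection; this is a generic sketch, not something prescribed by the lecture:

```python
import random

def epsilon_greedy(action_values, epsilon=0.1):
    """With probability epsilon explore (pick a random action); otherwise
    exploit by picking the action with the highest estimated value."""
    if random.random() < epsilon:
        return random.randrange(len(action_values))                        # exploration
    return max(range(len(action_values)), key=lambda a: action_values[a])  # exploitation

q = [0.2, 0.5, 0.1]          # hypothetical value estimates for three actions
action = epsilon_greedy(q)   # usually action 1, occasionally a random action
```

In practice ε is often decayed over time, so the agent explores early on and exploits more as its estimates improve.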

Main Components of an RL Agent (not all are required)

  1. Policy: the agent's behaviour function, a map from state to action: $a = \pi(s)$
    Stochastic policy: $\pi(a \mid s) = \mathbb{P}[A = a \mid S = s]$
  2. Value function: how good each state and/or action is; a prediction of future reward (see the sketch after this list):
    $v_\pi(s) = \mathbb{E}[R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \cdots \mid S_t = s]$
    $\gamma$ is the discount factor: future rewards are taken into account, but immediate rewards carry more weight.
  3. Model: the agent's representation of the environment (not necessarily the real environment, but the agent's view of it).
    A model predicts what the environment will do next.
  4. Transitions: $\mathcal{P}$ predicts the next state (the dynamics):
    $\mathcal{P}^a_{ss'} = \mathbb{P}[S' = s' \mid S = s, A = a]$
  5. Rewards: $\mathcal{R}$ predicts the next (immediate) reward:
    $\mathcal{R}^a_s = \mathbb{E}[R \mid S = s, A = a]$
    Note the distinction between the value function and the reward: the value function contains a prediction and looks far ahead, while the reward is immediate feedback from the current environment; the value function is built up from many immediate rewards.
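
As a rough illustration of the value function (item 2), the sketch below estimates $v(s)$ for a tiny invented Markov reward process by averaging discounted returns over sampled episodes; the states, rewards and $\gamma$ are all hypothetical:

```python
import random

gamma = 0.9  # discount factor: future rewards count, but less than immediate ones

# Hypothetical Markov reward process (not from the lecture): from state "A" we
# move to "B" or "C" with equal probability; "B" and "C" are terminal.
transitions = {"A": ["B", "C"], "B": [], "C": []}
rewards = {"A": 0.0, "B": 1.0, "C": -1.0}

def sample_return(state):
    """Sample one episode starting from `state` and return its discounted return G."""
    G, discount = 0.0, 1.0
    while True:
        G += discount * rewards[state]
        if not transitions[state]:
            return G
        state = random.choice(transitions[state])
        discount *= gamma

# Monte Carlo estimate of v(A): the average of many sampled returns
estimate = sum(sample_return("A") for _ in range(10000)) / 10000
print(estimate)  # close to 0, since B (+1) and C (-1) are reached equally often
```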

Fundamental Problems in RL

Two fundamental problems in sequential decision making:

  1. Reinforcement learning:
    The environment is initially unknown; the agent interacts with the environment; the agent improves its policy.
  2. Planning:
    A model of the environment is known; the agent performs computations with its model, without any external interaction; the agent improves its policy.
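
A hedged sketch of the planning setting: with a fully known (and here invented) model, the agent can compute a value function by pure computation, e.g. value iteration, without ever interacting with a real environment:

```python
# Planning sketch: the model (transition probabilities P and rewards R) is fully
# known and invented for illustration; value iteration computes state values
# from the model alone, without acting in a real environment.
gamma = 0.9
states = ["s1", "s2"]
actions = ["left", "right"]

P = {("s1", "left"): {"s1": 1.0}, ("s1", "right"): {"s2": 1.0},
     ("s2", "left"): {"s1": 1.0}, ("s2", "right"): {"s2": 1.0}}
R = {("s1", "left"): 0.0, ("s1", "right"): 1.0,
     ("s2", "left"): 0.0, ("s2", "right"): 2.0}

V = {s: 0.0 for s in states}
for _ in range(100):  # repeated sweeps until the values stop changing much
    V = {s: max(R[(s, a)] + gamma * sum(p * V[s_next] for s_next, p in P[(s, a)].items())
                for a in actions)
         for s in states}

print(V)  # values of an optimal policy, computed from the known model alone
```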

Source: https://blog.csdn.net/pikapuka/article/details/120559748