
18_Reinforcement Learning_CartPole_reduce_mean_Q-Value Iteration_Q-learning_DQN_get_weights


     Reinforcement Learning (RL) is one of the most exciting fields of Machine Learning today, and also one of the oldest. It has been around since the 1950s, producing many interesting applications over the years (for more details, be sure to check out Richard Sutton and Andrew Barto’s book on RL, Reinforcement Learning: An Introduction (MIT Press)), particularly in games (e.g., TD-Gammon, a Backgammon-playing program) and in machine control, but seldom making the headline news. But a revolution took place in 2013, when researchers from a British startup called DeepMind demonstrated a system that could learn to play just about any Atari game from scratch (Volodymyr Mnih et al., “Playing Atari with Deep Reinforcement Learning,” arXiv preprint arXiv:1312.5602 (2013)), eventually outperforming humans (Volodymyr Mnih et al., “Human-Level Control Through Deep Reinforcement Learning,” Nature 518 (2015): 529–533) in most of them, using only raw pixels as inputs and without any prior knowledge of the rules of the games (check out the videos of DeepMind’s system learning to play Space Invaders, Breakout, and other video games at https://homl.info/dqn3 or https://www.youtube.com/watch?v=ePv0Fs9cGgU&list=PLkqhIF5RuX2c5SeAQE0wRw5gdOSjpixXI). This was the first of a series of amazing feats, culminating in March 2016 with the victory of their system AlphaGo against Lee Sedol, a legendary professional player of the game of Go, and in May 2017 against Ke Jie, the world champion. No program had ever come close to beating a master of this game, let alone the world champion. Today the whole field of RL is boiling with new ideas, with a wide range of applications. DeepMind was bought by Google for over $500 million in 2014.

     So how did DeepMind achieve all this? With hindsight it seems rather simple: they applied the power of Deep Learning to the field of Reinforcement Learning, and it worked beyond their wildest dreams. In this chapter we will first explain what Reinforcement Learning is and what it’s good at, then present two of the most important techniques in Deep Reinforcement Learning: policy gradients and deep Q-networks (DQNs), including a discussion of Markov decision processes (MDPs). We will use these techniques to train models to balance a pole on a moving cart; then I’ll introduce the TF-Agents library, which uses state-of-the-art algorithms that greatly simplify building powerful RL systems, and we will use the library to train an agent to play Breakout, the famous Atari game. I’ll close the chapter by taking a look at some of the latest advances in the field.

Learning to Optimize Rewards

     In Reinforcement Learning, a software agent makes observations and takes actions within an environment, and in return it receives rewards. Its objective is to learn to act in a way that will maximize its expected rewards over time. If you don’t mind a bit of anthropomorphism, you can think of positive rewards as pleasure, and negative rewards as pain (the term “reward” is a bit misleading in this case). In short, the agent acts in the environment and learns by trial and error to maximize its pleasure and minimize its pain.

This is quite a broad setting, which can apply to a wide variety of tasks. Here are a few examples (see Figure 18-1):

Figure 18-1. Reinforcement Learning examples: (a) robotics, (b) Ms. Pac-Man, (c) Go player, (d) thermostat, (e) automatic trader (Image (a) is from NASA (public domain). (b) is a screenshot from the Ms. Pac-Man game, copyright Atari (fair use in this chapter). Images (c) and (d) are reproduced from Wikipedia. (c) was created by user Stevertigo and released under Creative Commons BY-SA 2.0. (d) is in the public domain. (e) was reproduced from Pixabay, released under Creative Commons CC0.)

     Note that there may not be any positive rewards at all; for example, the agent may move around in a maze, getting a negative reward at every time step, so it had better find the exit as quickly as possible! There are many other examples of tasks to which Reinforcement Learning is well suited, such as self-driving cars, recommender systems, placing ads on a web page, or controlling where an image classification system should focus its attention.

Policy Search

     The algorithm a software agent uses to determine its actions is called its policy. The policy could be a neural network taking observations as inputs and outputting the action to take (see Figure 18-2).
Figure 18-2. Reinforcement Learning using a neural network policy

     The policy can be any algorithm you can think of, and it does not have to be deterministic. In fact, in some cases it does not even have to observe the environment! For example, consider a robotic vacuum cleaner whose reward is the amount of dust it picks up in 30 minutes. Its policy could be to move forward with some probability p every second, or randomly rotate left or right with probability 1 – p. The rotation angle would be a random angle between –r and +r. Since this policy involves some randomness, it is called a stochastic policy. The robot will have an erratic trajectory, which guarantees that it will eventually get to any place it can reach and pick up all the dust. The question is, how much dust will it pick up in 30 minutes?

     How would you train such a robot? There are just two policy parameters you can tweak: the probability p and the angle range r. One possible learning algorithm could be to try out many different values for these parameters, and pick the combination that performs best (see Figure 18-3). This is an example of policy search, in this case using a brute force approach. When the policy space is too large (which is generally the case), finding a good set of parameters this way is like searching for a needle in a gigantic haystack.
Figure 18-3. Four points in policy space (left) and the agent’s corresponding behavior (right)
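To make the brute-force idea concrete, here is a minimal random-search sketch over the two parameters p and r. Note that evaluate_policy is a hypothetical placeholder standing in for running the robot (or a simulator) for 30 minutes and measuring the dust collected:

import random

def evaluate_policy(p, r):
    # Hypothetical placeholder: a real version would run the robot or a
    # simulator for 30 minutes and return the amount of dust picked up.
    return random.random()

best_score, best_params = -1.0, None
for _ in range(1000):                # try 1,000 random parameter combinations
    p = random.uniform(0.0, 1.0)     # probability of moving forward each second
    r = random.uniform(0.0, 180.0)   # maximum rotation angle, in degrees
    score = evaluate_policy(p, r)
    if score > best_score:
        best_score, best_params = score, (p, r)

print("best (p, r):", best_params, "score:", best_score)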

     Another way to explore the policy space is to use genetic algorithms. For example, you could randomly create a first generation of 100 policies and try them out, then “kill” the 80 worst policies (it is often better to give the poor performers a slight chance of survival, to preserve some diversity in the “gene pool”) and make the 20 survivors produce 4 offspring each. An offspring is a copy of its parent (if there is a single parent, this is called asexual reproduction; with two or more parents, it is called sexual reproduction; an offspring’s genome, in this case a set of policy parameters, is randomly composed of parts of its parents’ genomes) plus some random variation. The surviving policies plus their offspring together constitute the second generation. You can continue to iterate through generations this way until you find a good policy.
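As a rough sketch of this genetic approach (again with a hypothetical evaluate placeholder standing in for measuring a policy’s reward, and Gaussian noise as the “random variation”):

import random

def evaluate(policy):
    # Hypothetical placeholder for running the policy and measuring its total reward.
    return random.random()

def mutate(policy, scale=0.1):
    # An offspring is a copy of its parent plus some random variation.
    return [param + random.gauss(0.0, scale) for param in policy]

population = [[random.random(), random.random()] for _ in range(100)]  # 1st generation
for generation in range(10):
    ranked = sorted(population, key=evaluate, reverse=True)
    survivors = ranked[:20]               # "kill" the 80 worst policies
    offspring = [mutate(parent) for parent in survivors for _ in range(4)]
    population = survivors + offspring    # 20 survivors + 80 offspring = next generation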

     Yet another approach is to use optimization techniques, by evaluating the gradients of the rewards with regard to the policy parameters, then tweaking these parameters by following the gradients toward higher rewards (this is called Gradient Ascent; it’s just like Gradient Descent but in the opposite direction: maximizing instead of minimizing). We will discuss this approach, called policy gradients (PG), in more detail later in this chapter. Going back to the vacuum cleaner robot, you could slightly increase p and evaluate whether doing so increases the amount of dust picked up by the robot in 30 minutes; if it does, then increase p some more, or else reduce p. We will implement a popular PG algorithm using TensorFlow, but before we do, we need to create an environment for the agent to live in—so it’s time to introduce OpenAI Gym.

Introduction to OpenAI Gym

     One of the challenges of Reinforcement Learning is that in order to train an agent, you first need to have a working environment. If you want to program an agent that will learn to play an Atari game, you will need an Atari game simulator. If you want to program a walking robot, then the environment is the real world, and you can directly train your robot in that environment, but this has its limits: if the robot falls off a cliff, you can’t just click Undo. You can’t speed up time either; adding more computing power won’t make the robot move any faster. And it’s generally too expensive to train 1,000 robots in parallel. In short, training is hard and slow in the real world, so you generally need a simulated environment at least for bootstrap training. For example, you may use a library like PyBullet or MuJoCo for 3D physics simulation.

     OpenAI Gym (OpenAI is an artificial intelligence research company, funded in part by Elon Musk. Its stated goal is to promote and develop friendly AIs that will benefit humanity (rather than exterminate it).) is a toolkit that provides a wide variety of simulated environments (Atari games, board games, 2D and 3D physical simulations, and so on), so you can train agents, compare them, or develop new RL algorithms.

Before installing the toolkit, if you created an isolated environment using virtualenv, you first need to activate it:

$ cd $ML_PATH # Your ML working directory (e.g., $HOME/ml)
$ source my_env/bin/activate # on Linux or MacOS
$ .\my_env\Scripts\activate # on Windows

Next, install OpenAI Gym (if you are not using a virtual environment, you will need to add the --user option, or have administrator rights):

$ python3 -m pip install -U gym

     Depending on your system, you may also need to install the Mesa OpenGL Utility (GLU) library (e.g., on Ubuntu 18.04 you need to run apt install libglu1-mesa). This library will be needed to render the first environment.

Next, open up a Python shell or a Jupyter notebook and import gym:

import gym

Let's list all the available environments:

gym.envs.registry.all()

     The Cart-Pole is a very simple environment composed of a cart that can move left or right, and a pole placed vertically on top of it. The agent must move the cart left or right to keep the pole upright.

Create a CartPole environment with make():

env = gym.make('CartPole-v1')

After the environment is created, you must initialize it using the reset() method. This returns an observation:

env.seed(42)
obs = env.reset()

     Observations vary depending on the environment. In this case it is a 1D NumPy array composed of 4 floats: they represent the cart's horizontal position (0.0 = center), its velocity (positive means right), the angle of the pole (0 = vertical), and the angular velocity (positive means clockwise).

obs
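# With seed 42, this returns roughly the values quoted later in this post's comments
# (exact numbers may differ across Gym versions):
#     array([-0.01258566, -0.00156614,  0.04207708, -0.00180545])
#     i.e. [cart position, cart velocity, pole angle, angular velocity]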

Figure 18-4. The CartPole environment

     Here, we’ve created a CartPole environment. This is a 2D simulation in which a cart can be accelerated left or right in order to balance a pole placed on top of it (see Figure 18-4).

Now let’s display this environment by calling its render() method (see Figure 18-4); you can pick the rendering mode (the rendering options depend on the environment). On Windows, this requires first installing an X server, such as VcXsrv or Xming:

Warning: some environments (including the Cart-Pole) require access to your display, which opens up a separate window, even if you specify mode="rgb_array". In general you can safely ignore that window. However, if Jupyter is running on a headless server (i.e., without a screen) it will raise an exception. One way to avoid this is to install a fake X server like Xvfb.

#########################################

On Debian or Ubuntu:

$ apt update
$ apt install -y xvfb

You can then start Jupyter using the xvfb-run command:

$ xvfb-run -s "-screen 0 1400x900x24" jupyter notebook

On Google Colab:

!apt-get install python-opengl -y
!apt install xvfb -y
!pip install pyvirtualdisplay
!pip install pyglet
# !pip install gym

You may need to restart the runtime and then run the commands above again.

from IPython import display as ipythondisplay
from pyvirtualdisplay import Display

display = Display(visible=0, size=(1400, 900))
display.start()

#########################################

try:
    import pyvirtualdisplay
    display = pyvirtualdisplay.Display(visible=0, size=(1400, 900)).start()
except ImportError:
    pass


env.render()

 

If you want render() to return the rendered image as a NumPy array, you can set mode="rgb_array" (oddly, this environment will render to screen as well):

img = env.render( mode='rgb_array' )
img.shape

import matplotlib.pyplot as plt
%matplotlib inline

def plot_environment( env, figsize=(5,4) ):
    plt.figure( figsize=figsize )
    img = env.render( mode="rgb_array" )
    plt.imshow( img )
    plt.axis("off")
    return img

plot_environment(env)
plt.show()

     Let's see how to interact with an environment. Your agent will need to select an action from an "action space" (the set of possible actions). Let's see what this environment's action space looks like:

env.action_space

     Discrete(2) means that the possible actions are integers 0 and 1, which represent accelerating left (0) or right (1).

Other environments may have additional discrete actions, or other kinds of actions (e.g., continuous).
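As a quick sanity check (not needed for training), you can also ask the action space itself for its size or for a random valid action:

env.action_space.n        # 2: the number of possible discrete actions
env.action_space.sample() # a random valid action: 0 (left) or 1 (right)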

     Since the pole is leaning toward the right (obs[2] > 0), let’s accelerate the cart toward the right:

action = 1 # accelerate right
obs, reward, done, info = env.step( action )
obs

 

     Notice that the cart is now moving toward the right (obs[1] > 0). The pole is still tilted toward the right (obs[2] > 0), but its angular velocity is now negative (obs[3] < 0), so it will likely be tilted toward the left after the next step.

The step() method executes the given action and returns four values: the new observation, a reward, a done flag, and an environment-specific info dictionary (each is discussed below).

#######################
Once you have finished using an environment, you should call its close() method to free resources.
#######################

plot_environment( env )

Looks like it's doing what we're telling it to do!

The environment also tells the agent how much reward it got during the last step:

reward


When the game is over, the environment returns done=True:

done

     Finally, info is an environment-specific dictionary that can provide some extra information that you may find useful for debugging or for training. For example, in some games it may indicate how many lives the agent has.

info
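# For CartPole, this is typically just an empty dictionary: {}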


     The sequence of steps between the moment the environment is reset until it is done is called an "episode". At the end of an episode (i.e., when step() returns done=True), you should reset the environment before you continue to use it.

if done:
    obs = env.reset()

     Now how can we make the pole remain upright? We will need to define a policy for that. This is the strategy that the agent will use to select an action at each step. It can use all the past actions and observations to decide what to do.

     Let’s hardcode a simple policy that accelerates left when the pole is leaning toward the left and accelerates right when the pole is leaning toward the right. We will run this policy to see the average rewards it gets over 500 episodes:

env.seed(42)


def basic_policy(obs):
    angle = obs[2]
    return 0 if angle<0 else 1 # possible actions are integers 0 and 1

import numpy as np

totals = []
for episode in range( 500 ):
    episode_rewards = 0
    obs = env.reset()
    # plot_environment( env )
    for step in range(200):
        action = basic_policy(obs)
        obs, reward, done, info = env.step( action ) # possible actions are integers 0 and 1
        # plot_environment( env )
        episode_rewards += reward
        
        if done:
            break
    totals.append( episode_rewards )

np.mean( totals ), np.std( totals ), np.min( totals ), np.max( totals )

This code is hopefully self-explanatory. Let’s look at the result: 


    Even with 500 tries, this policy never managed to keep the pole upright for more than 68 consecutive steps. Not great. If you look at the simulation in the Jupyter notebooks, you will see that the cart oscillates left and right more and more strongly until the pole tilts too much. Let’s see if a neural network can come up with a better policy.

Let's visualize one episode:


env.seed(42)

frames = []

obs = env.reset()  # reset the environment before starting the episode
for step in range(200):
    img = env.render( mode='rgb_array' )
    frames.append(img)
    
    action = basic_policy( obs )
    obs, reward, done, info = env.step( action )
    
    if done:
        break

Now show the animation:

import matplotlib.animation as animation
import matplotlib as mpl
mpl.rc('animation', html='jshtml')

def update_scene( num, frames, patch ):
    patch.set_data( frames[num] )
    return patch,

def plot_animation( frames, repeat=False, interval=40 ):
    fig = plt.figure()
    patch = plt.imshow( frames[0] ) # img
    plt.axis( 'off' )
    
    anim = animation.FuncAnimation(
            fig,         # figure
            func=update_scene, # The function to call at each frame. 
            fargs=(frames, patch), # Additional arguments to pass to each call to func.
            frames = len(frames),  # iterable, int, generator function, or None, optional : Source of data to pass func and each frame of the animation
            repeat=repeat, 
            interval=interval
    )
    plt.close()
    return anim
plot_animation(frames)

 

Clearly the system is unstable, and after just a few wobbles the pole ends up too tilted: game over. We will need to be smarter than that!

For more details, see the OpenAI Gym documentation: https://gym.openai.com/docs/

env.observation_space

 

env.observation_space.shape[0]

 

env.observation_space.low

 

env.observation_space.high

 

obs

 

Neural Network Policies

Figure 18-5. Neural network policy

     Let’s create a neural network policy. Just like with the policy we hardcoded earlier, this neural network will take an observation as input, and it will output the action to be executed. More precisely, it will estimate a probability for each action, and then we will select an action randomly, according to the estimated probabilities (see Figure 18-5). In the case of the CartPole environment, there are just two possible actions (left or right), so we only need one output neuron. It will output the probability p of action 0 (left), and of course the probability of action 1 (right) will be 1 – p (this is also why we will later be able to train it with keras.losses.binary_crossentropy). For example, if it outputs 0.7, then we will pick action 0 with 70% probability, or action 1 with 30% probability.
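As a small illustration of this sampling step (the same trick is used later in render_policy_net and in the training loop), assuming left_proba holds the network's output for a single observation:

import numpy as np

left_proba = 0.7  # example network output: the probability of action 0 (left)
# np.random.rand() > 0.7 happens about 30% of the time, so this picks
# action 0 (left) with probability 0.7 and action 1 (right) with probability 0.3
action = int(np.random.rand() > left_proba)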

     You may wonder why we are picking a random action based on the probabilities given by the neural network, rather than just picking the action with the highest score. This approach lets the agent find the right balance between exploring new actions and exploiting the actions that are known to work well. (In general, exploitation will result in choosing actions with a greater short-term reward, whereas exploration can potentially result in greater total rewards in the long run.) Here’s an analogy: suppose you go to a restaurant for the first time, and all the dishes look equally appealing, so you randomly pick one. If it turns out to be good, you can increase the probability that you’ll order it next time, but you shouldn’t increase that probability up to 100%, or else you will never try out the other dishes, some of which may be even better than the one you tried.

     Also note that in this particular environment, the past actions and observations can safely be ignored, since each observation contains the environment’s full state. If there were some hidden state, then you might need to consider past actions and observations as well. For example, if the environment only revealed the position of the cart but not its velocity, you would have to consider not only the current observation but also the previous one in order to estimate the current velocity.

Here is the code to build this neural network policy using tf.keras:

from tensorflow import keras
import tensorflow as tf
import numpy as np

keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

# The number of inputs is the size of the observation space (which in the case of Cart‐Pole is 4)
n_inputs=4 # == env.observation_space.shape[0]

model = keras.models.Sequential([
    # the input is one observation: 4 features (cart position, cart velocity, pole angle, angular velocity)
    keras.layers.Dense(5, activation="elu", input_shape=[n_inputs]),
    keras.layers.Dense(1, activation="sigmoid"),
])


     After the imports, we use a simple Sequential model to define the policy network. The number of inputs is the size of the observation space (which in the case of Cart‐Pole is 4), and we have just five hidden units because it’s a simple problem. Finally, we want to output a single probability (the probability of going left), so we have a single output neuron using the sigmoid activation function. If there were more than two possible actions, there would be one output neuron per action, and we would use the softmax activation function instead.
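For instance, here is a sketch of what the policy network might look like for a hypothetical environment with 3 discrete actions (CartPole itself only has 2, so this model is not used anywhere below):

n_actions = 3  # hypothetical environment with 3 discrete actions
multi_action_model = keras.models.Sequential([
    keras.layers.Dense(5, activation="elu", input_shape=[n_inputs]),
    keras.layers.Dense(n_actions, activation="softmax"),  # one probability per action
])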

Let's write a small function that will run the model to play one episode, and return the frames so we can display an animation:

def render_policy_net( model, n_max_steps=200, seed=42 ):
    frames = []
    env = gym.make("CartPole-v1")
    
    env.seed(seed)
    np.random.seed(seed)
    
    obs = env.reset()
    
    for step in range(n_max_steps):
        frames.append( env.render( mode="rgb_array" ) )
        # obs is a 1D array of 4 features, e.g.:
        #     array([-0.01258566, -0.00156614,  0.04207708, -0.00180545])
        # obs.reshape(1,-1) turns it into a batch containing a single instance:
        #     array([[-0.01258566, -0.00156614,  0.04207708, -0.00180545]])
        left_proba = model.predict( obs.reshape(1,-1) )
        action = int( np.random.rand() > left_proba )
        obs, reward, done, info = env.step( action )
        if done:
            break
    env.close()
    return frames

Now let's look at how well this randomly initialized policy network performs:

frames = render_policy_net( model )
plot_animation(frames)

Yeah... pretty bad. The neural network will have to learn to do better.

     First let's see if it is capable of learning the basic policy we used earlier: go left if the pole is tilting left, and go right if it is tilting right.

     We can make the same net play in 50 different environments in parallel (this will give us a diverse training batch at each step), and train for 5,000 iterations. We also reset environments when they are done. We train the model using a custom training loop so we can easily use the predictions at each training step to advance the environments.
     Note that loss_fn returns one loss value per instance, where each instance here is one observation of 4 features coming from one of the 50 environments, so we apply tf.reduce_mean to average these per-environment losses into a single scalar loss (see https://blog.csdn.net/Linli522362242/article/details/96480059 for more on reduce_mean).

n_environments = 50
n_iterations = 5000

# make the same net play in 50 different environments in parallel
envs = [ gym.make("CartPole-v1") for _ in range( n_environments) ]

for index, env in enumerate( envs ):
    env.seed( index )
np.random.seed(42)

observations = [ env.reset() for env in envs ]

optimizer = keras.optimizers.RMSprop()
loss_fn = keras.losses.binary_crossentropy # the output is a single probability (the probability of going left)

for iteration in range( n_iterations ):
    # The target action is derived directly from the current pole angle:
    # if angle (obs[2]) < 0, the target probability of going left is 1.
    # otherwise, the target probability of going left is 0.
    target_probas = np.array([ ( [1.] if obs[2]<0 else [0.] )
                              for obs in observations # 50 different environments in parallel ~ 50 groups of observations
                             ])
    
    with tf.GradientTape() as tape:
        # Based on the current observations, we predict the probability of the next action
        left_probas = model( np.array(observations) ) # Prediction: output the probability of going left
        # We then minimize the cross-entropy loss between these predicted probabilities
        # and the target probabilities; the gradients of this loss (computed below) are
        # used to update the weights so the predictions move toward the targets.

        # - np.log(sigmoid(z))     # if y=1
        # - np.log(1 - sigmoid(z)) # if y=0
        loss = tf.reduce_mean( loss_fn(target_probas, left_probas) ) # across 50 different environments
    print( "\rIteration: {}, Loss: {:.3f}".format( iteration, loss.numpy() ), 
           end="" 
         )
    grads = tape.gradient( loss, model.trainable_variables )
    optimizer.apply_gradients( zip(grads, model.trainable_variables) ) # training
    # For each of the 50 environments, sample the next action randomly:
    # if np.random.rand() > left_proba we go right (1), otherwise we go left (0)
    actions = ( np.random.rand( n_environments, 1 ) > left_probas.numpy() ).astype( np.int32 )
    for env_index, env in enumerate( envs ):
        obs, reward, done, info = env.step( actions[env_index][0] )
        observations[env_index] = obs if not done else env.reset()
        # note :env.reset() return obs: array([-0.02838569,  0.04781538,  0.0453429 , -0.03128895])
        
for env in envs:
    env.close()

frames = render_policy_net( model )
plot_animation( frames )

Looks like it learned the policy correctly. Now let's see if it can learn a better policy on its own, one that does not wobble as much.

OK, we now have a neural network policy that will take observations and output action probabilities. But how do we train it?

Evaluating Actions: The Credit Assignment Problem

     If we knew what the best action was at each step, we could train the neural network as usual, by minimizing the cross entropy between the estimated probability distribution and the target probability distribution. It would just be regular supervised learning. However, in Reinforcement Learning the only guidance the agent gets is through rewards, and rewards are typically sparse and delayed. For example, if the agent manages to balance the pole for 100 steps, how can it know which of the 100 actions it took were good, and which of them were bad? All it knows is that the pole fell after the last action, but surely this last action is not entirely responsible. This is called the credit assignment problem: when the agent gets a reward, it is hard for it to know which actions should get credited (or blamed) for it. Think of a dog that gets rewarded hours after it behaved well; will it understand what it is being rewarded for?
Figure 18-6. Computing an action’s return: the sum of discounted future rewards

     To tackle this problem, a common strategy is to evaluate an action based on the sum of all the rewards that come after it, usually applying a discount factor γ at each step (see Figure 18-6); this sum is called the action’s return.
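A minimal sketch of computing these discounted sums (the returns), assuming a list of raw per-step rewards and a discount factor; for example, with a discount factor of 0.8, rewards of 10, 0, and –50 yield returns of –22, –40, and –50:

import numpy as np

def discount_rewards(rewards, discount_factor):
    # Work backwards: each step's return is its own reward plus the
    # discounted return of the step that follows it.
    discounted = np.array(rewards, dtype=np.float64)
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += discounted[step + 1] * discount_factor
    return discounted

discount_rewards([10, 0, -50], discount_factor=0.8)
# array([-22., -40., -50.])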
