
(Basics) Reward Function Input Parameters Explained - 005

2021-10-10 20:03:33 Author: Internet

Contents

The form of the reward function
The params input parameter
Summary

This article is based mainly on the official AWS documentation, supplemented by my own understanding and commentary.
Reference: link

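The form of the reward function

The reward function is named reward_function, and that name presumably cannot be changed. Its input parameter, params, is a dictionary.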

def reward_function(params):
    '''
    Example of rewarding the agent to follow center line
    '''
    
    # Read input parameters
    track_width = params['track_width']
    distance_from_center = params['distance_from_center']
    
    # Calculate 3 markers that are at varying distances away from the center line
    marker_1 = 0.1 * track_width
    marker_2 = 0.25 * track_width
    marker_3 = 0.5 * track_width
    
    # Give higher reward if the car is closer to center line and vice versa
    if distance_from_center <= marker_1:
        reward = 1.0
    elif distance_from_center <= marker_2:
        reward = 0.5
    elif distance_from_center <= marker_3:
        reward = 0.1
    else:
        reward = 1e-3  # likely crashed/ close to off track
    
    return float(reward)

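The params input parameter

params contains many parameters. Pick the ones that suit your design so that the reward strategy lets the agent earn more reward. This article gives only a brief overview; the later advanced and hands-on articles go into detail.
The parameter list is as follows: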

{
    "all_wheels_on_track": Boolean,        # flag to indicate if the agent is on the track
    "x": float,                            # agent's x-coordinate in meters
    "y": float,                            # agent's y-coordinate in meters
    "closest_objects": [int, int],         # zero-based indices of the two closest objects to the agent's current position of (x, y).
    "closest_waypoints": [int, int],       # indices of the two nearest waypoints.
    "distance_from_center": float,         # distance in meters from the track center 
    "is_crashed": Boolean,                 # Boolean flag to indicate whether the agent has crashed.
    "is_left_of_center": Boolean,          # Flag to indicate if the agent is on the left side to the track center or not. 
    "is_offtrack": Boolean,                # Boolean flag to indicate whether the agent has gone off track.
    "is_reversed": Boolean,                # flag to indicate if the agent is driving clockwise (True) or counter clockwise (False).
    "heading": float,                      # agent's yaw in degrees
    "objects_distance": [float, ],         # list of the objects' distances in meters between 0 and track_length in relation to the starting line.
    "objects_heading": [float, ],          # list of the objects' headings in degrees between -180 and 180.
    "objects_left_of_center": [Boolean, ], # list of Boolean flags indicating whether elements' objects are left of the center (True) or not (False).
    "objects_location": [(float, float),], # list of object locations [(x,y), ...].
    "objects_speed": [float, ],            # list of the objects' speeds in meters per second.
    "progress": float,                     # percentage of track completed
    "speed": float,                        # agent's speed in meters per second (m/s)
    "steering_angle": float,               # agent's steering angle in degrees
    "steps": int,                          # number of steps completed
    "track_length": float,                 # track length in meters.
    "track_width": float,                  # width of the track
    "waypoints": [(float, float), ]        # list of (x,y) as milestones along the track center

}
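
Because params is an ordinary dictionary, a reward function can be exercised locally before it is uploaded. The following is a minimal sketch: the values in sample_params are made up for illustration and do not come from a real track, and only the keys this function reads are filled in.

```python
def reward_function(params):
    """Center-line example from the start of this article, shortened."""
    track_width = params['track_width']
    distance_from_center = params['distance_from_center']

    if distance_from_center <= 0.1 * track_width:
        reward = 1.0
    elif distance_from_center <= 0.25 * track_width:
        reward = 0.5
    elif distance_from_center <= 0.5 * track_width:
        reward = 0.1
    else:
        reward = 1e-3
    return float(reward)


# Hypothetical values, chosen only to exercise two of the branches.
sample_params = {
    'track_width': 1.0,
    'distance_from_center': 0.05,
}

near_center = reward_function(sample_params)
far_out = reward_function({**sample_params, 'distance_from_center': 0.6})
print(near_center, far_out)
```

Calling the function this way makes it easy to check each branch by hand before spending training time on it.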

all_wheels_on_track

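Type: Boolean
Range: (True:False)
A Boolean flag indicating whether the agent is on the track or off it.
The vehicle is considered off track (False) if any of its wheels is outside the track borders. It is considered on track (True) if all of its wheels are inside both track borders.
Tip: use it in your first model so that the car quickly learns to complete a lap, that is, to stay within the track.
[Figure: the agent on the track]
[Figure: the agent off the track]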

Example code: all_wheels_on_track

def reward_function(params):
    #############################################################################
    '''
    Example of using all_wheels_on_track and speed
    '''

    # Read input variables
    all_wheels_on_track = params['all_wheels_on_track']
    speed = params['speed']

    # Set the speed threshold based on your action space
    SPEED_THRESHOLD = 1.0

    if not all_wheels_on_track: 
        # Penalize if the car goes off track
        reward = 1e-3
    elif speed < SPEED_THRESHOLD:
        # Penalize if the car goes too slow
        reward = 0.5
    else:
        # High reward if the car stays on track and goes fast
        reward = 1.0

    return float(reward)

closest_waypoints

Type: [int, int]
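This parameter is very important when designing a reward function.
Range: [(0:Max-1),(1:Max-1)]
The zero-based indices of the two neighboring waypoints closest to the agent's current position (x, y). Distance is measured as the Euclidean distance from the agent center. The first element is the closest waypoint behind the agent and the second is the closest waypoint ahead of it. Max is the length of the waypoints list. In the illustration under waypoints, closest_waypoints would be [16, 17].
Example: a reward function using the closest_waypoints parameter.
The following example reward function shows how to use waypoints, closest_waypoints, and heading to compute an immediate reward.
AWS DeepRacer supports the following libraries: math, random, NumPy, SciPy, and Shapely. To use one, add an import statement, import supported library, above your function definition, def function_name(parameters).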

# Place import statement outside of function (supported libraries: math, random, numpy, scipy, and shapely)
# Example imports of available libraries
#
# import math
# import random
# import numpy
# import scipy
# import shapely

import math

def reward_function(params):
    ###############################################################################
    '''
    Example of using waypoints and heading to make the car point in the right direction
    '''

    # Read input variables
    waypoints = params['waypoints']
    closest_waypoints = params['closest_waypoints']
    heading = params['heading']

    # Initialize the reward with typical value
    reward = 1.0

    # Calculate the direction of the center line based on the closest waypoints
    next_point = waypoints[closest_waypoints[1]]
    prev_point = waypoints[closest_waypoints[0]]

    # Calculate the direction in radians, arctan2(dy, dx); the result is in (-pi, pi)
    track_direction = math.atan2(next_point[1] - prev_point[1], next_point[0] - prev_point[0])
    # Convert radians to degrees
    track_direction = math.degrees(track_direction)

    # Calculate the difference between the track direction and the heading direction of the car
    direction_diff = abs(track_direction - heading)
    if direction_diff > 180:
        direction_diff = 360 - direction_diff

    # Penalize the reward if the difference is too large
    DIRECTION_THRESHOLD = 10.0
    if direction_diff > DIRECTION_THRESHOLD:
        reward *= 0.5

    return float(reward)

closest_objects

Type: [int, int]
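Every parameter with objects in its name is only used in multi-car races; none of them is needed in individual time trials.
Range: [(0:len(objects_location)-1), (0:len(objects_location)-1)]
The zero-based indices of the two objects closest to the agent's current position (x, y). The first index is the closest object behind the agent and the second is the closest object ahead of it. If there is only one object, both indices are 0.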

distance_from_center

Type: float
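Range: 0:~track_width/2
Displacement, in meters, between the agent center and the track center. The maximum displacement is observed when any of the agent's wheels is outside a track border and, depending on the width of the track border, it can be slightly smaller or larger than half of track_width. In practice it is larger than half of track_width. Note that this value does not say whether the agent is left or right of the track center. Combine it with is_left_of_center to find the side, and with is_reversed to tell clockwise from counter-clockwise driving; this helps a great deal when designing a reward function.
Example: a reward function using distance_from_center
This function gives a higher reward the closer the car stays to the center line.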

def reward_function(params):
    #################################################################################
    '''
    Example of using distance from the center
    '''

    # Read input variable
    track_width = params['track_width']
    distance_from_center = params['distance_from_center']

    # Penalize if the car is too far away from the center
    marker_1 = 0.1 * track_width
    marker_2 = 0.5 * track_width

    if distance_from_center <= marker_1:  # the closer to the center line, the higher the reward
        reward = 1.0
    elif distance_from_center <= marker_2:
        reward = 0.5
    else:
        reward = 1e-3  # likely crashed/ close to off track

    return float(reward)
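
The note above about combining distance_from_center with is_left_of_center can be sketched as a signed offset. This is only an illustration: the helper name signed_offset, the convention that positive means left, and the TARGET constant are choices made here and are not part of the DeepRacer parameters.

```python
TARGET = 0.0  # assumed target position: 0 is the center line, +0.5 is halfway to the left border


def signed_offset(params):
    """Lateral offset in meters: positive left of center, negative right of it."""
    distance = params['distance_from_center']
    return distance if params['is_left_of_center'] else -distance


def reward_function(params):
    # Scale so that the track borders sit at roughly -1 and +1.
    normalized = signed_offset(params) / (0.5 * params['track_width'])
    # Reward falls off linearly with the distance from the target position.
    reward = max(1e-3, 1.0 - abs(normalized - TARGET))
    return float(reward)
```

Changing TARGET to a non-zero value shifts the preferred line toward one side of the track, which a reward based on the unsigned distance alone cannot express.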

heading

Type: float
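Range: -180:+180, positive above the x-axis and negative below it.
Tip: compare it with the angle between the x-axis and the line through the two closest waypoints, so that the car follows the track.
The agent's heading direction, in degrees, measured relative to the x-axis of the coordinate system.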

is_crashed

Type: Boolean
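Range: (True:False)
A Boolean flag indicating whether the agent's terminal state is a crash into another object (True or False). This parameter is not used in individual time trials.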

is_left_of_center

Type: Boolean
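Range: [True:False]
A Boolean flag indicating whether the agent is to the left of the track center (True) or to the right (False).
On simple tracks this parameter can be used to make the car run fast along the outer line, which is very useful in practice: on the re:Invent 2018 track, the first-place finisher took the corners at high speed along the tangent of the outer edge and achieved a good result that way.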
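
One possible way to express a preference for the outer side is sketched below. It rests on a strong assumption: a simple oval where every turn bends the same way, so the outer side is the right when driving counter-clockwise and the left when driving clockwise. The reward values are also made up for illustration.

```python
def reward_function(params):
    is_left = params['is_left_of_center']
    is_reversed = params['is_reversed']  # True means clockwise, per the parameter list

    # Assumption: simple oval, all turns in one direction.
    # Counter-clockwise (is_reversed False): outer side is the right (is_left False).
    # Clockwise (is_reversed True): outer side is the left (is_left True).
    on_outer_side = (is_left == is_reversed)

    reward = 1.0 if on_outer_side else 0.5  # assumed reward values

    # Staying on the track still matters more than the side.
    if not params['all_wheels_on_track']:
        reward = 1e-3
    return float(reward)
```

On tracks with turns in both directions this rule no longer identifies the outer side, and a waypoint-based approach would be needed instead.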

is_offtrack

Type: Boolean
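Range: (True:False)
A Boolean flag indicating whether the agent's terminal state is off track (True or False).
This parameter can be used to penalize the agent for running off the track.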
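
A minimal sketch of such a penalty follows. It also checks is_crashed so that the same function works in multi-car races; the reward values 1e-3 and 1.0 are assumptions chosen to match the other examples in this article.

```python
def reward_function(params):
    # Terminal failure states get the lowest reward.
    if params['is_offtrack'] or params['is_crashed']:
        return float(1e-3)

    # Otherwise give a neutral reward that other terms could build on.
    return float(1.0)
```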

is_reversed

Type: Boolean
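Range: [True:False]
A Boolean flag indicating whether the agent is driving clockwise (True) or counter-clockwise (False).
This parameter is used when you enable changing direction for each episode.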

progress

Type: float
Range: 0:100
Percentage of the track completed.
Example: a reward function using progress


def reward_function(params):
    #############################################################################
    '''
    Example of using progress
    '''
    progress = params['progress']
    # Initialize the reward with typical value
    reward = 1.0

    # Give a bonus if the car completes a full lap
    if int(progress) == 100 :
        reward += 10.0
    return float(reward)

speed

Type: float
Range: 0.0:5.0
The observed speed of the agent, in meters per second (m/s).

steering_angle

Type: float
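Range: -30:30
Steering angle, in degrees, of the front wheels relative to the agent's center line. A negative sign (-) means steering to the right and a positive sign (+) means steering to the left. The agent's center line is not necessarily parallel to the track center line.
[Figure: steering angle relative to the agent's center line]
Example: a reward function using steering_angle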

def reward_function(params):
    '''
    Example of using steering angle
    '''

    # Read input variable
    abs_steering = abs(params['steering_angle'])  # We don't care whether it is left or right steering

    # Initialize the reward with typical value
    reward = 1.0

    # Penalize if the car steers too much, to prevent zigzagging
    ABS_STEERING_THRESHOLD = 20.0
    if abs_steering > ABS_STEERING_THRESHOLD:
        reward *= 0.8

    return float(reward)

steps

Type: int
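Range: 0:N_step
This is a very important parameter: it can be used to discourage the agent from zigzagging and to reward going faster. It will be covered in depth in the hands-on articles.
Number of steps completed. A step corresponds to one action taken by the agent following the current policy.
Example: a reward function using steps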

def reward_function(params):
    #############################################################################
    '''
    Example of using steps and progress
    '''

    # Read input variable
    steps = params['steps']
    progress = params['progress']

    # Total number of steps in which we want the car to finish the lap; it varies depending on the track length
    TOTAL_NUM_STEPS = 300

    # Initialize the reward with typical value
    reward = 1.0

    # Every 100 steps, give an additional reward if the car is ahead of the expected pace
    if (steps % 100) == 0 and progress > (steps / TOTAL_NUM_STEPS) * 100 :
        reward += 10.0

    return float(reward)

track_length

Type: float
Range: [0:Lmax]
Track length in meters. Lmax is track-dependent: it is a fixed value for a given track, for example:
re:Invent 2018 track length: 17.71 m
The 2019 DeepRacer Championship Cup track length: 23.12 m
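
With track_length, progress can be turned into an estimate in meters. The sketch below derives the distance covered and an average distance per step; the helper name distance_covered and the 0.05 m-per-step target pace are assumptions for illustration only.

```python
def distance_covered(params):
    """Estimated meters travelled along the track, derived from progress (0 to 100)."""
    return params['progress'] / 100.0 * params['track_length']


def reward_function(params):
    steps = params['steps']
    reward = 1.0

    # Guard against division by zero on the very first step.
    if steps > 0:
        meters_per_step = distance_covered(params) / steps
        if meters_per_step > 0.05:  # assumed target pace
            reward += 1.0
    return float(reward)
```

Because the threshold is expressed in meters per step, the same function can be reused on tracks of different lengths without retuning a total step count.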

track_width

Type: float
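Range: 0:D_track
Track width in meters. The width is not constant; in practice it turns out to differ from point to point along the track.
Example: a reward function using the track_width parameter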

def reward_function(params):
    #############################################################################
    '''
    Example of using track width
    '''

    # Read input variable
    track_width = params['track_width']
    distance_from_center = params['distance_from_center']

    # Calculate the distance from each border
    distance_from_border = 0.5 * track_width - distance_from_center

    # Reward higher if the car stays inside the track borders
    if distance_from_border >= 0.05:
        reward = 1.0
    else:
        reward = 1e-3 # Low reward if too close to the border or goes off the track

    return float(reward)

x, y

Type: float
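Range: 0:N
Location, in meters, of the agent center along the x and y axes of the simulated environment containing the track. The origin is at the lower-left corner of the simulated environment.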

waypoints

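Type: list of [float, float]
Range: [[x_{w,0}, y_{w,0}] … [x_{w,Max-1}, y_{w,Max-1}]]
An ordered list of Max track-dependent milestones along the track center. Each milestone is described by a coordinate (x_{w,i}, y_{w,i}). For a looped track, the first and last waypoints are the same. For a straight or other non-looped track, the first and last waypoints are different.
[Figure: waypoints along the track center]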
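
The two properties described above, a looped track repeating its first waypoint and waypoints being plain (x, y) coordinates, can be checked with a couple of small helpers. The names is_looped and distance_to_waypoint are hypothetical and exist only in this sketch.

```python
import math


def is_looped(waypoints):
    """True when the first and last waypoints coincide, as on a looped track."""
    first, last = waypoints[0], waypoints[-1]
    return math.isclose(first[0], last[0]) and math.isclose(first[1], last[1])


def distance_to_waypoint(params, index):
    """Euclidean distance in meters from the agent center (x, y) to waypoints[index]."""
    wx, wy = params['waypoints'][index][:2]
    return math.hypot(params['x'] - wx, params['y'] - wy)
```

For instance, distance_to_waypoint(params, params['closest_waypoints'][1]) gives the distance to the next waypoint ahead of the agent.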

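Summary

Using the given parameters flexibly lets the agent complete the track quickly. The above is a basic introduction to each parameter; for a policy to converge quickly, the parameters need to be combined and the hyperparameters tuned. The next article will explain what each hyperparameter does.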
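
As one hedged illustration of combining parameters, the sketch below multiplies a center-line term, a heading term, and a speed term. The thresholds (10 degrees, 1.0 m/s) are reused from the examples earlier in this article, and the multiplicative weighting is an assumption of this sketch, not a recommended or tuned design.

```python
import math


def reward_function(params):
    # Hard failure first: leaving the track overrides everything else.
    if not params['all_wheels_on_track']:
        return float(1e-3)

    # 1) Center-line term: 1.0 at the center, 0.0 at half the track width.
    half_width = 0.5 * params['track_width']
    center_term = max(0.0, 1.0 - params['distance_from_center'] / half_width)

    # 2) Heading term: compare the heading with the local track direction.
    waypoints = params['waypoints']
    prev_index, next_index = params['closest_waypoints']
    prev_point, next_point = waypoints[prev_index], waypoints[next_index]
    track_direction = math.degrees(
        math.atan2(next_point[1] - prev_point[1], next_point[0] - prev_point[0]))
    direction_diff = abs(track_direction - params['heading'])
    if direction_diff > 180:
        direction_diff = 360 - direction_diff
    heading_term = 1.0 if direction_diff <= 10.0 else 0.5

    # 3) Speed term: mild penalty for driving slowly.
    speed_term = 1.0 if params['speed'] >= 1.0 else 0.5

    reward = center_term * heading_term * speed_term
    return float(max(reward, 1e-3))
```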

Source: https://blog.csdn.net/qq_37608398/article/details/120681768