[Reinforcement Learning] An Introduction to gym

Date: 2023-09-14 09:15:10

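What is gym?

gym can be thought of as a collection of simulation environments with many built-in games, such as the Taxi game and the Cliff Walking game. Each game has its own grid, rules, and rewards, which makes them suitable test beds for reinforcement learning. gym also provides rendering, so the results can be inspected visually.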

Installing gym

pip install gym

Common gym functions

Creating an environment

gym.make('environment name')
For example, to select the Pong-v0 environment:
env = gym.make('Pong-v0')

Resetting the environment

env.reset()
Resets the environment to its initial state.

Rendering the environment

env.render()
Renders the current state of the environment for visual display.

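Stepping the environment

env.step(action):
step() executes one action and returns a tuple (observation, reward, done, info):
observation (object): the state after the agent executes action a, i.e. the "next state s'"
reward (float): the reward the agent receives for executing action a
done (bool): whether the episode has ended, i.e. whether s' is a terminal state. If so, done=True; otherwise done=False.
info (dict): auxiliary diagnostic information (useful for debugging, and sometimes for learning); usually not needed.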
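
The reset/step cycle described above can be sketched without installing gym. The CountdownEnv class and the run_episode helper below are hypothetical stand-ins, not part of gym; they only mimic the reset() and step() contract with the 4-tuple return described here. Newer gym releases (0.26 and later) return a 5-tuple from step() and an (observation, info) pair from reset(), so the unpacking would need adjusting there.

```python
import random


class CountdownEnv:
    """Hypothetical toy environment (not part of gym) with gym-style reset()/step()."""

    def __init__(self, start=5):
        self.start = start
        self.state = start

    def reset(self):
        # Return to the initial state and hand back the first observation.
        self.state = self.start
        return self.state

    def step(self, action):
        # Action 1 moves one step toward the terminal state 0; action 0 stays put.
        if action == 1:
            self.state -= 1
        done = self.state == 0
        reward = 10.0 if done else -1.0
        info = {}  # no diagnostics in this toy example
        return self.state, reward, done, info


def run_episode(env, policy, max_steps=100):
    """Run one episode and return (total_reward, steps_taken)."""
    obs = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = policy(obs)
        obs, reward, done, info = env.step(action)
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


# A random policy, analogous to sampling from env.action_space in gym.
random.seed(0)
print(run_episode(CountdownEnv(), lambda obs: random.choice([0, 1])))
```

The same loop shape applies to a real gym environment: call reset() once, then call step() repeatedly until done is True or a step limit is reached.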

Listing all environments

from gym import envs
names = [env.id for env in envs.registry.all()]
print('\n'.join(names))

Example: the Taxi problem

Below, gym is used to create and visualize the Taxi problem (Taxi-v2).

Visualizing the environment

import gym

# Create the environment
env = gym.make('Taxi-v2') # if Taxi-v2 does not exist, use Taxi-v3 instead
# Reset the environment
obs = env.reset()
# Render the current state of the environment
env.render()

[Image: rendered Taxi environment]

Listing the number of states and actions

m = env.observation_space.n  # size of the state space
n = env.action_space.n  # size of action space

print("The Taxi problem has {:d} states and {:d} actions.".format(m, n))

The Taxi problem has 500 states and 6 actions.

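Rules

The map has four locations: R, G, B, and Y. Blue marks the passenger's current location (the pickup point) and magenta marks the passenger's destination (the drop-off point); both are chosen at random from the four locations. The yellow box is the taxi, which starts at a random square on the grid. In the map, a vertical bar (|) is a wall that cannot be crossed, and a colon (:) is an opening the taxi can drive through. The episode ends once the taxi has picked up the passenger and dropped them off at the destination.

Actions:
There are 6 discrete deterministic actions:

0: move south
1: move north
2: move east
3: move west
4: pick up the passenger
5: drop off the passenger

Rewards:
Each action yields a reward of -1, and delivering the passenger yields an additional +20. Executing "pickup" or "dropoff" illegally yields -10.

Colors:

Blue: passenger
Magenta: destination
Yellow: empty taxi
Green: full taxi

State space:
The state space is represented as:
(taxi_row, taxi_col, passenger_location, destination)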
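
The 4-tuple above and the 500 discrete states fit together as a mixed-radix number: 5 rows x 5 columns x 5 passenger locations x 4 destinations = 500. The sketch below shows one way to pack and unpack such a tuple. The function names encode and decode and the exact digit ordering are assumptions made for illustration; check the environment's source before relying on a specific index.

```python
def encode(taxi_row, taxi_col, passenger_location, destination):
    """Pack (row, col, passenger, destination) into one integer in [0, 500)."""
    index = taxi_row                        # 5 possible rows
    index = index * 5 + taxi_col            # 5 possible columns
    index = index * 5 + passenger_location  # 4 landmarks + "in the taxi"
    index = index * 4 + destination         # 4 landmarks
    return index


def decode(index):
    """Inverse of encode(): recover the 4-tuple from the integer state."""
    destination = index % 4
    index //= 4
    passenger_location = index % 5
    index //= 5
    taxi_col = index % 5
    taxi_row = index // 5
    return taxi_row, taxi_col, passenger_location, destination


print(encode(3, 1, 2, 0))  # 328
print(decode(328))         # (3, 1, 2, 0)
```

A flat integer state like this is what lets a Q-table be a plain two-dimensional array indexed by state and action.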

Official description (in English):

The Taxi Problem
    from "Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition"
    by Tom Dietterich

    Description:
    There are four designated locations in the grid world indicated by R(ed), B(lue), G(reen), and Y(ellow). When the episode starts, the taxi starts off at a random square and the passenger is at a random location. The taxi drives to the passenger's location, picks up the passenger, drives to the passenger's destination (another one of the four specified locations), and then drops off the passenger. Once the passenger is dropped off, the episode ends.

    Observations: 
    There are 500 discrete states since there are 25 taxi positions, 5 possible locations of the passenger (including the case when the passenger is in the taxi), and 4 destination locations.

MAP = [
    "+---------+",
    "|R: | : :G|",
    "| : : : : |",
    "| : : : : |",
    "| | : | : |",
    "|Y| : |B: |",
    "+---------+",
]    

    Actions: 
    There are 6 discrete deterministic actions:
    - 0: move south
    - 1: move north
    - 2: move east 
    - 3: move west 
    - 4: pickup passenger
    - 5: dropoff passenger
    
    Rewards: 
    There is a reward of -1 for each action and an additional reward of +20 for delivering the passenger. There is a reward of -10 for executing actions "pickup" and "dropoff" illegally.
    
    Rendering:
    - blue: passenger
    - magenta: destination
    - yellow: empty taxi
    - green: full taxi
    - other letters (R, G, B and Y): locations for passengers and destinations

    actions:
    - 0: south
    - 1: north
    - 2: east
    - 3: west
    - 4: pickup
    - 5: dropoff

    state space is represented by:
        (taxi_row, taxi_col, passenger_location, destination)

Updated 2022-04-10

Code

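Q-learning is in fact an off-policy temporal-difference (TD) method. For the theory, see this column's post "[Reinforcement Learning] Maze Treasure Hunt: Sarsa and Q-Learning".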
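
As a small illustration of the TD update that Q-learning performs, the sketch below applies one update to a tiny hand-made table. The function name q_update and the use of nested lists are assumptions for illustration only; the default alpha and gamma mirror the values used in this article's code.

```python
def q_update(Q, s, a, r, s_new, done, alpha=0.7, gamma=0.97):
    """Apply one Q-learning update in place and return the new Q[s][a]."""
    # Off-policy target: bootstrap from the best action in the next state,
    # regardless of which action the behaviour policy takes there.
    # At a terminal transition there is no next state to bootstrap from.
    target = r if done else r + gamma * max(Q[s_new])
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
    return Q[s][a]


# Two states, two actions; state 1 already has some learned values.
Q = [[0.0, 0.0],
     [1.0, 2.0]]
print(q_update(Q, s=0, a=1, r=-1, s_new=1, done=False))  # about 0.658
print(q_update(Q, s=0, a=0, r=20, s_new=1, done=True))   # 14.0
```

The first call moves Q[0][1] toward -1 + 0.97 * 2.0 = 0.94; the second uses the reward alone as the target because the episode has ended.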
Full code:

import gym
import numpy as np

# Create the environment
env = gym.make('Taxi-v3')
# Reset the environment
obs = env.reset()
# Render the current state of the environment
env.render()

m = env.observation_space.n  # size of the state space
n = env.action_space.n  # size of action space

# Initialize the Q-table and hyperparameters
# Q-table of shape m * n
Q = np.zeros([m, n])
# Discount factor for returns
gamma = 0.97
# Maximum number of training episodes
max_episode = 1000
# Maximum number of steps per episode
max_steps = 100
# Learning rate
alpha = 0.7
# Probability of random exploration
epsilon = 0.3

for i in range(max_episode):
    # Start each episode from a freshly reset environment
    s = env.reset()
    done = False
    for _ in range(max_steps):
        # Choose an action using the epsilon-greedy policy:
        # with probability epsilon, explore at random;
        # otherwise exploit the learned value function greedily.
        # Ties (e.g. an unvisited state whose Q-values are all zero) are broken at random.
        p = np.random.rand()
        if p < epsilon:
            a = env.action_space.sample()  # random exploration
        else:
            Q_list = Q[s, :]
            maxQ = np.max(Q_list)
            action_list = np.where(Q_list == maxQ)[0]  # several actions may share maxQ
            a = np.random.choice(action_list)
        # env.step(a) executes the chosen action and returns
        # the new state, the reward, and whether the episode is done
        s_new, r, done, _ = env.step(a)
        # Update the Q-table using the Bellman equation (np.max)
        if done:
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * r  # terminal step: no next state
        else:
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * np.max(Q[s_new, :]))
        print(Q[s, a], r)
        s = s_new
        if done:
            break

s = env.reset()
done = False
env.render()
# Test the learned agent, rendering every step
for i in range(max_steps):
    a = np.argmax(Q[s, :])
    s, _, done, _ = env.step(a)
    env.render()
    if done:
        break


# Evaluate the greedy policy over 100 episodes, collecting every per-step reward
rewards = []
for _ in range(100):
    s = env.reset()
    done = False
    for i in range(max_steps):
        a = np.argmax(Q[s, :])
        s, r, done, _ = env.step(a)
        rewards.append(r)
        if done:
            break
r_mean = np.mean(rewards)
r_var = np.var(rewards)

print("Mean per-step reward: {}, variance of per-step reward: {}.".format(r_mean, r_var))
env.close()
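
The action-selection step of the training loop can also be isolated and checked on its own. The helper below is a minimal sketch, assuming that epsilon is the probability of taking a random action; the name epsilon_greedy and its rng parameter are hypothetical and not part of gym or NumPy.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action index from one row of a Q-table."""
    # Explore: with probability epsilon, choose uniformly among all actions.
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    # Exploit: choose among the best actions, breaking ties at random so that
    # an unvisited state (all values equal) does not always yield action 0.
    best = max(q_row)
    best_actions = [a for a, q in enumerate(q_row) if q == best]
    return rng.choice(best_actions)


rng = random.Random(0)
picks = [epsilon_greedy([0.0, 5.0, 5.0, 1.0], 0.3, rng) for _ in range(10000)]
# Fraction of picks that were not one of the two greedy actions (1 and 2).
print(sum(p in (0, 3) for p in picks) / len(picks))
```

With epsilon = 0.3 and two of the four actions tied for best, roughly 0.3 * 2/4 = 15% of the picks should land on a non-greedy action.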
