Reinforcement Learning Notes: Solutions to the Chapter 3 Exercises of the Sutton Book (Ex 3.1–3.16)


Contents

Preface

Exercise 3.1

Exercise 3.2

Exercise 3.3

Exercise 3.4

Exercise 3.5

Exercise 3.6

Exercise 3.7

Exercise 3.8

Exercise 3.9

Exercise 3.10

Exercise 3.11

Exercise 3.12

Exercise 3.13

Exercise 3.14

Exercise 3.15

Exercise 3.16


Preface

        This post contains my solutions to the first half of the exercises in Chapter 3 of Sutton's RL book (2nd edition). These are my own answers, not official solutions, so correctness is not guaranteed ^-^; they are offered for reference only, and discussion is welcome. Some exercises are not yet finished and will be supplemented later.

Exercise 3.1

Devise three example tasks of your own that fit into the MDP framework, identifying for each its states, actions, and rewards. Make the three examples as different from each other as possible. The framework is abstract and flexible and can be applied in many different ways. Stretch its limits in some way in at least one of your examples.

SOLUTION

        An MDP involves five basic elements: the agent, states, actions, rewards, and returns. Its other fundamental property is the Markov property: the current state contains all the information from the past interaction between the agent and the environment, or in other words, all the information the agent needs to make its next decision. Under this property (assumption), the dynamics of the MDP are determined by the probability function:

        $$p(s', r \mid s, a) \doteq \Pr(S_{t+1}=s',\ R_{t+1}=r \mid S_t=s,\ A_t=a)$$

        Example 1. An AI agent solving a maze.

        At each grid cell the agent has to decide in which direction (forward, backward, left, right) to move, and this decision can be made from the current position and the directions available there alone, so the Markov property is satisfied. The agent's return comes from a final reward for reaching the exit, while every intermediate step yields a negative reward (a penalty); to obtain a higher return the agent must therefore learn to reach the exit in as few steps as possible. Of course, the agent's decision rule could be given more personality. For example, it might follow a "never three times in a row" principle and, unless forced to, refuse to move in the same direction three times consecutively. It would then have to remember its past decisions in order to act, which breaks the Markov property, so the task would no longer be an MDP.

        Example 2: An AI agent solving a Rubik's Cube.

        This may not be a very good example, because the Rubik's Cube has deterministic solution procedures, and an agent that has already learned such a procedure can restore the cube deterministically. Here we assume an agent with no prior knowledge of the cube.

 

Exercise 3.2

Is the MDP framework adequate to usefully represent all goal-directed learning tasks? Can you think of any clear exceptions?

SOLUTION

Exercise 3.3

Consider the problem of driving. You could define the actions in terms of the accelerator, steering wheel, and brake, that is, where your body meets the machine. Or you could define them farther out—say, where the rubber meets the road, considering your actions to be tire torques. Or you could define them farther in—say, where your brain meets your body, the actions being muscle twitches to control your limbs. Or you could go to a really high level and say that your actions are your choices of where to drive. What is the right level, the right place to draw the line between agent and environment? On what basis is one location of the line to be preferred over another? Is there any fundamental reason for preferring one location over another, or is it a free choice?

Exercise 3.4

Give a table analogous to that in Example 3.3, but for p(s', r|s, a). It should have columns for s, a, s', r, and p(s', r|s, a), and a row for every 4-tuple for which p(s', r|s, a) > 0.

SOLUTION

The right-hand side of the figure below shows the transition graph of Example 3.3; the exercise asks for a table of p(s', r | s, a) against (s, a, s', r) based on that same transition graph.

[Figure: the transition table (left) and transition graph (right) of Example 3.3]

One can of course build the table faithfully from the transition graph on the right. But in fact the required table is simply the table on the left of the figure with its rightmost column (the reward) moved to the left of the probability column, i.e. turned from an output into an input. This yields the required table shown below (with the fourth column relabelled p(s', r | s, a), and the rows with p(s', r | s, a) = 0 deleted):

[Figure: the resulting table of p(s', r | s, a) for Example 3.3]
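The same column shuffle can be sketched in a few lines of Python. The table entries below are purely illustrative placeholders (the real numbers are in the figure above and in Example 3.3 of the book); only the transformation itself is the point:

```python
# Hypothetical fragment of an Example-3.3-style table:
# (s, a, s') -> (p(s'|s,a), r(s,a,s')).  The actual values are in the book/figure.
table_3_3 = {
    ("high", "search", "high"):   (0.8, 1.5),   # illustrative numbers only
    ("high", "search", "low"):    (0.2, 1.5),
    ("low",  "recharge", "high"): (1.0, 0.0),
    ("low",  "recharge", "low"):  (0.0, 0.0),   # zero-probability row, to be dropped
}

# Move the reward from an "output" column to an "input" index,
# keeping only the rows with p(s', r | s, a) > 0.
table_ex_3_4 = {
    (s, a, s_next, r): p
    for (s, a, s_next), (p, r) in table_3_3.items()
    if p > 0
}

for (s, a, s_next, r), p in table_ex_3_4.items():
    print(f"s={s:4s}  a={a:8s}  s'={s_next:4s}  r={r:4}  p(s',r|s,a)={p}")
```

This works because in Example 3.3 the reward is a deterministic function of (s, a, s'), so each (s, a, s') row maps to exactly one (s, a, s', r) row.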

Exercise 3.5

The equations in Section 3.1 are for the continuing case and need to be modified (very slightly) to apply to episodic tasks. Show that you know the modifications needed by giving the modified version of (3.3).

SOLUTION

        For continuing tasks, (3.3) states:

        $$\sum_{s'\in\mathcal{S}}\sum_{r\in\mathcal{R}} p(s', r \mid s, a) = 1, \quad \forall s \in \mathcal{S},\ a \in \mathcal{A}(s) \qquad (3.3)$$

        For episodic tasks, the terminal state needs special consideration.

        In an episodic task, the terminal state $S_T$ (here T denotes the time of termination) is not an ordinary state: no action is taken after it, and no further state transitions occur. By convention, the set of all states except the terminal one (the state space) is denoted $\mathcal{S}$, and the set that also includes the terminal state is denoted $\mathcal{S}^+$. For episodic tasks the equation above should therefore be modified to:

        $$\sum_{s'\in\mathcal{S}^+}\sum_{r\in\mathcal{R}} p(s', r \mid s, a) = 1, \quad \forall s \in \mathcal{S},\ a \in \mathcal{A}(s) \qquad (3.3')$$
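To see the difference concretely, here is a minimal sketch on a made-up two-state episodic MDP (states A and B plus a terminal state T; every probability and reward below is invented for illustration). The sums equal 1 only when the terminal state is included in the next-state set, i.e. when summing over $\mathcal{S}^+$:

```python
# Toy episodic MDP (made up for illustration): states A, B and a terminal state T.
# p[(s, a)] maps (s', r) -> probability, i.e. p(s', r | s, a).
TERMINAL = "T"
p = {
    ("A", "go"): {("B", 0.0): 0.9, (TERMINAL, -1.0): 0.1},
    ("B", "go"): {("A", 0.0): 0.5, (TERMINAL, +1.0): 0.5},
}

for (s, a), dist in p.items():
    total_S_plus = sum(dist.values())                        # sum over S+ (T included)
    total_S = sum(prob for (s_next, _), prob in dist.items()
                  if s_next != TERMINAL)                      # sum over S only
    print(f"(s={s}, a={a}):  sum over S+ = {total_S_plus:.1f},  sum over S = {total_S:.1f}")

# The modified (3.3) holds when summing over S+; restricting the sum to S
# loses the probability mass that flows into the terminal state.
```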

 

Exercise 3.6

Suppose you treated pole-balancing as an episodic task but also used discounting, with all rewards zero except for −1 upon failure. What then would the return be at each time? How does this return differ from that in the discounted, continuing formulation of this task?

SOLUTION

        "the return be at each time"指的是每个回合的预期回报,根据题设条件,考虑折扣系数𝛾,这个预期回报应该为gif.latex?G_t%20%3D%20-%5Cgamma%5E%7BK-1%7D, 其中K就是每个回合所持续的步数。而如果把这个任务看作是连续性任务的话,则预期回报为:

gif.latex?G_t%20%3D%20%5Csum%5Climits_%7Bj%7D-%5Cgamma%5E%7BK_j%20-%201%7D

        其中𝐾𝑗表示第j次失败距离当前时刻的时间步数。要注意这里的一个根本的差异是在回合制的观点中,每个回合有一个预期回报序列,各回合的预期回报序列是相互独立的。而在连续制观点中,只有一个预期回报。在回合制观点中,第一个回合的G1和第二个回合的G1可能相等(比如说,两个回合的持续时间相同),但是在连续性观点中,时间指标只有一个,不同时间指标的回报是不同的。
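A minimal numerical sketch of the two views (the discount factor $\gamma = 0.9$ and the failure times below are assumed values chosen only for illustration):

```python
gamma = 0.9  # assumed discount factor

def episodic_return(K: int, gamma: float) -> float:
    """Return at time t in the episodic view: failure (reward -1) occurs K steps later."""
    return -gamma ** (K - 1)

def continuing_return(failure_steps, gamma: float) -> float:
    """Return at time t in the continuing view: failure_steps lists the number of
    steps from t to each future failure (made-up values in the call below)."""
    return sum(-gamma ** (K_j - 1) for K_j in failure_steps)

print(episodic_return(K=10, gamma=gamma))             # -0.9**9  ~ -0.387
print(continuing_return([10, 25, 40], gamma=gamma))   # every future failure contributes
```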

 

Exercise 3.7

Imagine that you are designing a robot to run a maze. You decide to give it a reward of +1 for escaping from the maze and a reward of zero at all other times. The task seems to break down naturally into episodes—the successive runs through the maze—so you decide to treat it as an episodic task, where the goal is to maximize expected total reward (3.7). After running the learning agent for a while, you find that it is showing no improvement in escaping from the maze. What is going wrong? Have you effectively communicated to the agent what you want it to achieve?

SOLUTION

Only the final escape from the maze is rewarded with +1, every intermediate step is rewarded 0, and the episodic return (3.7) is computed without discounting (equivalently, with a discount factor of 1). The total return of every episode is therefore always exactly 1, no matter how long the robot runs (assuming that, however long it takes, it eventually finds the exit). The robot thus gets no useful feedback from the episode return.

There are two possible remedies. The first is to introduce a discount factor smaller than 1; the second is to give every intermediate step a small negative reward. Either way, the more steps it takes to find the exit, the smaller the total return, so the robot can learn that to obtain a larger return it must find the exit as quickly as possible.
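A small sketch comparing the three reward schemes (the episode lengths, the discount $\gamma = 0.95$, and the step penalty of −0.01 are arbitrary choices made for illustration):

```python
def return_original(T: int) -> float:
    """+1 on escape, 0 elsewhere, no discounting: the return is always 1."""
    return 1.0

def return_discounted(T: int, gamma: float = 0.95) -> float:
    """+1 on escape, 0 elsewhere, with a discount factor gamma < 1 (assumed value)."""
    return gamma ** (T - 1)

def return_step_penalty(T: int, penalty: float = -0.01) -> float:
    """+1 on escape, a small negative reward on every other step (assumed value)."""
    return 1.0 + penalty * (T - 1)

for T in (10, 1000):  # a fast escape vs. a very slow one (made-up lengths)
    print(T, return_original(T),
          round(return_discounted(T), 4),
          round(return_step_penalty(T), 2))

# Only the original scheme returns 1.0 for both episode lengths, i.e. it cannot
# distinguish a fast escape from a slow one.
```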

Exercise 3.8

Suppose 𝛾 = 0.5 and the following sequence of rewards is received R1 = −1,R2 = 2, R3 = 6, R4 = 3, and R5 = 2, with T = 5. What are G0, G1, . . ., G5? Hint: Work backwards.

SOLUTION

        As the hint says, it is simplest to compute $G_5$ first and then work backwards using the recursion $G_t = R_{t+1} + \gamma G_{t+1}$. Since T = 5 is the terminal time, clearly $G_5 = 0$.

        $G_4 = R_5 + \gamma G_5 = 2$

        $G_3 = R_4 + \gamma G_4 = 3 + 0.5 \times 2 = 4$

        $G_2 = R_3 + \gamma G_3 = 6 + 0.5 \times 4 = 8$

        $G_1 = R_2 + \gamma G_2 = 2 + 0.5 \times 8 = 6$

        $G_0 = R_1 + \gamma G_1 = -1 + 0.5 \times 6 = 2$
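The backward recursion is easy to mechanize; a minimal sketch that reproduces the numbers above:

```python
gamma = 0.5
rewards = [-1, 2, 6, 3, 2]          # R_1 ... R_5, with T = 5

G = [0.0] * (len(rewards) + 1)      # G[5] = 0 at the terminal time
for t in range(len(rewards) - 1, -1, -1):
    G[t] = rewards[t] + gamma * G[t + 1]   # G_t = R_{t+1} + gamma * G_{t+1}

print(G)   # [2.0, 6.0, 8.0, 4.0, 2.0, 0.0]  ->  G_0 ... G_5
```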

Exercise 3.9

Suppose 𝛾 = 0.9 and the reward sequence is R1 = 2 followed by an infinite sequence of 7s. What are G1 and G0?

SOLUTION

$$G_0 = R_1 + \gamma R_2 + \gamma^2 R_3 + \cdots = 2 + 7\sum_{k=1}^{\infty}\gamma^k = 2 + \frac{7\gamma}{1-\gamma} = 65$$

$$G_1 = R_2 + \gamma R_3 + \gamma^2 R_4 + \cdots = 7\sum_{k=0}^{\infty}\gamma^k = \frac{7}{1-\gamma} = 70$$

As a check, they indeed satisfy the recursion: $G_0 = R_1 + \gamma G_1 = 2 + 0.9 \times 70 = 65$.
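A truncated numerical sum confirms the closed-form values (the truncation at 1000 terms is arbitrary; with $\gamma = 0.9$ the neglected tail is negligible):

```python
gamma = 0.9
N = 1000                                     # truncation length (arbitrary)

G1 = sum(7 * gamma ** k for k in range(N))   # R_2 = R_3 = ... = 7
G0 = 2 + gamma * G1                          # G_0 = R_1 + gamma * G_1

print(round(G1, 6), round(G0, 6))            # ~70.0 and ~65.0
```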

Exercise 3.10

Prove the second equality in (3.10): $G_t = \sum\limits_{k=0}^{\infty}\gamma^{k} = \frac{1}{1-\gamma}$

SOLUTION

        Let $f(N) = \sum_{k=0}^{N}\gamma^{k}$. Multiplying by $(1-\gamma)$ telescopes the sum: $(1-\gamma)f(N) = \sum_{k=0}^{N}\gamma^k - \sum_{k=1}^{N+1}\gamma^k = 1 - \gamma^{N+1}$, so for $\gamma \neq 1$

        $$f(N) = \sum_{k=0}^{N}\gamma^{k} = \frac{1-\gamma^{N+1}}{1-\gamma}$$

        Since $0 \le \gamma < 1$, we have $\gamma^{N+1} \to 0$ as $N \to \infty$, and therefore

        $$G_t = \lim_{N \to \infty} f(N) = \lim_{N \to \infty}\frac{1-\gamma^{N+1}}{1-\gamma} = \frac{1}{1-\gamma}$$

Exercise 3.11

If the current state is $S_t$, and actions are selected according to a stochastic policy $\pi$, then what is the expectation of $R_{t+1}$ in terms of $\pi$ and the four-argument function p (3.2)?

SOLUTION

        Equation (3.2) (the book's numbering) is:

        $$p(s', r \mid s, a) \doteq \Pr(S_t=s',\ R_t=r \mid S_{t-1}=s,\ A_{t-1}=a),$$

        $$\forall s, s' \in \mathcal{S},\ r \in \mathcal{R},\ a \in \mathcal{A}(s) \qquad (3.2)$$

 

        The expectation of $R_{t+1}$ given $S_t = s$ is obtained by averaging first over the actions selected by $\pi$ and then over the possible next states and rewards given by p:

        $$\mathbb{E}_\pi[R_{t+1} \mid S_t = s] = \sum_{a\in\mathcal{A}(s)} \pi(a \mid s) \sum_{r\in\mathcal{R}} r \sum_{s'\in\mathcal{S}} p(s', r \mid s, a), \quad \forall s \in \mathcal{S}$$
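A sanity check of the formula on a made-up single-state example (the policy, dynamics, and rewards below are all invented for illustration):

```python
# Hypothetical single state s with two actions and two (s', r) outcomes per action.
pi = {"left": 0.4, "right": 0.6}                     # pi(a | s), made-up policy
p = {                                                # p(s', r | s, a), made-up dynamics
    "left":  {("s1", 1.0): 0.7, ("s2", 0.0): 0.3},
    "right": {("s1", 1.0): 0.1, ("s2", 5.0): 0.9},
}

expected_reward = sum(
    pi[a] * sum(prob * r for (s_next, r), prob in p[a].items())
    for a in pi
)
print(round(expected_reward, 4))   # 0.4*(0.7*1.0) + 0.6*(0.1*1.0 + 0.9*5.0) = 3.04
```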

Exercise 3.12

Give an equation for $v_\pi$ in terms of $q_\pi$ and $\pi$.

SOLUTION

        By definition, $v_\pi(s)$ is the expected return obtained from state s when actions are chosen according to $\pi$, while $q_\pi(s,a)$ is the expected return obtained from state s when the specific action a is taken first (and $\pi$ is followed thereafter). Hence $v_\pi(s)$ can be viewed as the expectation of $q_\pi(s,a)$ over the actions a that $\pi$ may select in state s, which gives:

        $$v_\pi(s) = \mathbb{E}_{\pi}[\,q_\pi(s,a)\,] = \sum_{a\in\mathcal{A}(s)} \pi(a \mid s)\, q_\pi(s,a)$$
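In code the relation is a single weighted sum; the action values and policy below are made up for illustration:

```python
# Hypothetical action values and policy for one state s.
q = {"up": 1.0, "down": -0.5, "left": 0.2, "right": 0.3}       # q_pi(s, a), made up
pi = {"up": 0.25, "down": 0.25, "left": 0.25, "right": 0.25}   # pi(a | s), made up

v = sum(pi[a] * q[a] for a in q)   # v_pi(s) = sum_a pi(a|s) * q_pi(s, a)
print(round(v, 4))                 # 0.25 * (1.0 - 0.5 + 0.2 + 0.3) = 0.25
```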

Exercise 3.13

Give an equation for $q_\pi$ in terms of $v_\pi$ and the four-argument p.

SOLUTION

At first glance this feels a bit less intuitive than Exercise 3.12. With the help of the backup diagram shown below, however, the problem becomes very clear. (A corresponding backup diagram could of course be drawn for Exercise 3.12 as well.)

[Figure: backup diagram for $q_\pi(s,a)$, showing branches from the action node (s,a) to next states s' and rewards r, each with probability $p(s', r \mid s, a)$]

        To compute the expected return $q_\pi(s,a)$ starting from the pair (s,a), as the diagram shows, we take the probability-weighted average of the expected returns obtained along the individual branches.

        The leftmost branch means that, starting from (s,a), the agent reaches state $s_1$ with probability $p(s_1, r_1 \mid s, a)$ and receives immediate reward $r_1$. We know that the expected return from $s_1$ onward is $v_\pi(s_1)$, so the total expected return along this branch (not forgetting the discount) is $r_1 + \gamma v_\pi(s_1)$. The other branches are analogous. Note that a branch is not always a pairing of $s_k$ with $r_k$: each branch corresponds to one pair $(S_{t+1}=s_k, R_{t+1}=r_j)$, so with M states and K possible immediate rewards there are $M \times K$ possible branches out of (s,a).

        Therefore:

        $$q_\pi(s,a) = \sum_{s'}\sum_{r} p(s', r \mid s, a)\,\big(r + \gamma v_\pi(s')\big)$$

        Of course, by comparing the Bellman equation (3.14) for the state-value function with the equation of Exercise 3.12, the expression above can also be written down directly.
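The backup can also be evaluated directly from the formula; the dynamics, state values, and discount below are made up for illustration:

```python
gamma = 0.9                                    # assumed discount factor
v = {"s1": 2.0, "s2": -1.0}                    # v_pi(s'), made-up values
p_sa = {("s1", 1.0): 0.6, ("s2", 0.0): 0.4}    # p(s', r | s, a) for one fixed (s, a)

q_sa = sum(prob * (r + gamma * v[s_next])
           for (s_next, r), prob in p_sa.items())
print(round(q_sa, 4))   # 0.6*(1.0 + 0.9*2.0) + 0.4*(0.0 + 0.9*(-1.0)) = 1.32
```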

Exercise 3.14

The Bellman equation (3.14) must hold for each state for the value function $v_\pi$ shown in Figure 3.2 (right) of Example 3.5. Show numerically that this equation holds for the center state, valued at +0.7, with respect to its four neighboring states, valued at +2.3, +0.4, −0.4, and +0.7. (These numbers are accurate only to one decimal place.)

[Figure: the 5×5 gridworld of Example 3.5 and its state-value function $v_\pi$ (Figure 3.2)]

SOLUTION

        Denote the center state and the cells above, below, to the right of, and to the left of it by c, u, d, r, l respectively. From the description of Example 3.5, in every cell each of the four actions up, down, right, left is taken with probability 1/4:

        $$\pi(\text{up} \mid c) = \pi(\text{down} \mid c) = \pi(\text{right} \mid c) = \pi(\text{left} \mid c) = 0.25$$

        Moreover, the figure shows that, given (s, a), the next state is deterministic, and none of the four moves from the center runs into the edge, so each of them yields an immediate reward of 0:

$$p(u, 0 \mid c, \text{up}) = p(d, 0 \mid c, \text{down}) = p(l, 0 \mid c, \text{left}) = p(r, 0 \mid c, \text{right}) = 1.0$$

        The values of the four neighbours are $v_\pi(u)=2.3$, $v_\pi(r)=0.4$, $v_\pi(d)=-0.4$, $v_\pi(l)=0.7$, and the example uses $\gamma = 0.9$, so:

$$\begin{aligned} v_\pi(c) &= \sum_{a\in\mathcal{A}(c)} \pi(a \mid c) \sum_{s'\in\mathcal{S},\, r\in\mathcal{R}} p(s', r \mid c, a)\,\big[r + \gamma v_\pi(s')\big] \\ &= 0.25 \times 0.9 \times (2.3 + 0.4 + 0.7 - 0.4) \\ &= 0.25 \times 0.9 \times 3.0 \\ &= 0.675 \end{aligned}$$

        Rounding to one decimal place gives 0.7, as required.
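The same check in a few lines (the neighbour values and $\gamma = 0.9$ come from Example 3.5; all four moves from the center yield reward 0):

```python
gamma = 0.9
v_neighbors = {"up": 2.3, "down": -0.4, "left": 0.7, "right": 0.4}  # v_pi of u, d, l, r
pi_a = 0.25        # equiprobable random policy
reward = 0.0       # no move from the center runs into the edge, so r = 0

v_center = sum(pi_a * 1.0 * (reward + gamma * v_next)   # p(s', r | c, a) = 1 for each a
               for v_next in v_neighbors.values())
print(round(v_center, 3))   # 0.675, which rounds to 0.7
```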

Exercise 3.15

In the gridworld example, rewards are positive for goals, negative for running into the edge of the world, and zero the rest of the time. Are the signs of these rewards important, or only the intervals between them? Prove, using (3.8), that adding a constant c to all the rewards adds a constant, $v_c$, to the values of all states, and thus does not affect the relative values of any states under any policies. What is $v_c$ in terms of c and $\gamma$?

 

Solution

(1) Only the intervals between the rewards matter, not their signs.

        As the proof in (2) shows, adding a constant to every immediate reward does not change the relative values of the states, and such a shift can change the signs of the rewards arbitrarily; the signs themselves therefore carry no special meaning.

(2) Proof:

        Write $G_t^c$ for the return obtained after adding a constant c to every immediate reward. By (3.8),

                $$\begin{aligned} G_t^c &= \sum_{k=0}^{\infty}\gamma^{k}\,(R_{t+k+1}+c) \\ &= \sum_{k=0}^{\infty}\gamma^{k} R_{t+k+1} + c\sum_{k=0}^{\infty}\gamma^{k} \\ &= G_t + \frac{c}{1-\gamma}, \qquad 0 \le \gamma < 1. \end{aligned}$$

        Substituting into the state-value function (likewise written $v^c$) gives:

                $$\begin{aligned} v^c(s) &= \mathbb{E}_\pi\big[\,G_t^c \mid S_t = s\,\big] \\ &= \mathbb{E}_\pi\big[\,G_t \mid S_t = s\,\big] + \frac{c}{1-\gamma} \\ &= v_\pi(s) + \frac{c}{1-\gamma} \end{aligned}$$

        So every state value is shifted by the same constant $v_c = \frac{c}{1-\gamma}$, and the relative values of the states under any policy are unaffected.
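A truncated numerical sum illustrates the shift (the discount $\gamma = 0.9$, the shift c = 1, the random reward stream, and the truncation length are all arbitrary choices):

```python
import random

gamma, c, N = 0.9, 1.0, 2000          # assumed discount, shift and truncation length
random.seed(0)
rewards = [random.uniform(-1, 1) for _ in range(N)]   # arbitrary reward stream

G   = sum(gamma ** k * r       for k, r in enumerate(rewards))
G_c = sum(gamma ** k * (r + c) for k, r in enumerate(rewards))

print(round(G_c - G, 6), round(c / (1 - gamma), 6))   # both ~= 10.0 = c / (1 - gamma)
```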

Exercise 3.16

Now consider adding a constant c to all the rewards in an episodic task, such as maze running. Would this have any effect, or would it leave the task unchanged as in the continuing task above? Why or why not? Give an example. 

Solution:

        First, assume that the exercise intends the undiscounted episodic setting, i.e. a discount factor $\gamma = 1$. (An episodic task may still use a discount factor smaller than 1, and whether or not discounting is used affects the answer.)

        Without discounting, adding a constant c to every immediate reward changes the return at time t to:

        $$G_t^c = (R_{t+1}+c) + \cdots + (R_T+c) = G_t + (T-t)\,c$$

        Unlike in the continuing case, the added constant does not amount to a uniform shift: its contribution grows with the episode length T − t. Consider a simple case in which the reward for reaching the goal is positive and every other step yields a reward r < 0.

  1. If c is large enough that c + r > 0, a maze runner that never finds the exit and keeps circling accumulates a return that grows without bound, whereas finding the exit in finite time yields only a finite return (assuming every immediate reward is finite). The agent should then learn to avoid the exit: the longer it stalls, the larger its return!
  2. If c + r = 0, the return is the same no matter how long the agent takes, as long as it eventually reaches the goal.
  3. If c + r < 0, every extra step is still penalized.

        Cases 1 and 2 clearly change the task (case 1: the agent learns to stall as long as possible; case 2: any behaviour is equally good, so the agent learns nothing useful), whereas case 3 leaves the agent's incentives unchanged and it still learns to find the exit as quickly as possible. So, in contrast to the continuing task above, adding a constant to all the rewards can change an episodic task.
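A minimal sketch of the three cases (the per-step reward −0.1, the shift values, and the episode lengths below are made up for illustration):

```python
def episodic_return(T: int, step_reward: float, goal_reward: float) -> float:
    """Undiscounted return of an episode that reaches the goal after T steps."""
    return step_reward * (T - 1) + goal_reward

r = -0.1                              # made-up per-step reward (r < 0)
for c in (0.2, 0.1, 0.0):             # shifts giving c + r > 0, = 0, < 0
    short = episodic_return(10,  r + c, 1.0 + c)
    long_ = episodic_return(100, r + c, 1.0 + c)
    print(f"c={c}:  10-step return = {short:6.2f}   100-step return = {long_:6.2f}")

# c = 0.2 (c+r > 0): the longer episode earns more  -> the agent learns to stall.
# c = 0.1 (c+r = 0): both returns are equal         -> no incentive to finish quickly.
# c = 0.0 (c+r < 0): the shorter episode earns more -> incentives unchanged.
```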

 

Master index of this series of notes: https://chenxiaoyuan.blog.csdn.net/article/details/121715424

For the second half of the Chapter 3 exercises (Ex 3.17–3.29) of the Sutton RL book (2nd edition), see: https://chenxiaoyuan.blog.csdn.net/article/details/122898572