


Machine Learning Notes: Markov Chain Monte Carlo Methods (3): The MH Sampling Algorithm

2023-09-11

Introduction

The previous section introduced Markov chains (Markov Chain) and stationary distributions. This section combines the Markov chain with the Monte Carlo method and introduces the MH sampling algorithm.

Review: Markov Chains and the Stationary Distribution

Markov Chains

Taking the first-order homogeneous Markov assumption as an example, the random variable $\mathcal X_{t+1}$ at time $t+1$ is related to the random variable $\mathcal X_t$ at time $t$ as follows:
$$P(\mathcal X_{t+1} \mid \mathcal X_t, \mathcal X_{t-1}, \cdots, \mathcal X_1) = P(\mathcal X_{t+1} \mid \mathcal X_t)$$
Consider the case where, at any time of the Markov chain, the random variable $\mathcal X_t$ takes values from a discrete set; suppose there are $\mathcal K$ possible choices in total. The probability distribution $\pi(\mathcal X_t)$ of the random variable $\mathcal X_t$ at time $t$ is then:
$$\pi(\mathcal X_t) = \left[\pi(\mathcal X_t = x_1), \pi(\mathcal X_t = x_2), \cdots, \pi(\mathcal X_t = x_{\mathcal K})\right]^T_{\mathcal K \times 1}$$
Correspondingly, the state transition matrix of this Markov chain is a $\mathcal K \times \mathcal K$ square matrix, and every element $a_{ij} \in \mathcal A$ is a conditional probability of the transition process:
$$\mathcal A = [a_{ij}]_{\mathcal K \times \mathcal K} \quad i,j \in \{1,2,\cdots,\mathcal K\}$$
$$a_{ij} = P(\mathcal X_{t+1} = x_j \mid \mathcal X_t = x_i) \quad t = 1,2,\cdots$$
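The evolution of the distribution under these definitions, $\pi(\mathcal X_{t+1})(j) = \sum_i \pi(\mathcal X_t)(i) \, a_{ij}$, can be sketched in a few lines of Python. The 3-state matrix below is a made-up example, not taken from the text:

```python
# A hypothetical 3-state transition matrix A (each row sums to 1):
# a_ij = P(X_{t+1} = x_j | X_t = x_i)
A = [
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
]

def step(pi, A):
    """One step of the chain: pi_{t+1}(j) = sum_i pi_t(i) * a_ij."""
    K = len(pi)
    return [sum(pi[i] * A[i][j] for i in range(K)) for j in range(K)]

pi = [1.0, 0.0, 0.0]  # start deterministically in state x_1
for _ in range(50):
    pi = step(pi, A)
print([round(p, 4) for p in pi])  # the distribution has numerically converged
```

After enough steps, one further application of `step` no longer changes `pi`, which previews the notion of a stationary distribution below.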

The Stationary Distribution

A stationary distribution (Stationary Distribution) is a probability distribution under which the Markov chain exhibits stationarity. The underlying reason a stationary distribution exists is that the state transition matrix is a stochastic matrix:

  • the state transition matrix is a $\mathcal K \times \mathcal K$ square matrix;
  • each of its rows sums to 1;

By the properties of stochastic matrices, every eigenvalue $\lambda$ of the state transition matrix satisfies $|\lambda| \leq 1$.
It can be shown that as long as the state transition matrix of the Markov chain is fixed, the chain necessarily approaches the stationary distribution over the course of future transitions.
For the detailed proof, see the previous section.

Suppose the chain reaches the stationary distribution at time $m$; then:
$$\pi(\mathcal X_m) = \pi(\mathcal X_{m+1}) = \pi(\mathcal X_{m+2}) = \cdots$$

How do we judge whether the distribution at the current time is a stationary distribution? We rely on the detailed balance condition (Detailed Balance):
$$\pi(\mathcal X = x) \cdot P(\mathcal X = x^* \mid \mathcal X = x) = \pi(\mathcal X = x^*) \cdot P(\mathcal X = x \mid \mathcal X = x^*)$$
Detailed balance is a sufficient but not necessary condition for a distribution to be stationary; thus detailed balance can be used to certify a stationary distribution, but not the other way around.
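As a minimal numerical sketch of this criterion (the symmetric matrix and the distributions below are made-up illustrations, not from the text), detailed balance can be checked pair by pair:

```python
# Hypothetical symmetric transition matrix: for a symmetric matrix,
# the uniform distribution satisfies detailed balance and is stationary.
P = [
    [0.5, 0.25, 0.25],
    [0.25, 0.5, 0.25],
    [0.25, 0.25, 0.5],
]

def detailed_balance(pi, P, tol=1e-12):
    """Check pi(i) * P(j|i) == pi(j) * P(i|j) for every state pair."""
    K = len(pi)
    return all(
        abs(pi[i] * P[i][j] - pi[j] * P[j][i]) < tol
        for i in range(K) for j in range(K)
    )

print(detailed_balance([1/3, 1/3, 1/3], P))  # True: uniform is stationary here
print(detailed_balance([0.5, 0.3, 0.2], P))  # False: this pi is not balanced
```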

The MH Sampling Algorithm

As mentioned in the introduction to Monte Carlo methods, inference (Inference) is concerned with the posterior $P(\mathcal Z \mid \mathcal X)$, or with expectations under $P(\mathcal Z \mid \mathcal X)$:
$$\mathbb E_{\mathcal Z \mid \mathcal X}[f(\mathcal Z)] = \int_{\mathcal Z} P(\mathcal Z \mid \mathcal X) f(\mathcal Z) \, d\mathcal Z$$
With the Monte Carlo method, we draw samples from the distribution $P(\mathcal Z \mid \mathcal X)$ and then compute:
$$z^{(1)}, z^{(2)}, \cdots, z^{(N)} \sim P(\mathcal Z \mid \mathcal X)$$
$$\mathbb E_{\mathcal Z \mid \mathcal X}[f(\mathcal Z)] \approx \frac{1}{N} \sum_{i=1}^{N} f(z^{(i)})$$
In practice, however, drawing samples from $P(\mathcal Z \mid \mathcal X)$ is very hard, so we use a Markov chain (Markov Chain) to obtain samples indirectly.
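To make the estimator concrete before turning to the hard case, here is a sketch for a target we can sample directly; the standard normal target and $f(z) = z^2$ are assumptions chosen for illustration (the true expectation is $1$):

```python
import random

random.seed(0)

# When the target CAN be sampled directly, the Monte Carlo estimate is simple.
# Example: estimate E[Z^2] for Z ~ N(0, 1); the true value is 1.
N = 100_000
samples = [random.gauss(0.0, 1.0) for _ in range(N)]
estimate = sum(z * z for z in samples) / N
print(round(estimate, 2))
```

The whole point of MCMC is to recover this simple averaging step even when direct draws from the target are unavailable.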

The Sampling Idea

How does a Markov chain perform sampling?
Suppose we have already obtained a state transition matrix $\mathcal A^*$ satisfying: under the first-order homogeneous Markov assumption, the distributions of the random variable $\mathcal X$ at any two consecutive times satisfy the detailed balance condition.

In other words, $\mathcal A^*$ makes the Markov chain $\{\mathcal X_T\}$ share one common probability distribution, namely the stationary distribution.

  • Now pick an initial state:
    $\mathcal X_{init} = x_i \quad (i \in \{1,2,\cdots,\mathcal K\})$
  • Via the state transition process, randomly draw the value $x_j$ of the next state's random variable:
    $x_j \sim P(\mathcal X_1 \mid \mathcal X_{init} = x_i)$
  • Likewise, conditioning on the previous state's transition process, randomly draw the value $x_k$ of the next state's random variable:
    $x_k \sim P(\mathcal X_2 \mid \mathcal X_1 = x_j)$

Continuing in this way, we eventually obtain a collection of sample points, all of which follow the stationary distribution:
$\{x_i, x_j, x_k, \cdots\}$
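This trajectory-sampling idea can be sketched for a hypothetical 2-state chain. The matrix below is made up; its stationary distribution works out to $(0.6, 0.4)$, and detailed balance can be checked directly: $0.6 \times 0.2 = 0.4 \times 0.3$:

```python
import random

random.seed(1)

# Hypothetical 2-state chain with stationary distribution (0.6, 0.4).
A = [
    [0.8, 0.2],
    [0.3, 0.7],
]

def run_chain(A, n_steps, init=0):
    """Simulate the chain, drawing each next state from the current row of A."""
    state, path = init, []
    for _ in range(n_steps):
        r, cum = random.random(), 0.0
        for j, p in enumerate(A[state]):
            cum += p
            if r < cum:
                state = j
                break
        path.append(state)
    return path

path = run_chain(A, 200_000)
kept = path[1000:]  # discard a burn-in prefix before counting
freq0 = kept.count(0) / len(kept)
print(round(freq0, 2))  # close to the stationary probability 0.6
```

The empirical state frequencies of a long trajectory approach the stationary distribution, which is exactly why such a chain can serve as a sampler.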

The MH Sampling Procedure

Based on the sampling idea above, the problem of obtaining the stationary distribution becomes: how do we find a suitable state transition matrix so that the distributions along the Markov chain are the stationary distribution?

  • First, construct a Markov chain for $P(\mathcal Z \mid \mathcal X)$ and randomly pick a state transition matrix $\mathcal Q$:
    $\mathcal Q = [q_{ij}]_{\mathcal K \times \mathcal K} \quad (i,j \in \{1,2,\cdots,\mathcal K\})$
    At this point, for the two sides of the detailed balance equation we have:
    (Not that they are guaranteed to be unequal; rather, equality is extremely unlikely. If a randomly chosen $\mathcal Q$ already yielded the stationary distribution, that would be remarkable luck.)
    $P(\mathcal Z = z \mid \mathcal X) \cdot \mathcal Q(\mathcal Z = z^* \mid \mathcal Z = z) \neq P(\mathcal Z = z^* \mid \mathcal X) \cdot \mathcal Q(\mathcal Z = z \mid \mathcal Z = z^*)$

  • Based on the expression above, suppose there exists a function $\alpha(z, z^*)$ of $z$ and $z^*$ such that
    $P(\mathcal Z = z \mid \mathcal X) \cdot \mathcal Q(\mathcal Z = z^* \mid \mathcal Z = z) \cdot \alpha(z, z^*) = P(\mathcal Z = z^* \mid \mathcal X) \cdot \mathcal Q(\mathcal Z = z \mid \mathcal Z = z^*) \cdot \alpha(z^*, z)$
    Write $\mathcal Q(\mathcal Z = z^* \mid \mathcal Z = z) \cdot \alpha(z, z^*)$ as $\mathcal P(\mathcal Z = z^* \mid \mathcal Z = z)$, and conversely write $\mathcal Q(\mathcal Z = z \mid \mathcal Z = z^*) \cdot \alpha(z^*, z)$ as $\mathcal P(\mathcal Z = z \mid \mathcal Z = z^*)$. The equation above then becomes:
    (This is only a change of notation; $\mathcal P$ and $P$ are different symbols.)
    $P(\mathcal Z = z \mid \mathcal X) \cdot \mathcal P(\mathcal Z = z^* \mid \mathcal Z = z) = P(\mathcal Z = z^* \mid \mathcal X) \cdot \mathcal P(\mathcal Z = z \mid \mathcal Z = z^*)$
    This equation satisfies the detailed balance condition, so the corresponding distribution is the stationary distribution.

  • Looking back at this construction, the function $\alpha(z, z^*)$ is called the acceptance rate, and it is defined as:
    $$\alpha(z, z^*) = \min\left[1, \frac{P(\mathcal Z = z^* \mid \mathcal X) \cdot \mathcal Q(\mathcal Z = z \mid \mathcal Z = z^*)}{P(\mathcal Z = z \mid \mathcal X) \cdot \mathcal Q(\mathcal Z = z^* \mid \mathcal Z = z)}\right]$$
    Analyzing the acceptance rate function $\alpha(z, z^*)$:

    • Observe $\alpha(z, z^*)$: since the $P(\cdot \mid \mathcal X)$ terms are probabilities and the $\mathcal Q(\cdot \mid \cdot)$ terms are conditional probabilities, the fraction is always nonnegative. Hence the range of $\alpha(z, z^*)$ is:
      $\alpha(z, z^*) \in [0, 1]$
    • Substituting $\alpha(z, z^*)$ into the equation above gives:
      $$\begin{aligned} & P(\mathcal Z = z \mid \mathcal X) \cdot \mathcal Q(\mathcal Z = z^* \mid \mathcal Z = z) \cdot \alpha(z, z^*) \\ &= P(\mathcal Z = z \mid \mathcal X) \cdot \mathcal Q(\mathcal Z = z^* \mid \mathcal Z = z) \cdot \min\left[1, \frac{P(\mathcal Z = z^* \mid \mathcal X) \cdot \mathcal Q(\mathcal Z = z \mid \mathcal Z = z^*)}{P(\mathcal Z = z \mid \mathcal X) \cdot \mathcal Q(\mathcal Z = z^* \mid \mathcal Z = z)}\right] \\ &= \min\left[P(\mathcal Z = z \mid \mathcal X) \cdot \mathcal Q(\mathcal Z = z^* \mid \mathcal Z = z),\; P(\mathcal Z = z^* \mid \mathcal X) \cdot \mathcal Q(\mathcal Z = z \mid \mathcal Z = z^*)\right] \\ &= P(\mathcal Z = z^* \mid \mathcal X) \cdot \mathcal Q(\mathcal Z = z \mid \mathcal Z = z^*) \cdot \min\left[1, \frac{P(\mathcal Z = z \mid \mathcal X) \cdot \mathcal Q(\mathcal Z = z^* \mid \mathcal Z = z)}{P(\mathcal Z = z^* \mid \mathcal X) \cdot \mathcal Q(\mathcal Z = z \mid \mathcal Z = z^*)}\right] \\ &= P(\mathcal Z = z^* \mid \mathcal X) \cdot \mathcal Q(\mathcal Z = z \mid \mathcal Z = z^*) \cdot \alpha(z^*, z) \end{aligned}$$
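The cancellation in this derivation can be verified numerically with arbitrary positive stand-in values for the densities and proposal probabilities (the numbers below are made up):

```python
# Stand-in values: target density at z and z*, proposal probabilities Q(z*|z), Q(z|z*).
p_z, p_zs = 0.2, 0.5
q_fwd, q_bwd = 0.4, 0.1

def alpha(p_from, q_from, p_to, q_to):
    """Acceptance rate: min(1, [p(to) * Q(from|to)] / [p(from) * Q(to|from)])."""
    return min(1.0, (p_to * q_to) / (p_from * q_from))

# Both sides of the detailed balance identity derived above:
lhs = p_z * q_fwd * alpha(p_z, q_fwd, p_zs, q_bwd)
rhs = p_zs * q_bwd * alpha(p_zs, q_bwd, p_z, q_fwd)
print(abs(lhs - rhs) < 1e-12)  # True: both sides equal min(p_z*q_fwd, p_zs*q_bwd)
```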

This acceptance-rate construction is the MH sampling algorithm (Metropolis-Hastings).
The algorithm proceeds as follows:

  • Sample from the uniform distribution on $(0, 1)$:
    $u \sim \mathcal U(0, 1)$
  • Conditioned on the previous sample $z^{(t-1)}$, draw a candidate for the current time $z^{(t)}$ from the proposal defined by the state transition matrix:
    (This yields a random draw $z^*$, which cannot be used directly as $z^{(t)}$; it must first pass the test below.)
    $z^* \sim \mathcal Q(\mathcal Z = z \mid \mathcal Z = z^{(t-1)})$
  • Compute the value of the $\alpha$ function:
    $$\alpha = \min\left[1, \frac{P(\mathcal Z = z^* \mid \mathcal X) \cdot \mathcal Q(\mathcal Z = z^{(t-1)} \mid \mathcal Z = z^*)}{P(\mathcal Z = z^{(t-1)} \mid \mathcal X) \cdot \mathcal Q(\mathcal Z = z^* \mid \mathcal Z = z^{(t-1)})}\right]$$
  • Compare $u$ with $\alpha$:
    (As in rejection sampling, the draw $u$ from the uniform distribution on $(0,1)$ carries no meaning of its own; it only serves as a threshold test against the value of $\alpha$.)
    • If $u \leq \alpha$, then $z^{(t)} = z^*$:
      the sample $z^*$ is accepted with probability $\alpha$.
    • Otherwise, $z^{(t)} = z^{(t-1)}$:
      this draw is rejected, and the previous sample $z^{(t-1)}$ is reused as this step's output sample. The iteration is indeed wasted, but the sample count does not shrink; the next iteration again conditions on $z^{(t-1)}$ and samples from the corresponding conditional distribution.
  • In the end we obtain a sequence of samples:
    $\{z^{(1)}, z^{(2)}, \cdots, z^{(N)}\}$
    and finally use these sample points with the Monte Carlo method to approximate the expectation under $P(\mathcal Z \mid \mathcal X)$:
    $$\mathbb E_{\mathcal Z \mid \mathcal X}[f(\mathcal Z)] \approx \frac{1}{N} \sum_{i=1}^{N} f(z^{(i)})$$
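The loop above can be sketched as a minimal MH sampler. Two assumptions here are not from the text: the target is a standard normal known only up to its normalizing constant, and $\mathcal Q$ is a symmetric Gaussian random-walk proposal, so the $\mathcal Q$ terms cancel inside $\alpha$:

```python
import math
import random

random.seed(42)

def p_tilde(z):
    """Unnormalized target density: standard normal up to a constant."""
    return math.exp(-0.5 * z * z)

def mh_sample(n_samples, step=1.0, z0=0.0):
    z, out = z0, []
    for _ in range(n_samples):
        z_star = random.gauss(z, step)                   # propose z* ~ Q(.|z^(t-1))
        alpha = min(1.0, p_tilde(z_star) / p_tilde(z))   # symmetric Q cancels
        if random.random() <= alpha:                     # u ~ U(0,1); accept if u <= alpha
            z = z_star                                   # accept: z^(t) = z*
        out.append(z)                                    # reject: keep z^(t-1)
    return out

samples = mh_sample(100_000)[5_000:]  # discard a burn-in prefix
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(round(mean, 1), round(var, 1))  # roughly 0.0 and 1.0 for a standard normal
```

Note the rejection branch still appends the old sample, exactly as in the step list above: a rejected proposal does not shrink the sample count.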

The next section will introduce the Gibbs sampling algorithm (Gibbs Sampling).

References:
Machine Learning Whiteboard Derivation Series: Monte Carlo Method (3), MH Sampling (Metropolis-Hastings)