Machine Learning Notes on Gaussian Processes (Part 4): Solving the Prediction Task in Gaussian Process Regression from the Function-Space Perspective

2023-09-11

Introduction

The previous section described how Gaussian process regression moves from the weight-space (Weight-Space) perspective to the function-space (Function-Space) perspective. This section solves the prediction task (Prediction) from the function-space perspective.

Review: The Representation from the Function-Space Perspective

Problem Setup

Given a dataset $Data = \{(x^{(i)},y^{(i)})\}_{i=1}^N$, the sample collection $\mathcal X$ and the label collection $\mathcal Y$ are written as:
$$\begin{aligned} \mathcal X & = (x^{(1)},x^{(2)},\cdots,x^{(N)})_{N \times p}^T \quad x^{(i)} \in \mathbb R^p;\ i=1,2,\cdots,N \\ \mathcal Y & = (y^{(1)},y^{(2)},\cdots,y^{(N)})_{N \times 1}^T \quad y^{(i)} \in \mathbb R;\ i=1,2,\cdots,N \end{aligned}$$
The task is nonlinear regression: the samples' feature space must be mapped from the current low-dimensional space of dimension $p$ to a high-dimensional space of dimension $q$ (with $q \gg p$) through a nonlinear transformation:
$$\mathcal X \in \mathbb R^p \to \phi(\mathcal X) \in \mathbb R^q$$

The Prediction Task from the Weight-Space Perspective

The essence of the weight-space perspective is that the model learns the model parameters $\mathcal W$ themselves: for a given unseen sample $\hat x$, the predicted label $\hat y$ is obtained from the posterior distribution $\mathcal P(\mathcal W \mid Data)$ of $\mathcal W$.
For the full derivation, see Bayesian linear regression: derivation of the inference task.
$$\begin{aligned} \mathcal P(\mathcal W \mid Data) & \propto \mathcal P(\mathcal Y \mid \mathcal W,\mathcal X) \cdot \mathcal P(\mathcal W) \\ \mathcal P(\mathcal W \mid Data) & \sim \mathcal N(\mu_{\mathcal W},\Sigma_{\mathcal W}) \quad \begin{cases} \mu_{\mathcal W} = \dfrac{\mathcal A^{-1}\mathcal X^T\mathcal Y}{\sigma^2} \\ \Sigma_{\mathcal W} = \mathcal A^{-1} \\ \mathcal A = \dfrac{\mathcal X^T\mathcal X}{\sigma^2} + [\Sigma_{prior}^{-1}]_{p \times p} \end{cases} \end{aligned}$$

  • Here $\sigma^2$ is the variance of the Gaussian noise $\epsilon$ (a one-dimensional random variable) in the linear model $\mathcal Y = \mathcal W^T \mathcal X + \epsilon,\ \epsilon \sim \mathcal N(0,\sigma^2)$;
  • $\Sigma_{prior}$ is the covariance matrix of the prior distribution of $\mathcal W$: $\mathcal P(\mathcal W) \sim \mathcal N(0,\Sigma_{prior})$.

For the nonlinear regression task, substituting $\mathcal X \to \phi(\mathcal X)$ changes the posterior distribution accordingly:
Note that the covariance matrix $\Sigma_{prior}$ of the prior also changes in size to $q \times q$.
$$\mathcal P(\mathcal W \mid Data) \sim \mathcal N(\mu_{\mathcal W},\Sigma_{\mathcal W}) \quad \begin{cases} \mu_{\mathcal W} = \dfrac{\mathcal A^{-1}[\phi(\mathcal X)]^T \mathcal Y}{\sigma^2} \\ \Sigma_{\mathcal W} = \mathcal A^{-1} \\ \mathcal A = \dfrac{[\phi(\mathcal X)]^T\phi(\mathcal X)}{\sigma^2} + [\Sigma_{prior}^{-1}]_{q \times q} \end{cases}$$
Once the posterior distribution $\mathcal P(\mathcal W \mid Data)$ has been solved, the prediction for a given unseen sample $\hat x$ is:
For the identity used here, see the related theorems on Gaussian distributions.
$$\begin{aligned} \mathcal P(\hat y \mid \hat x,Data) & = \int_{\mathcal W \mid Data} \mathcal P(\hat y \mid \mathcal W,\hat x) \cdot \mathcal P(\mathcal W \mid Data)\, d\mathcal W \\ & = \mathcal N([\phi(\hat x)]^T\mathcal W,\sigma^2) \cdot \mathcal N(\mu_{\mathcal W},\Sigma_{\mathcal W}) \\ & \sim \mathcal N\left[[\phi(\hat x)]^T \mu_{\mathcal W},\ [\phi(\hat x)]^T \Sigma_{\mathcal W}\, \phi(\hat x) + \sigma^2\right] \\ & = \mathcal N\left[[\phi(\hat x)]^T \left(\frac{\mathcal A^{-1}[\phi(\mathcal X)]^T \mathcal Y}{\sigma^2}\right),\ [\phi(\hat x)]^T \mathcal A^{-1}\, \phi(\hat x) + \sigma^2\right] \end{aligned}$$
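The weight-space predictive distribution above can be sketched numerically. This is a minimal illustration only: the feature map $\phi$, the data, the noise variance, and the identity prior covariance are all assumptions introduced for the example, not choices made in the text.

```python
import numpy as np

def phi(X):
    # hypothetical feature map R^1 -> R^3: (1, x, x^2); p = 1, q = 3
    return np.column_stack([np.ones_like(X), X, X**2])

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=20)             # training inputs
Y = np.sin(X) + 0.1 * rng.normal(size=20)   # noisy labels

sigma2 = 0.01                               # noise variance sigma^2
Phi = phi(X)                                # N x q design matrix phi(X)
Sigma_prior_inv = np.eye(Phi.shape[1])      # prior precision (Sigma_prior = I)

# A = phi(X)^T phi(X) / sigma^2 + Sigma_prior^{-1}
A = Phi.T @ Phi / sigma2 + Sigma_prior_inv
A_inv = np.linalg.inv(A)
mu_W = A_inv @ Phi.T @ Y / sigma2           # posterior mean of W

# predictive distribution at a new point x_hat
x_hat = np.array([0.5])
phi_hat = phi(x_hat)                                    # 1 x q
pred_mean = (phi_hat @ mu_W).item()                     # phi(x_hat)^T mu_W
pred_var = (phi_hat @ A_inv @ phi_hat.T).item() + sigma2
print(pred_mean, pred_var)
```

Note that every appearance of $\phi$ here is through products like $\phi(\mathcal X)^T\phi(\mathcal X)$, which motivates the kernel trick introduced next.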

From the Weight-Space Perspective to the Function-Space Perspective

First, solving for the nonlinear transformation $\phi(\cdot)$ itself is very expensive, and in the formulas above $\phi(\cdot)$ only ever appears through inner products. We therefore look for a function that directly returns the inner-product result of $\phi(\cdot)$, saving a large amount of computation:
$$\mathcal K(x^{(i)},x^{(j)}) = [\phi(x^{(i)})]^T \Sigma_{prior}\, \phi(x^{(j)})$$
Here $\mathcal K(x^{(i)},x^{(j)})$ is a kernel function (Kernel Function). From the function-space perspective, the kernel function can be rewritten as follows:

  • For the proof that $\mathcal K(x^{(i)},x^{(j)})$ is a kernel function, see Gaussian process regression from the weight-space perspective (sufficiency) and Gaussian process regression from the function-space perspective (necessity);
  • $\mathbb E[f(x^{(i)})] = \mathbb E[f(x^{(j)})] = 0$ because $f(x^{(i)}) = [\phi(x^{(i)})]^T \mathcal W$ and the prior of $\mathcal W$ has zero mean, so $\mathbb E[f(x^{(i)})] = [\phi(x^{(i)})]^T\, \mathbb E[\mathcal W] = 0$.

$$\begin{aligned} \mathcal K(x^{(i)},x^{(j)}) & = [\phi(x^{(i)})]^T \cdot \mathbb E[\mathcal W \cdot \mathcal W^T] \cdot \phi(x^{(j)}) \\ & = \mathbb E \left\{[\phi(x^{(i)})]^T \mathcal W \cdot [\phi(x^{(j)})]^T \mathcal W\right\} \\ & = \mathbb E \left\{\left[f(x^{(i)}) - \mathbb E[f(x^{(i)})]\right] \cdot \left[f(x^{(j)}) - \mathbb E[f(x^{(j)})]\right]\right\} \\ & = \mathrm{Cov}\left[f(x^{(i)}),f(x^{(j)})\right] \end{aligned}$$
We find that the kernel function $\mathcal K(x^{(i)},x^{(j)})$ is exactly the covariance of $f(x^{(i)})$ and $f(x^{(j)})$. This suggests treating $f(x)$ itself as the random object and expressing the posterior and predictive distributions in terms of $f(x)$.
Strictly speaking, $f(x)$ is not a single random variable but a collection of random variables indexed over the $p$-dimensional real domain:
$$f(x^{(i)}) = \mathcal W^T \phi(x^{(i)}) = [\phi(x^{(i)})]^T\mathcal W \quad x^{(i)} \in \mathcal X$$
The prediction task expressed in terms of $f(x)$ is therefore:
$$\mathcal P(\hat y \mid Data,\hat x) = \int_{f(\mathcal X)} \mathcal P(\hat y \mid f(\mathcal X),\hat x) \cdot \mathcal P[f(\mathcal X) \mid Data]\, df(\mathcal X)$$
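The kernel trick described above can be made concrete with a small check: a kernel evaluates the inner product of $\phi(\cdot)$ directly, without ever computing $\phi$ explicitly. The degree-2 polynomial kernel used here is a standard textbook choice picked only for illustration; the text does not fix a particular kernel.

```python
import numpy as np

def phi(x):
    # explicit feature map for the degree-2 polynomial kernel on R^2:
    # phi(x) = (x1^2, sqrt(2) x1 x2, x2^2), mapping R^2 -> R^3
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def kernel(x, y):
    # K(x, y) = (x^T y)^2, computed entirely in the low-dimensional space
    return float(x @ y) ** 2

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])

lhs = kernel(x, y)            # inner product via the kernel
rhs = float(phi(x) @ phi(y))  # same inner product via explicit phi
print(lhs, rhs)               # the two values agree
```

For high-dimensional (or infinite-dimensional) feature maps, only the kernel side of this identity remains tractable, which is exactly why the function-space view works with $\mathcal K$ rather than $\phi$.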

Solving the Prediction Task from the Function-Space Perspective

The collection of random variables $f(\mathcal X)$ is a Gaussian process, and it follows a Gaussian distribution:
$$\{f(\mathcal X)\}_{\mathcal X \in \mathbb R^p} \sim \mathcal N[\mu(\mathcal X),\mathcal K(\mathcal X,\mathcal X)]$$
where $\mu(\mathcal X)$ is the mean function (Mean Function), and $\mathcal K(\mathcal X,\mathcal X)$ denotes not a single entry but the entire kernel matrix (Kernel Matrix):
$$\mathcal K(\mathcal X,\mathcal X) = \begin{bmatrix} \mathcal K(x^{(1)},x^{(1)}) & \mathcal K(x^{(1)},x^{(2)}) & \cdots & \mathcal K(x^{(1)},x^{(N)}) \\ \mathcal K(x^{(2)},x^{(1)}) & \mathcal K(x^{(2)},x^{(2)}) & \cdots & \mathcal K(x^{(2)},x^{(N)}) \\ \vdots & \vdots & \ddots & \vdots \\ \mathcal K(x^{(N)},x^{(1)}) & \mathcal K(x^{(N)},x^{(2)}) & \cdots & \mathcal K(x^{(N)},x^{(N)}) \end{bmatrix}_{N \times N}$$
The corresponding label vector $\mathcal Y$ is then distributed as:
$$\mathcal Y = f(\mathcal X) + \epsilon \sim \mathcal N[\mu(\mathcal X),\mathcal K(\mathcal X,\mathcal X) + \sigma^2 \mathcal I_{N \times N}]$$
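The kernel matrix and the prior of $\mathcal Y$ above can be sketched in a few lines of NumPy. The RBF kernel, its length scale, and the zero mean function are assumptions made for this example only; the text itself leaves the kernel unspecified.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    # K[i, j] = exp(-||a_i - b_j||^2 / (2 l^2)): an |A| x |B| kernel matrix
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * length_scale**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(10, 1))    # N = 10 training inputs, p = 1
sigma2 = 0.1                             # noise variance sigma^2

K = rbf_kernel(X, X)                     # N x N kernel matrix K(X, X)
mu = np.zeros(len(X))                    # zero mean function mu(X)

# Y = f(X) + eps  ~  N(mu(X), K(X, X) + sigma^2 I)
cov_Y = K + sigma2 * np.eye(len(X))
Y_sample = rng.multivariate_normal(mu, cov_Y)
print(Y_sample.shape)
```

Each draw from this distribution is one possible noisy label vector over the $N$ training inputs, which is exactly what the displayed formula for $\mathcal Y$ asserts.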

Now suppose a new sample collection $\mathcal X_* = (x_*^{(1)},x_*^{(2)},\cdots,x_*^{(\mathcal M)})_{\mathcal M \times p}^T$ is given, with predicted labels $\mathcal Y_* = f(\mathcal X_*) + \epsilon$. First, the joint distribution $\mathcal P\left[f(\mathcal X_*),\mathcal Y \mid \mathcal X,\mathcal X_*\right]$ of the label collection $\mathcal Y$ and the noise-free result $f(\mathcal X_*)$ is:
$$\begin{bmatrix} \mathcal Y \\ f(\mathcal X_*) \end{bmatrix}_{(N+\mathcal M) \times 1} \sim \mathcal N \left\{\begin{bmatrix} \mu(\mathcal X) \\ \mu(\mathcal X_*) \end{bmatrix},\begin{bmatrix} \mathcal K(\mathcal X,\mathcal X) + \sigma^2 \mathcal I_{N \times N} & \mathcal K(\mathcal X,\mathcal X_*)_{N \times \mathcal M} \\ \mathcal K(\mathcal X_*,\mathcal X)_{\mathcal M \times N} & \mathcal K(\mathcal X_*,\mathcal X_*)_{\mathcal M \times \mathcal M} \end{bmatrix}_{(N+\mathcal M) \times (N+\mathcal M)}\right\}$$

The problem has thus become: given a joint distribution, solve for the conditional distribution $\mathcal P\left[f(\mathcal X_*) \mid Data,\mathcal X_*\right] = \mathcal P\left[f(\mathcal X_*) \mid \mathcal Y,\mathcal X,\mathcal X_*\right]$.
This uses the standard result on inference with Gaussian distributions (deriving a conditional distribution from a known joint distribution), so the derivation is not repeated here.

Writing the conditional distribution in Gaussian form as $\mathcal P\left[f(\mathcal X_*) \mid \mathcal Y,\mathcal X,\mathcal X_*\right] \sim \mathcal N(\mu^*,\Sigma^*)$, the parameters $\mu^*,\Sigma^*$ are:
$$\begin{cases} \mu^* = \mathcal K(\mathcal X_*,\mathcal X) \cdot [\mathcal K(\mathcal X,\mathcal X) + \sigma^2 \mathcal I]^{-1}[\mathcal Y - \mu(\mathcal X)] + \mu(\mathcal X_*) \\ \Sigma^* = \mathcal K(\mathcal X_*,\mathcal X_*) - \mathcal K(\mathcal X_*,\mathcal X)[\mathcal K(\mathcal X,\mathcal X) + \sigma^2 \mathcal I]^{-1} \mathcal K(\mathcal X,\mathcal X_*) \end{cases}$$

The conditional/posterior distribution of $f(\mathcal X_*)$ is now solved, but it is still the noise-free version; the Gaussian noise must be added back. The posterior $\mathcal P(\mathcal Y_* \mid Data,\mathcal X_*)$ of $\mathcal Y_*$ is then:
$$\begin{aligned} \mathcal Y_* & = f(\mathcal X_*) + \epsilon \\ \mathcal P(\mathcal Y_* \mid Data,\mathcal X_*) & \sim \mathcal N(\mu_{\mathcal Y}^*,\Sigma_{\mathcal Y}^*) \quad \begin{cases} \mu_{\mathcal Y}^* = \mu^* + 0 = \mu^* \\ \Sigma_{\mathcal Y}^* = \Sigma^* + \sigma^2 \mathcal I_{\mathcal M \times \mathcal M} \end{cases} \end{aligned}$$
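The whole function-space prediction pipeline, computing $\mu^*$ and $\Sigma^*$ and then adding the noise back, fits in a short NumPy sketch. The RBF kernel, the synthetic data, and the zero mean function ($\mu(\mathcal X) = \mu(\mathcal X_*) = 0$) are illustrative assumptions, not prescribed by the text.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    # K[i, j] = exp(-||a_i - b_j||^2 / (2 l^2))
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * length_scale**2))

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(20, 1))            # N = 20 training inputs
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)  # noisy labels
X_star = np.linspace(-3, 3, 5)[:, None]          # M = 5 test inputs
sigma2 = 0.01                                    # noise variance

K_XX = rbf_kernel(X, X)            # K(X, X)
K_sX = rbf_kernel(X_star, X)       # K(X*, X), M x N
K_ss = rbf_kernel(X_star, X_star)  # K(X*, X*), M x M

# solve against (K + sigma^2 I) rather than forming an explicit inverse
A = K_XX + sigma2 * np.eye(len(X))
mu_star = K_sX @ np.linalg.solve(A, Y)                  # mu* (zero mean fn)
Sigma_star = K_ss - K_sX @ np.linalg.solve(A, K_sX.T)   # Sigma* for f(X*)

# posterior of Y* = f(X*) + eps: add the noise variance back
Sigma_Y_star = Sigma_star + sigma2 * np.eye(len(X_star))
print(mu_star.shape, Sigma_Y_star.shape)
```

Using `np.linalg.solve` instead of an explicit matrix inverse is the usual numerically stable way to apply $[\mathcal K(\mathcal X,\mathcal X) + \sigma^2\mathcal I]^{-1}$.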

In practice, solving the prediction task from the function-space perspective is somewhat simpler than from the weight-space perspective. Because the collection of random variables is modeled as a Gaussian process, the result is unaffected by $\phi(\cdot)$; and there is no need to solve for the posterior distribution of the model parameters $\mathcal W$, since the prediction task is handled directly by inference.

This concludes the introduction to Gaussian processes. The notation used across the Gaussian process and Bayesian linear regression posts will be checked and corrected in follow-up revisions.

References:
Machine Learning: Gaussian Process Regression from the Function-Space Perspective (Function-Space)