
Critical initialization of wide and deep neural networks through partial Jacobians: general theory and applications to LayerNorm

Posted: 2023-03-20 14:50:40

Deep neural networks are notorious for defying theoretical treatment. However, when the number of parameters in each layer tends to infinity, the network function is a Gaussian process (GP) and a quantitatively predictive description becomes possible. The Gaussian approximation allows one to formulate criteria for selecting hyperparameters, such as the variances of weights and biases, as well as the learning rate. These criteria rely on a notion of criticality defined for deep neural networks. In this work, we describe a new way to diagnose this criticality, both theoretically and empirically. To that end, we introduce the partial Jacobians of a network, defined as the derivatives of the preactivations in layer l with respect to the preactivations in an earlier layer l_0 < l. These quantities are particularly useful when the network architecture involves many different layers. We discuss various properties of the partial Jacobians, such as their scaling with depth and their relation to the neural tangent kernel (NTK). We derive recurrence relations for the partial Jacobians and use them to analyze the criticality of deep MLP networks with (and without) LayerNorm. We find that the normalization layer changes the optimal value.
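For concreteness, the chain-rule structure behind the recurrence relations mentioned above can be written as follows; the notation is illustrative and not necessarily the paper's exact normalization:

```latex
% h^l denotes the preactivations of layer l (illustrative notation).
J^{(l_0,\,l)}_{ij} \equiv \frac{\partial h^{l}_{i}}{\partial h^{l_0}_{j}},
\qquad
J^{(l_0,\,l+1)} = \frac{\partial h^{l+1}}{\partial h^{l}} \, J^{(l_0,\,l)}
\quad \text{(chain rule)}.
```

Roughly speaking, criticality corresponds to the average squared norm of J^{(l_0, l)} neither growing nor decaying exponentially as the depth l - l_0 increases.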

Original title: Critical initialization of wide and deep neural networks through partial Jacobians: general theory and applications to LayerNorm

Original abstract: Deep neural networks are notorious for defying theoretical treatment. However, when the number of parameters in each layer tends to infinity the network function is a Gaussian process (GP) and quantitatively predictive description is possible. Gaussian approximation allows to formulate criteria for selecting hyperparameters, such as variances of weights and biases, as well as the learning rate. These criteria rely on the notion of criticality defined for deep neural networks. In this work we describe a new way to diagnose (both theoretically and empirically) this criticality. To that end, we introduce partial Jacobians of a network, defined as derivatives of preactivations in layer l with respect to preactivations in layer l_0 < l. These quantities are particularly useful when the network architecture involves many different layers. We discuss various properties of the partial Jacobians such as their scaling with depth and relation to the neural tangent kernel (NTK). We derive the recurrence relations for the partial Jacobians and utilize them to analyze criticality of deep MLP networks with (and without) LayerNorm. We find that the normalization layer changes the optimal value.
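As an illustration of how such a diagnostic can be probed numerically, below is a minimal, self-contained sketch (not the authors' code) that measures the average squared Frobenius norm of a partial Jacobian for a randomly initialized deep MLP using JAX. The widths, depth, tanh activation, and initialization variances are arbitrary choices made here for illustration.

```python
# Minimal sketch: empirically measure partial Jacobian norms of a deep MLP.
# All hyperparameters below are illustrative assumptions, not the paper's setup.

import jax
import jax.numpy as jnp

def init_mlp(key, width, depth, sigma_w=1.0, sigma_b=0.0):
    """Draw weights W^l ~ N(0, sigma_w^2 / width) and biases b^l ~ N(0, sigma_b^2)."""
    params = []
    for _ in range(depth):
        key, wk, bk = jax.random.split(key, 3)
        W = sigma_w / jnp.sqrt(width) * jax.random.normal(wk, (width, width))
        b = sigma_b * jax.random.normal(bk, (width,))
        params.append((W, b))
    return params

def preactivations(params, h0):
    """Return preactivations h^1, ..., h^L, treating h0 as the layer-0 preactivation."""
    hs = []
    h = h0
    for W, b in params:
        h = W @ jnp.tanh(h) + b  # tanh used as an illustrative activation
        hs.append(h)
    return hs

def partial_jacobian_norm(params, h0, l0, l):
    """Average squared Frobenius norm (per output neuron) of d h^l / d h^{l0}."""
    def h_l_of_h_l0(h_l0):
        # rerun layers l0+1 .. l starting from the given h^{l0}
        h = h_l0
        for W, b in params[l0:l]:
            h = W @ jnp.tanh(h) + b
        return h
    hs = preactivations(params, h0)
    J = jax.jacfwd(h_l_of_h_l0)(hs[l0 - 1])  # shape (width, width)
    return jnp.sum(J ** 2) / J.shape[0]

key = jax.random.PRNGKey(0)
width, depth = 256, 20
params = init_mlp(key, width, depth, sigma_w=1.5, sigma_b=0.1)
h0 = jax.random.normal(jax.random.PRNGKey(1), (width,))

# At criticality this quantity should stay O(1) as l - l0 grows;
# away from criticality it grows or decays exponentially with depth.
for l in (5, 10, 20):
    print(l, partial_jacobian_norm(params, h0, l0=1, l=l))
```

Sweeping sigma_w and sigma_b and watching whether this norm stays O(1), explodes, or vanishes with depth is one simple way to locate the critical initialization empirically.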