论文标题
神经网络有效地学习使用SGD的低维表示
Neural Networks Efficiently Learn Low-Dimensional Representations with SGD
论文作者
论文摘要
我们研究了使用随机梯度下降(SGD)训练两层神经网络(nn)的问题,其中输入$ \ boldsymbol {x} \ in \ in \ mathbb {r}^d $是高斯,而目标$ y \ in \ mathbb {r} $ in \ mathbb {r} $ contry protery-y protery-y proult protery-insex模型。 $ y = g(\ langle \ boldsymbol {u_1},\ boldsymbol {x} \ rangle,...,\ langle \ langle \ boldsymbol {u_k},\ boldsymbol {x} \ rangle)$带有噪声链接函数$ g $。我们证明,NN的第一层权重汇合到vectors $ \ boldsymbol {u_1},...,\ boldsymbol {u_k} $的$ k $二维主要子空间,当在线使用重量衰减时,\ boldsymbol {u_k} $。当$ k \ ll d $ $时,这种现象会产生一些重要的后果。首先,通过在这个较小的子空间上采用统一的收敛性,我们建立了$ O(\ sqrt {{kd}/{t}}}})$的概括误差,在$ t $ iterations $ t $ iterations sgd之后,该错误与NN的宽度无关。 We further demonstrate that, SGD-trained ReLU NNs can learn a single-index target of the form $y=f(\langle\boldsymbol{u},\boldsymbol{x}\rangle) + ε$ by recovering the principal direction, with a sample complexity linear in $d$ (up to log factors), where $f$ is a monotonic function with at most polynomial growth, and $ε$是噪音。这与已知的$ d^{ω(p)} $样本要求在内核方面学习任何程度$ p $多项式的要求,它表明,经过SGD训练的NNS可以在初始化时胜过神经切线内核。最后,我们还使用SGD产生的近似低级结构为NN提供了可压缩性保证。
We study the problem of training a two-layer neural network (NN) of arbitrary width using stochastic gradient descent (SGD) where the input $\boldsymbol{x}\in \mathbb{R}^d$ is Gaussian and the target $y \in \mathbb{R}$ follows a multiple-index model, i.e., $y=g(\langle\boldsymbol{u_1},\boldsymbol{x}\rangle,...,\langle\boldsymbol{u_k},\boldsymbol{x}\rangle)$ with a noisy link function $g$. We prove that the first-layer weights of the NN converge to the $k$-dimensional principal subspace spanned by the vectors $\boldsymbol{u_1},...,\boldsymbol{u_k}$ of the true model, when online SGD with weight decay is used for training. This phenomenon has several important consequences when $k \ll d$. First, by employing uniform convergence on this smaller subspace, we establish a generalization error bound of $O(\sqrt{{kd}/{T}})$ after $T$ iterations of SGD, which is independent of the width of the NN. We further demonstrate that, SGD-trained ReLU NNs can learn a single-index target of the form $y=f(\langle\boldsymbol{u},\boldsymbol{x}\rangle) + ε$ by recovering the principal direction, with a sample complexity linear in $d$ (up to log factors), where $f$ is a monotonic function with at most polynomial growth, and $ε$ is the noise. This is in contrast to the known $d^{Ω(p)}$ sample requirement to learn any degree $p$ polynomial in the kernel regime, and it shows that NNs trained with SGD can outperform the neural tangent kernel at initialization. Finally, we also provide compressibility guarantees for NNs using the approximate low-rank structure produced by SGD.
