arxiv.org/pdf/1512.03385.pdf
Top Highlights
When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by our experiments. Fig. 1 shows a typical example.
The formulation of F(x) + x can be realized by feedforward neural networks with “shortcut connections” (Fig. 2). Shortcut connections [2, 34, 49] are those skipping one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Fig. 2).
In this paper, we address the degradation problem by introducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping.
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously.
We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions.
we let the stacked nonlinear layers fit another mapping of F(x) := H(x) − x
Identity shortcut connections add neither extra parameter nor computational complexity. The entire network can still be trained end-to-end by SGD with backpropagation,
We initialize the weights as in [13] and train all plain/residual nets from scratch. We use SGD with a mini-batch size of 256.
Deep Residual Learning for Image Recognition
Is learning better networks as easy as stacking more layers?
Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping of F(x) := H(x) − x. The original mapping is recast into F(x) + x.
Our formulation always learns residual functions; our identity shortcuts are never closed, and all information is always passed through, with additional residual functions to be learned.
Rather than expect stacked layers to approximate H(x), we explicitly let these layers approximate a residual function F(x) := H(x) − x.
The degradation problem suggests that the solvers might have difficulties in approximating identity mappings by multiple nonlinear layers. With the residual learning reformulation, if identity mappings are optimal, the solvers may simply drive the weights of the multiple nonlinear layers toward zero to approach identity mappings.
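The zero-weight argument above can be checked numerically. The following is a minimal sketch, not the paper's implementation: it assumes a two-layer fully connected residual branch with ReLU and no biases, and the helper names (`matvec`, `relu`, `residual_branch`, `residual_block`, `plain_block`) are hypothetical, introduced only for illustration.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, a) for a in v]


def residual_branch(x, W1, W2):
    """F(x) = W2 * relu(W1 * x); biases omitted (an assumption of this sketch)."""
    return matvec(W2, relu(matvec(W1, x)))


def residual_block(x, W1, W2):
    """y = F(x) + x, shown without the final nonlinearity."""
    return [f + xi for f, xi in zip(residual_branch(x, W1, W2), x)]


def plain_block(x, W1, W2):
    """The same stacked layers without the shortcut: y = F(x)."""
    return residual_branch(x, W1, W2)


# With all weights at zero, the residual block passes x through unchanged,
# while the plain block collapses to the zero vector.
zeros = [[0.0] * 3 for _ in range(3)]
x = [1.5, -2.0, 0.25]
identity_out = residual_block(x, zeros, zeros)
plain_out = plain_block(x, zeros, zeros)
print(identity_out)
print(plain_out)
```

In this toy setting the residual block reaches the identity mapping at the all-zero weight setting, whereas the plain stack would have to learn non-trivial weights to approximate it.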
Formally, in this paper we consider a building block defined as: y = F(x, {Wi}) + x
The function F(x, {Wi}) represents the residual mapping to be learned.
The operation F + x is performed by a shortcut connection and element-wise addition
We adopt the second nonlinearity after the addition (i.e., σ(y), see Fig. 2).
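For the two-layer example of Fig. 2, the paper writes the residual branch as F = W2 σ(W1 x), with σ denoting ReLU and biases omitted. Putting this together with the addition and the second nonlinearity gives the following sketch; the symbol z for the block output is introduced here for illustration and is not the paper's notation.

```latex
\begin{aligned}
\mathcal{F}(x, \{W_i\}) &= W_2\,\sigma(W_1 x) \\
y &= \mathcal{F}(x, \{W_i\}) + x \\
z &= \sigma(y)
\end{aligned}
```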
We can fairly compare plain/residual networks that simultaneously have the same number of parameters, depth, width, and computational cost (except for the negligible element-wise addition)
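A rough count makes the comparison concrete. This sketch assumes fully connected layers of equal width rather than the convolutional layers used in the paper's experiments; `linear_params` and `block_cost` are hypothetical helpers, not functions from any library.

```python
def linear_params(d_in, d_out):
    """Weights plus biases of one fully connected layer."""
    return d_in * d_out + d_out


def block_cost(width, n_layers, residual):
    """Return (parameters, multiply-adds, shortcut additions) for one block."""
    params = n_layers * linear_params(width, width)
    mult_adds = n_layers * width * width
    # An identity shortcut has no weights; it costs one addition per element.
    shortcut_adds = width if residual else 0
    return params, mult_adds, shortcut_adds


plain = block_cost(64, 2, residual=False)
res = block_cost(64, 2, residual=True)
print(plain)  # (8320, 8192, 0)
print(res)    # (8320, 8192, 64)
```

Under these assumptions the two blocks have identical parameter and multiply-add counts, and the residual block differs only by 64 element-wise additions.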
The dimensions of x and F must be equal in Eqn.(1). If this is not the case (e.g., when changing the input/output channels), we can perform a linear projection Ws by the shortcut connections to match the dimensions: y = F(x, {Wi}) + Ws x
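A projection shortcut can be sketched in the same toy setting. The example assumes a two-layer fully connected residual branch without biases and a plain matrix for Ws; the function name `projection_block` and the specific weight values are hypothetical, chosen so the result is easy to verify by hand.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, a) for a in v]


def projection_block(x, W1, W2, Ws):
    """y = F(x) + Ws x, with F(x) = W2 relu(W1 x); biases omitted."""
    fx = matvec(W2, relu(matvec(W1, x)))
    shortcut = matvec(Ws, x)  # linear projection to the output dimension
    if len(fx) != len(shortcut):
        raise ValueError("F(x) and Ws x must have the same dimension")
    return [f + s for f, s in zip(fx, shortcut)]


# The branch maps a 2-dimensional input to 3 dimensions, so x cannot be
# added directly; Ws projects x from 2 to 3 dimensions first.
x = [1.0, 2.0]
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                   # 2 -> 3
W2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]    # 3 -> 3
Ws = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]                   # 2 -> 3
y = projection_block(x, W1, W2, Ws)
print(y)  # [2.0, 4.0, 3.0]
```

Unlike the identity shortcut, Ws introduces d_out × d_in extra weights, which is why it is reserved for the case where the dimensions differ.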