Dropout
Notes
- The convergence properties of dropout can be understood in terms of stochastic gradient descent.[1]
- The weights of the network will be larger than normal because of dropout.[2]
- Therefore, before finalizing the network, the weights are first scaled down by the probability of retaining a unit (one minus the dropout rate); equivalently, the outputs can be scaled down by the same factor at test time.[2]
- Alternatively, the kept activations can be scaled up during training. This is sometimes called “inverse dropout” and requires no rescaling of the weights at test time; a sketch of both conventions follows these notes.[2]
- If you just wanted an overview of dropout in neural networks, the above two sections of that article would be sufficient.[3]
- Now that we know a little about dropout and the motivation, let’s go into some detail.[3]
- In dropout, we randomly shut down some fraction of a layer’s neurons at each training step by zeroing out the neuron values.[4]
- The fraction of neurons to be zeroed out is known as the dropout rate.[4]
- The two images in that article show dropout applied to a layer of 6 units at multiple training steps.[4]
- The dropout rate is 1/3, and the 4 remaining neurons at each training step have their values scaled up by 1.5, i.e. by 1/(1 − 1/3), so that the layer’s expected output is unchanged; a worked check follows these notes.[4]
- In this paper we conduct an empirical study to investigate the effect of dropout and batch normalization on training deep learning models.[5]
- Section 3 systematically describes the depth calculation model based on adaptive dropout proposed in this paper.[6]
- Finally, the value of the dropout rate for each layer needs to be in the interval (0, 1); one way to enforce such a constraint is sketched after these notes.[6]
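
The notes from [2] describe two equivalent scaling conventions. Below is a minimal NumPy sketch of both, assuming a plain array of activations; the function name, shapes, and dropout rate are illustrative choices, not taken from the cited article.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, rate, training, inverted=True):
    """Dropout on activations x; `rate` is the probability of zeroing a unit."""
    if not training:
        # Standard dropout compensates at test time by scaling down with the
        # retention probability; inverted dropout needs no change here.
        return x if inverted else x * (1.0 - rate)
    mask = rng.random(x.shape) >= rate        # keep each unit with probability 1 - rate
    kept = x * mask
    # Inverted dropout rescales the surviving units during training instead.
    return kept / (1.0 - rate) if inverted else kept

x = rng.normal(size=(4, 6))                   # a toy batch: 4 examples, 6 units
train_out = dropout_forward(x, rate=1/3, training=True)
test_out = dropout_forward(x, rate=1/3, training=False)  # identity under inverted dropout
```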
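
As a quick check of the 6-unit example from [4], the snippet below simulates many training steps with a dropout rate of 1/3 and a 1.5× scale on the surviving units; the toy values are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

units = np.ones(6)           # a toy layer of 6 activations, all equal to 1.0
rate = 1 / 3                 # on average 2 of the 6 units are zeroed per step
scale = 1.0 / (1.0 - rate)   # = 1.5, applied to the surviving units

# Average the layer's summed output over many simulated training steps.
sums = []
for _ in range(10_000):
    mask = rng.random(units.shape) >= rate
    sums.append((units * mask * scale).sum())

print(np.mean(sums))         # ≈ 6.0, matching the layer's sum without dropout
```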
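
The note from [6] only states that each layer’s dropout rate must lie in (0, 1). One common way to enforce such a constraint is to pass an unconstrained parameter through a sigmoid, sketched below; this is an illustrative assumption, not the depth-calculation model actually proposed in that paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical unconstrained per-layer parameters; the adaptive-dropout model
# in [6] computes its rates differently -- this only shows how a per-layer
# dropout rate can be kept strictly inside the open interval (0, 1).
raw_params = np.array([-1.2, 0.0, 0.8, 2.5])
layer_rates = sigmoid(raw_params)   # every value lies strictly between 0 and 1
print(layer_rates)                  # approximately [0.23, 0.50, 0.69, 0.92]
```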
Sources
- [1] The dropout learning algorithm
- [2] A Gentle Introduction to Dropout for Regularizing Deep Neural Networks
- [3] Dropout in (Deep) Machine learning
- [4] Dropout in Neural Networks
- [5] Dropout vs. batch normalization: an empirical study of their impact to deep learning
- [6] Medical Image Segmentation Algorithm Based on Optimized Convolutional Neural Network-Adaptive Dropout Depth Calculation