Paper in Machine Learning

Large-Scale Machine Learning with Stochastic Gradient Descent [Leon,2010]

Introduction

The training set performance is given the empirical risk $E_n(f) = \frac1n \sum_{i=1}^n l(f(x_i),y_i)$ while the expected risk measures the generalization performance.

• GD achieves linear convergence under sufficient regularity assumptions and when the initialization is close to optimal.
• SGD: instead of computing the exact gradient, each iteration we use one single randomly picked example to estimate the gradient: the related convergence is considered to be $t^{-1}$.

Learning with large training sets

The tradeoffs of large scale learning

• the sum of approximation error + estimation error + optimization error
• in large-scale setting, the first constrain is the max computing time: optimization and estimation can achieve better expected risk because we have more data!

Asymptotic analysis

• to reach a predefined expected risk: may be not efficient for training data but will be efficient for generalization in large-scale settings!

Efficient learning

• 2nd order SGD is vary powerful, but as well computationally costly
• solution:
1. approximation of the inverse of Hessian using a computationally efficient way, e.g., SGDQN
2. ASDG: compute the average of classical SGD: when learning rate decrease slower than $t^{-1}$, it achieves optimal convergence rate, but this asymptotic can take long time, in practice people may use $\alpha_t = \alpha_0 (1 + \lambda \alpha_0 t)^{-1}$ for SGD and $\alpha_t = \alpha_0 (1 + \lambda \alpha_0 t)^{-0.75}$ for ASGD [Xu, 2010]

Batch Normalization : Accelerating Deep Network Training by Reducing Internal Covariate Shift 

Abstract

• the distribution of each layer's input change during learning -> lower learning rate and sensitive to initialization + problem with saturating nonlinearities
• because of internat covariance shift and can be handled with batch normalization
• result in the same accuracy with less training steps

Introduction

• SGD: training in steps. at each step we train with a mini-batch
1. the gradients calculated can be seen as an estimate of the real gradient over the whole training set
2. this computation can be paralysed using for example GPU
• however: SGD is sensible to learning rate and initialization : a small change to the network parameters amplify in deep nets
• this problem is seen as covariate shift, typically handleable via domain adaptation
• the idea is that, the distribution of the input of each layer should be fixed over time
• because it is connected to the saturation of nonlinear activation functions: often handled with
1. RELU
2. careful init
3. small learning rate
• in this paper: batch normalization to fix the mean and variance of the input of each layer, because:
2. allow for higher learning rate
3. no need for Dropout
4. help use saturating nonlinearity

Towards Reducing Internal Covariate Shift

• nets converge faster if its inputs are whitened
• but problem is we do it directly: reduce the effect of GD because we did not consider the normalization in GD
• and it is too computational expensive: we need to find a new ways out

Normalization via Mini-Batch Statistics

• in this paper: use two simplifications
1. normalize each scale feature independently (expectation taking over the whole training set): but may change the data presentation in each layer, thus
2. introduce a pair of parameters $\gamma,\beta$ to scale and shift the normalized value: these parameters are learned through BP
• adapte to mini-batch setting: regularization may be necessary in case of singular matrices

Dropout: A Simple Way to Prevent Neural Nets from Overfitting 

• objective is to avoid overfitting
• to randomly drop units (alongs with their connections) during training
• is in fact also a model combination of exponentially many dif nets
• with a fixed probability $p$: should be close to $1$ for input units and to $0.5$ for others (hidden and output)

More ideas

• DropConnect  is said to behave better: link