# Paper in Machine Learning

## Large-Scale Machine Learning with Stochastic Gradient Descent [Leon,2010]

### Learning with gradient descent

The training set performance is given the empirical risk $E_n(f) = \frac1n \sum_{i=1}^n l(f(x_i),y_i)$ while the expected risk measures the generalization performance.

#### Gradient descent

• GD achieves linear convergence under sufficient regularity assumptions and when the initialization is close to optimal.
• 2GD achieves quadratic convergence using 2nd ordre gradient (Hessian)
• SGD: instead of computing the exact gradient, each iteration we use one single randomly picked example to estimate the gradient: the related convergence is considered to be $t^{-1}$.

### Learning with large training sets

#### The tradeoffs of large scale learning

• the sum of approximation error + estimation error + optimization error
• in large-scale setting, the first constrain is the max computing time: optimization and estimation can achieve better expected risk because we have more data!

#### Asymptotic analysis

• to reach a predefined expected risk: may be not efficient for training data but will be efficient for generalization in large-scale settings!

### Efficient learning

• 2nd order SGD is vary powerful, but as well computationally costly
• solution:
1. approximation of the inverse of Hessian using a computationally efficient way, e.g., SGDQN
2. ASDG: compute the average of classical SGD: when learning rate decrease slower than $t^{-1}$, it achieves optimal convergence rate, but this asymptotic can take long time, in practice people may use $\alpha_t = \alpha_0 (1 + \lambda \alpha_0 t)^{-1}$ for SGD and $\alpha_t = \alpha_0 (1 + \lambda \alpha_0 t)^{-0.75}$ for ASGD [Xu, 2010]

## Batch Normalization : Accelerating Deep Network Training by Reducing Internal Covariate Shift [2015]

### Abstract

• the distribution of each layer's input change during learning -> lower learning rate and sensitive to initialization + problem with saturating nonlinearities
• because of internat covariance shift and can be handled with batch normalization
• result in the same accuracy with less training steps

### Introduction

• SGD: training in steps. at each step we train with a mini-batch
• advantages:
1. the gradients calculated can be seen as an estimate of the real gradient over the whole training set
2. this computation can be paralysed using for example GPU
• however: SGD is sensible to learning rate and initialization : a small change to the network parameters amplify in deep nets
• this problem is seen as covariate shift, typically handleable via domain adaptation
• the idea is that, the distribution of the input of each layer should be fixed over time
• because it is connected to the saturation of nonlinear activation functions: often handled with
1. RELU
2. careful init
3. small learning rate
• in this paper: batch normalization to fix the mean and variance of the input of each layer, because:
1. help gradient converge
2. allow for higher learning rate
3. no need for Dropout
4. help use saturating nonlinearity

### Towards Reducing Internal Covariate Shift

• nets converge faster if its inputs are whitened
• but problem is we do it directly: reduce the effect of GD because we did not consider the normalization in GD
• and it is too computational expensive: we need to find a new ways out

### Normalization via Mini-Batch Statistics

• in this paper: use two simplifications
1. normalize each scale feature independently (expectation taking over the whole training set): but may change the data presentation in each layer, thus
2. introduce a pair of parameters $\gamma,\beta$ to scale and shift the normalized value: these parameters are learned through BP
• adapte to mini-batch setting: regularization may be necessary in case of singular matrices

## Dropout: A Simple Way to Prevent Neural Nets from Overfitting [2014]

• objective is to avoid overfitting
• to randomly drop units (alongs with their connections) during training
• is in fact also a model combination of exponentially many dif nets
• with a fixed probability $p$: should be close to $1$ for input units and to $0.5$ for others (hidden and output)

### More ideas

• DropConnect [2013] is said to behave better: link
Advertisements