- Large-Scale Machine Learning with Stochastic Gradient Descent [Bottou, 2010]
- Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift [2015]
- Dropout: A Simple Way to Prevent Neural Networks from Overfitting [2014]

## Large-Scale Machine Learning with Stochastic Gradient Descent [Bottou, 2010]

### Introduction

### Learning with gradient descent

The training set performance is measured by the *empirical risk*, while the *expected risk* measures the generalization performance.
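
In symbols (notation assumed here, following the usual setup with loss $\ell$, prediction function $f$, and data distribution $P(x, y)$):

$$
E_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i), y_i\big) \qquad \text{(empirical risk)}
$$

$$
E(f) = \int \ell\big(f(x), y\big)\, dP(x, y) \qquad \text{(expected risk)}
$$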

#### Gradient descent

- GD achieves *linear convergence* under sufficient regularity assumptions and when the initialization is close to the optimum.
- 2GD (second-order gradient descent) achieves *quadratic convergence* by using second-order information (the Hessian).
- SGD: instead of computing the exact gradient, at each iteration we use **one** single randomly picked example to estimate the gradient; the resulting convergence is much slower (sublinear: under standard conditions the residual error decreases roughly like $1/t$). A small sketch contrasting GD and SGD steps follows this list.
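
A minimal sketch of the contrast between a GD step (exact gradient over all examples) and an SGD step (one randomly picked example), assuming a least-squares linear model; the function names, data, and step sizes are illustrative, not from the paper:

```python
import numpy as np

def gd_step(w, X, y, eta):
    # exact gradient of the mean squared error over the whole training set
    grad = X.T @ (X @ w - y) / len(y)
    return w - eta * grad

def sgd_step(w, X, y, eta, rng):
    # noisy gradient estimated from one randomly picked example
    i = rng.integers(len(y))
    grad = (X[i] @ w - y[i]) * X[i]
    return w - eta * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=500)

w_gd, w_sgd = np.zeros(3), np.zeros(3)
for t in range(1000):
    w_gd = gd_step(w_gd, X, y, eta=0.1)
    w_sgd = sgd_step(w_sgd, X, y, eta=0.01, rng=rng)
print(w_gd, w_sgd)  # both approach w_true; SGD is noisier per step but far cheaper
```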

### Learning with large training sets

#### The tradeoffs of large scale learning

- the excess error is the sum of the *approximation error* + *estimation error* + *optimization error* (the decomposition is written out below)
- in the large-scale setting, the binding constraint is the **maximum computing time**: optimization and estimation can reach a better expected risk because we can process **more data** within the budget!
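
The decomposition referred to above, in the notation of Bottou and Bousquet (symbols are assumptions here): $f^*$ is the best possible predictor, $f^*_{\mathcal{F}}$ the best predictor in the chosen family $\mathcal{F}$, $f_n$ the empirical-risk minimizer, and $\tilde{f}_n$ the approximate solution actually returned by the optimizer:

$$
\mathbb{E}\big[E(\tilde{f}_n) - E(f^*)\big]
= \underbrace{\mathbb{E}\big[E(f^*_{\mathcal{F}}) - E(f^*)\big]}_{\varepsilon_{\mathrm{app}}}
+ \underbrace{\mathbb{E}\big[E(f_n) - E(f^*_{\mathcal{F}})\big]}_{\varepsilon_{\mathrm{est}}}
+ \underbrace{\mathbb{E}\big[E(\tilde{f}_n) - E(f_n)\big]}_{\varepsilon_{\mathrm{opt}}}
$$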

#### Asymptotic analysis

- to reach a predefined expected risk, SGD may not be the most efficient option on the *training data* (it decreases the empirical risk slowly), but it is the most efficient for *generalization* in large-scale settings!

### Efficient learning

- second-order SGD (2SGD) is very powerful, but also computationally costly
- solutions:
  - approximate the inverse of the Hessian in a computationally efficient way, e.g., SGDQN
  - ASGD: average the iterates of classical SGD; when the learning rate decreases slower than $t^{-1}$ it achieves the optimal asymptotic convergence rate, but this asymptotic regime can take a long time to kick in; in practice one may use $\gamma_t = \gamma_0 (1 + \gamma_0 \lambda t)^{-1}$ for SGD and $\gamma_t = \gamma_0 (1 + \gamma_0 \lambda t)^{-3/4}$ for ASGD [Xu, 2010] (a small averaging sketch follows below)
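
A minimal sketch of the averaging idea behind ASGD, assuming an L2-regularized least-squares objective and the learning-rate schedule quoted above; the data, function, and hyperparameter names are illustrative assumptions:

```python
import numpy as np

def asgd(X, y, lam=1e-4, gamma0=0.1, epochs=5, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)        # current SGD iterate
    w_bar = np.zeros(d)    # running average of the iterates (the ASGD estimate)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            gamma = gamma0 * (1 + gamma0 * lam * t) ** (-0.75)  # decreases slower than 1/t
            grad = (X[i] @ w - y[i]) * X[i] + lam * w           # per-example gradient
            w = w - gamma * grad                                # plain SGD step
            w_bar += (w - w_bar) / t                            # incremental average
    return w_bar

# Toy usage: recover a random linear model from noisy observations.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.01 * rng.normal(size=1000)
print(asgd(X, y), w_true)
```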

## Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift [2015]

### Abstract

- the distribution of each layer's inputs changes during training -> requires lower learning rates and careful initialization, and makes saturating nonlinearities problematic
- this is due to *internal covariate shift* and can be handled with *batch normalization*
- result: the same accuracy with far fewer training steps

### Introduction

- SGD: training proceeds in steps; at each step we train with a *mini-batch*
- advantages:
  - the mini-batch gradient can be seen as an estimate of the true gradient over the whole training set (see the expression below)
  - the computation over a mini-batch can be parallelized, for example on a GPU
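
The estimate referred to above, written out (notation is assumed here, with loss $\ell$, parameters $\Theta$, mini-batch size $m$, and training-set size $N$); its quality improves as $m$ grows:

$$
\frac{1}{m} \sum_{i=1}^{m} \frac{\partial \ell(x_i, \Theta)}{\partial \Theta}
\;\approx\;
\frac{1}{N} \sum_{i=1}^{N} \frac{\partial \ell(x_i, \Theta)}{\partial \Theta}
$$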

- however: SGD is sensitive to the **learning rate** and the **initialization**: small changes to the network parameters get amplified as they propagate through deep nets
- this problem can be viewed as *covariate shift*, typically handled via *domain adaptation*
- the idea is that the distribution of each layer's inputs should stay *fixed* over time
- it is also connected to the saturation of nonlinear activation functions, often handled with:
  - ReLU
  - careful initialization
  - small learning rates

- in this paper: **batch normalization** fixes the mean and variance of each layer's inputs, which:
  - helps the gradients converge
  - allows higher learning rates
  - reduces the need for *Dropout*
  - makes saturating nonlinearities usable

### Towards Reducing Internal Covariate Shift

- networks converge faster if their inputs are **whitened**
- but if we whiten directly (outside the gradient computation), gradient descent does not take the normalization into account, so the effect of the gradient step is reduced or even cancelled
- full whitening is also too computationally expensive: we need another way

### Normalization via Mini-Batch Statistics

- this paper uses two simplifications:
  - normalize each scalar feature independently (with the expectation taken over the whole training set): but this may change what each layer can represent, thus
  - introduce a pair of parameters $\gamma, \beta$ that scale and shift the normalized value; these parameters are learned through backpropagation

- adapt to the mini-batch setting: each mini-batch provides its own mean/variance estimates; regularization may be necessary in case of singular (covariance) matrices (a sketch of the transform follows below)
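
A minimal sketch of the resulting batch-normalizing transform at training time, assuming activations of shape (batch, features); the small `eps` plays the regularization role mentioned above, and all names are illustrative, not the paper's code:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: mini-batch of activations, shape (batch_size, num_features)."""
    mu = x.mean(axis=0)                     # per-feature mini-batch mean
    var = x.var(axis=0)                     # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize (eps avoids division by zero)
    return gamma * x_hat + beta             # learned scale and shift

# Toy usage: normalize a random mini-batch of 32 examples with 4 features.
x = np.random.randn(32, 4) * 5.0 + 3.0
gamma, beta = np.ones(4), np.zeros(4)
out = batch_norm_forward(x, gamma, beta)
print(out.mean(axis=0), out.std(axis=0))    # roughly 0 and 1 per feature
```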

## Dropout: A Simple Way to Prevent Neural Networks from Overfitting [2014]

- the objective is to avoid **overfitting**: randomly drop units (along with their connections) during training
- dropout is in fact also a **model combination** of exponentially many different thinned networks
- each unit is retained with a **fixed** probability $p$: $p$ should be close to 1 for input units and around 0.5 for the other (hidden) units (a small sketch follows below)
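
A minimal sketch of dropout on one layer's activations, assuming $p$ denotes the retention probability as in the paper; at test time activations are scaled by $p$, which mirrors the paper's weight-scaling rule. The names (`activations`, `p`, `rng`) are illustrative assumptions:

```python
import numpy as np

def dropout(activations, p, train=True, rng=np.random.default_rng()):
    if train:
        mask = rng.random(activations.shape) < p  # keep each unit with probability p
        return activations * mask
    return activations * p                        # test time: scale instead of dropping

# Toy usage: a hidden layer with p = 0.5; an input layer would use p closer to 1.
h = np.ones((2, 6))
print(dropout(h, p=0.5, train=True))   # roughly half of the units zeroed
print(dropout(h, p=0.5, train=False))  # all units kept, scaled by 0.5
```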

### More ideas

- DropConnect [2013] is said to behave better: link
