Paper in Machine Learning

Large-Scale Machine Learning with Stochastic Gradient Descent [Leon,2010]


Learning with gradient descent

The training set performance is given the empirical risk E_n(f) = \frac1n \sum_{i=1}^n  l(f(x_i),y_i) while the expected risk measures the generalization performance.

Gradient descent

  • GD achieves linear convergence under sufficient regularity assumptions and when the initialization is close to optimal.
  • 2GD achieves quadratic convergence using 2nd ordre gradient (Hessian)
  • SGD: instead of computing the exact gradient, each iteration we use one single randomly picked example to estimate the gradient: the related convergence is considered to be t^{-1} .

Learning with large training sets

The tradeoffs of large scale learning

  • the sum of approximation error + estimation error + optimization error
  • in large-scale setting, the first constrain is the max computing time: optimization and estimation can achieve better expected risk because we have more data!

Asymptotic analysis

  • to reach a predefined expected risk: may be not efficient for training data but will be efficient for generalization in large-scale settings!

Efficient learning

  • 2nd order SGD is vary powerful, but as well computationally costly
  • solution:
    1. approximation of the inverse of Hessian using a computationally efficient way, e.g., SGDQN
    2. ASDG: compute the average of classical SGD: when learning rate decrease slower than t^{-1} , it achieves optimal convergence rate, but this asymptotic can take long time, in practice people may use \alpha_t = \alpha_0 (1 + \lambda \alpha_0 t)^{-1} for SGD and \alpha_t = \alpha_0 (1 + \lambda \alpha_0 t)^{-0.75} for ASGD [Xu, 2010]

Batch Normalization : Accelerating Deep Network Training by Reducing Internal Covariate Shift [2015]


  • the distribution of each layer's input change during learning -> lower learning rate and sensitive to initialization + problem with saturating nonlinearities
  • because of internat covariance shift and can be handled with batch normalization
  • result in the same accuracy with less training steps


  • SGD: training in steps. at each step we train with a mini-batch
  • advantages:
    1. the gradients calculated can be seen as an estimate of the real gradient over the whole training set
    2. this computation can be paralysed using for example GPU
  • however: SGD is sensible to learning rate and initialization : a small change to the network parameters amplify in deep nets
  • this problem is seen as covariate shift, typically handleable via domain adaptation
  • the idea is that, the distribution of the input of each layer should be fixed over time
  • because it is connected to the saturation of nonlinear activation functions: often handled with
    1. RELU
    2. careful init
    3. small learning rate
  • in this paper: batch normalization to fix the mean and variance of the input of each layer, because:
    1. help gradient converge
    2. allow for higher learning rate
    3. no need for Dropout
    4. help use saturating nonlinearity

Towards Reducing Internal Covariate Shift

  • nets converge faster if its inputs are whitened
  • but problem is we do it directly: reduce the effect of GD because we did not consider the normalization in GD
  • and it is too computational expensive: we need to find a new ways out

Normalization via Mini-Batch Statistics

  • in this paper: use two simplifications
    1. normalize each scale feature independently (expectation taking over the whole training set): but may change the data presentation in each layer, thus
    2. introduce a pair of parameters \gamma,\beta to scale and shift the normalized value: these parameters are learned through BP
  • adapte to mini-batch setting: regularization may be necessary in case of singular matrices

Dropout: A Simple Way to Prevent Neural Nets from Overfitting [2014]

  • objective is to avoid overfitting
  • to randomly drop units (alongs with their connections) during training
  • is in fact also a model combination of exponentially many dif nets
  • with a fixed probability p : should be close to 1 for input units and to 0.5 for others (hidden and output)

More ideas

  • DropConnect [2013] is said to behave better: link

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s