Sunday 30 December 2018

[Research] Gradient Descent, Stochastic Gradient Descent and Mini-batch Gradient Descent

1. Introduction

In an earlier post, Error BackPropagation, I introduced the details of that algorithm. That post described the standard backpropagation algorithm, whose corresponding optimization method is called gradient descent (GD). Its defining feature is that all examples are fed in at once: when optimizing, every example is used to update the parameters (weights and biases). This method has both merits and demerits.
There is another type of backpropagation that feeds in one example at a time; the corresponding optimization method is called stochastic gradient descent (SGD). This method also has its own merits and demerits.
What are the merits and demerits of these two methods, and what is the difference between them? Let's find out.

2. Background 

2.1 Input and Output

Before we can use a model to perform a task, we need to complete the training phase. The essence of this phase is to compute the values of the model's parameters. In supervised learning, for every input we know its ground truth, which is the desired output. Generally, the input is a multi-dimensional vector, e.g. X, and the output is also a multi-dimensional vector, e.g. Y. Each pair (X, Y) forms one example of the training set.
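For a concrete picture, here is a tiny toy training set in Python (the numbers are made up purely for illustration): each row of X is one input vector and the matching entry of Y is its ground-truth output.

import numpy as np

# A toy supervised training set (hypothetical values, for illustration only).
# Each row of X is one input vector; the matching entry of Y is its ground truth.
X = np.array([[ 0.5,  1.2, -0.3],
              [ 1.0, -0.7,  2.1],
              [-1.5,  0.4,  0.8]])   # 3 examples, 3 features each
Y = np.array([1.0, 0.0, 1.0])        # one target per example

# The training set is the collection of (X[i], Y[i]) pairs.
for x_i, y_i in zip(X, Y):
    print(x_i, "->", y_i)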

2.2 Gradient Descent and Random Gradient Descent

An intuitive difference between these two forms of backpropagation is shown below:
Figure 2.1 One example as the input
Figure 2.2 All examples as the input
In the all-examples case, every example shares the same model parameters, and the loss function is the sum of the per-example loss functions. All examples are used to compute the gradient, so the parameters receive one unified gradient for the whole training set. We call this gradient descent (GD).
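As a minimal sketch, assuming a simple linear model with squared-error loss (gd_step, W, b and lr are just illustrative names), one gradient-descent update could look like this:

import numpy as np

def gd_step(W, b, X, Y, lr=0.01):
    # One full gradient-descent update: the gradient is averaged over ALL examples,
    # so the whole training set shares a single, unified update direction.
    err = X @ W + b - Y            # residual for every example
    grad_W = X.T @ err / len(X)    # one gradient for the entire set
    grad_b = err.mean()
    return W - lr * grad_W, b - lr * grad_b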

In the one-example case, however, each example sees a different state of the parameters, because the parameters are updated after every example, each update building on the previous one. The gradient is computed from the current single example only. Since every example has a different input and output, each example produces a different gradient, so successive inputs yield successive, different gradients. We call this form stochastic gradient descent (SGD).
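Under the same assumptions (a linear model with squared-error loss, illustrative names), one pass of stochastic gradient descent can be sketched as:

import numpy as np

def sgd_epoch(W, b, X, Y, lr=0.01):
    # One pass of stochastic gradient descent: the parameters are updated after EACH
    # example, so every example sees the weights left behind by the previous update.
    for i in np.random.permutation(len(X)):   # visit the examples in a random order
        err = X[i] @ W + b - Y[i]             # error on this single example
        W = W - lr * err * X[i]               # gradient from this example only
        b = b - lr * err
    return W, b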

2.3 Comparison

The features of gradient descent are:
  • All examples share one unified gradient, so the optimization is fast in terms of the number of updates needed.
  • From the feature perspective, the full set of examples covers all the features in the data, so the model can learn better (it is more likely to reach the global extremum).
  • However, since all examples are fed in at once, the memory and computation requirements are high.
The features of stochastic gradient descent are:
  • Inputting one example at a time means a very light computational load per update.
  • Not every example is needed to fit the parameters, since the parameters are updated successively and may already be good enough before all examples have been used.
  • Stochastic gradient descent has no unified gradient, so the optimization is noisy and slower, and it is less likely to reach the global extremum, as shown below.
Figure 2.3

3. Mini-batch gradient descent

Since neither standard gradient descent nor stochastic gradient descent fully meets our requirements, mini-batch gradient descent is used as a compromise.
Figure 3.1 Mini-batch gradient descent
From the figure above, we can see that mini-batch gradient descent sits between standard gradient descent and stochastic gradient descent. When the batch size is 1, mini-batch gradient descent reduces to stochastic gradient descent; when the batch size equals the number of examples, it reduces to standard gradient descent.
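Continuing the same illustrative sketch (linear model, squared-error loss, hypothetical names), one epoch of mini-batch gradient descent could look like the code below; setting batch_size to 1 or to len(X) recovers the two extremes described above.

import numpy as np

def minibatch_epoch(W, b, X, Y, batch_size, lr=0.01):
    # One epoch of mini-batch gradient descent: the gradient is averaged over each
    # batch, and the parameters are updated once per batch.
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        err = X[batch] @ W + b - Y[batch]
        W = W - lr * (X[batch].T @ err) / len(batch)
        b = b - lr * err.mean()
    return W, b

# batch_size = 1       -> behaves like stochastic gradient descent
# batch_size = len(X)  -> behaves like standard (full-batch) gradient descent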

3.1 Iteration, Epoch, Batchsize

When using mini-batch gradient descent, each batch is treated as one input to train the model. There are three variables we need to understand.
  • Batchsize: the number of examples in each batch. The batch number (batchnumber) therefore equals the total number of examples divided by the batchsize.
  • Epoch: one epoch is one pass in which every batch is used. Once batchnumber batches have been used to train the parameters, the epoch count increases by 1.
  • Iteration: since the parameters are updated once per batch, the iteration count is the number of parameter updates.
Take an example: there are 49000 examples and we want to train on all of them 10 times, so epoch = 10. We set the batch size to 100, so batchsize = 100 and batchnumber = 49000 / 100 = 490. To finish 10 passes over all examples, the total number of iterations is iteration = batchnumber * epoch = 4900. Clearly, once we set the epoch and the batchsize, the number of iterations is determined.
Since these variables are set before training, they are not trainable.
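The arithmetic of the example above can be checked in a couple of lines of plain Python:

num_examples = 49000
epoch = 10
batchsize = 100

batchnumber = num_examples // batchsize   # 49000 / 100 = 490 batches per epoch
iteration = batchnumber * epoch           # 490 * 10 = 4900 parameter updates
print(batchnumber, iteration)             # 490 4900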

3.2 Learning Rate

The epoch determines how many times the full set of examples is used for training. Mini-batch gradient descent updates the parameters successively, and as training proceeds there is less and less left for the model to learn, so the learning rate should be attenuated over time. Thus, we need to set an appropriate learning rate for each stage of training.
Generally, we use an exponential decay function to set the learning rate.
Figure 3.2 decay function
For this function, we need to initialize the learning rate, the decay coefficient, and the number of batches. Iteration here means the current number of parameter updates. If we truncate iteration/batchnumber to an integer, its value only changes once every batchnumber iterations. An approximate graph of the decay function is:
Figure 3.3 decay function graph
Here, global_step means the current iteration count and decay_steps means the batchnumber.
The staircase learning rate uses the integer part of global_step/decay_steps, while the continuous learning rate uses the value of global_step/decay_steps directly.
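As a small sketch, assuming the usual exponential-decay form (the one implemented by TensorFlow's tf.train.exponential_decay), i.e. learning_rate = initial_lr * decay_rate ** (global_step / decay_steps), the staircase and continuous variants differ only in whether the exponent is truncated to an integer:

def exponential_decay(initial_lr, decay_rate, global_step, decay_steps, staircase=True):
    # Exponentially decayed learning rate. With staircase=True the exponent is the
    # integer part of global_step / decay_steps, so the rate drops once every
    # decay_steps iterations; otherwise it decays continuously.
    exponent = global_step // decay_steps if staircase else global_step / decay_steps
    return initial_lr * decay_rate ** exponent

# e.g. with decay_steps = batchnumber = 490, the staircase rate drops once per epoch:
# exponential_decay(0.01, 0.96, global_step=980, decay_steps=490)  ->  0.01 * 0.96**2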

Generally, the initial learning rate is very small, e.g. 0.01.

4. Conclusion

Having a correct understanding of gradient descent is very important, especially when you implement a neural network. The three types of gradient descent differ in how they choose the examples used for each update. Although the batch size has a large effect on the performance of the model, its value is empirical, and we need to tune it for each specific dataset.
