1. Introduction
In a previous post, Error BackPropagation, I introduced the details of that algorithm. That post showed the standard backpropagation algorithm, whose corresponding optimization method is called gradient descent (GD). The feature of standard backpropagation is that all examples are fed in at once: when optimizing, every example is used to compute a single update of the parameters (weights and biases). This method has both merits and demerits. There is another style of backpropagation that feeds in one example at a time; its corresponding optimization method is called stochastic gradient descent (SGD). This method, too, has its own merits and demerits.
What are the merits and demerits of these two methods, and what is the difference between them? Let's find out.
2. Background
2.1 Input and Output
Before we can use a model to perform tasks, we need to complete the training phase. The essence of this phase is to compute the values of the model's parameters. Under supervised learning, for each input we know its ground-truth output. Generally, the input is a multi-dimensional vector, e.g. X, and the output is also a multi-dimensional vector, e.g. Y. Each pair (X, Y) forms one example of the training set.
2.2 Gradient Descent and Stochastic Gradient Descent
An intuitive difference between these two forms of BP is shown below:
Figure 2.1 one example as the input
Figure 2.2 all examples as the input
In the all-examples form, one gradient is computed over the whole training set, so each update uses a single, unified gradient. However, in the one-example-at-a-time form, the parameters are in a different state for each example, because they are updated after every example based on the previous update. The gradient is computed from the current single example, and since each example has a different input and output, each example yields a different gradient. Thus, successive inputs produce successive gradients that differ from one another. This form is called stochastic gradient descent (SGD).
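To make the contrast concrete, here is a minimal sketch, assuming a toy linear model with a squared-error loss; the data, the `gradient` helper, and the learning rate are my own illustrations, not from the original post:

```python
import numpy as np

def gradient(w, X, Y):
    """Gradient of the mean squared error 0.5*mean((X @ w - Y)**2) w.r.t. w."""
    return X.T @ (X @ w - Y) / len(Y)

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(8, 3)), rng.normal(size=8)   # 8 examples, 3 features
lr = 0.1

# Gradient descent: every update uses one unified gradient over ALL examples.
w_gd = np.zeros(3)
for step in range(100):
    w_gd -= lr * gradient(w_gd, X, Y)

# Stochastic gradient descent: one example per update, so each update sees a
# different gradient and the parameters differ from example to example.
w_sgd = np.zeros(3)
for step in range(100):
    i = rng.integers(len(Y))                         # pick a single example
    w_sgd -= lr * gradient(w_sgd, X[i:i+1], Y[i:i+1])
```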
2.3 Comparison
The features of gradient descent are:
- All examples share a single, unified gradient, so the optimization converges quickly.
- From the feature perspective, the full set of examples covers all the features, so the model can learn better (it is more likely to reach the global extremum).
- However, because all examples are input at once, the memory and computational requirements are high.
The features of stochastic gradient descent are:
- Inputting one example at a time imposes a very light computational load.
- Not every example is needed to obtain a usable set of parameters, since the parameters are updated after every example.
- Stochastic gradient descent has no unified gradient, so optimization is slower and less likely to reach the global extremum, as shown below.
Figure 2.3
3. Batch gradient descent
Since neither standard gradient descent nor stochastic gradient descent fully meets the requirements, batch gradient descent (mini-batch) is used instead.
Figure 3.1 Batch gradient descent
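As a rough sketch of the idea, a mini-batch update computes one gradient over a small slice of the training set and updates the parameters once per slice; the batch size, data, and `batch_gradient` helper below are assumptions for illustration, not part of the original post.

```python
import numpy as np

def batch_gradient(w, X_batch, Y_batch):
    """Mean squared-error gradient over only the examples in the batch."""
    return X_batch.T @ (X_batch @ w - Y_batch) / len(Y_batch)

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(1000, 3)), rng.normal(size=1000)
w, lr, batch_size = np.zeros(3), 0.1, 100

# One pass over the data: each batch of 100 examples yields one gradient and
# one parameter update, a compromise between full-batch GD and per-example SGD.
for start in range(0, len(Y), batch_size):
    X_b, Y_b = X[start:start + batch_size], Y[start:start + batch_size]
    w -= lr * batch_gradient(w, X_b, Y_b)
```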
3.1 Iteration, Epoch, Batchsize
When using batch gradient descent, each batch is treated as one input for training the model. Here are three quantities we need to understand.
- Batchsize: the number of examples in each batch. The batch number (batchnumber) therefore equals the total number of examples divided by the batchsize.
- Epoch: one epoch is one pass in which every batch has been used. Once batchnumber batches have been used to train the parameters, the epoch count increases by 1.
- Iteration: since the parameters are updated successively, the iteration count is the number of parameter updates.
Take an example:
There are 49000 examples and we want to train on all of them 10 times, so epoch=10. We set the batch size to 100, so batchsize=100 and batchnumber=49000/100=490. To finish training all examples 10 times, the total number of iterations is iteration=batchnumber*epoch=4900. Note that we only set the epoch and batchsize; the number of iterations is then determined.
As these variables are set ahead of training, they are not trainable.
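A minimal sketch of how the three quantities relate, using the numbers from the example above (the variable names are mine):

```python
num_examples = 49000
batchsize = 100
epochs = 10

batchnumber = num_examples // batchsize   # 490 batches per epoch
iterations = 0                            # one iteration = one parameter update

for epoch in range(epochs):               # each epoch uses every batch once
    for batch in range(batchnumber):
        # ... one mini-batch gradient update would happen here ...
        iterations += 1

print(batchnumber, iterations)            # 490 4900 = batchnumber * epochs
```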
3.2 Learning Rate
Epoch determines how many times the full set of examples is used for training. Batch gradient descent updates the parameters successively, and the more passes over all examples that have been completed, the less the model still has to learn, so the learning rate should be attenuated. Thus, we need to set a learning rate for each epoch of training. Generally, an exponential decay function is used to set the learning rate.
Figure 3.2 decay function
Figure 3.3 decay function graph
The staircase learning rate uses the floored value ⌊global_step/decay_steps⌋ in the decay function, so the rate drops in discrete steps; the continuous learning rate uses the raw value of global_step/decay_steps, so it decays smoothly.
Generally, the learning rate is very small, e.g. 0.01.
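As a sketch of the rule described above, a common form of exponential decay is learning_rate = initial_lr * decay_rate^(global_step / decay_steps); the concrete numbers below (initial rate 0.01, decay rate 0.9, decay every 490 iterations) are assumptions for illustration:

```python
import math

def exponential_decay(initial_lr, global_step, decay_steps, decay_rate, staircase=False):
    """initial_lr * decay_rate ** (global_step / decay_steps), floored if staircase."""
    exponent = global_step / decay_steps
    if staircase:
        exponent = math.floor(exponent)   # drop the rate in discrete steps
    return initial_lr * decay_rate ** exponent

# Staircase vs. continuous decay over a few iteration counts.
for step in (0, 245, 490, 980):
    print(step,
          exponential_decay(0.01, step, 490, 0.9, staircase=True),
          exponential_decay(0.01, step, 490, 0.9, staircase=False))
```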