Part 3: Optimization
Optimization is the process of finding the weights $W$ that minimize the loss function.
Gradient Descent
Imagine we are standing on a hill and want to find our way to the bottom. The gradient (the vector of partial derivatives) points in the direction of steepest ascent, so we step in the opposite direction, the negative gradient, to move toward the bottom.
```mermaid
graph TD;
A[Start] --> F[Compute loss L];
F --> B{Is L minimized?};
B -- Yes --> C[Stop];
B -- No --> D[Compute gradient dW];
D --> E[Update weights W'=W-dW*learning_rate];
E --> F;
```
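The loop in the flowchart can be written in a few lines of code. Below is a minimal sketch, assuming a simple least-squares loss $L(W) = \frac{1}{N}\lVert XW - y \rVert^2$ so that the gradient has a closed form; the names `loss_fn`, `grad_fn`, and `gradient_descent` are illustrative, not part of any particular library.

```python
import numpy as np

def loss_fn(W, X, y):
    # Mean squared error of the linear model X @ W against targets y.
    residual = X @ W - y
    return np.mean(residual ** 2)

def grad_fn(W, X, y):
    # Gradient of the mean squared error with respect to W.
    residual = X @ W - y
    return 2 * X.T @ residual / len(y)

def gradient_descent(X, y, learning_rate=0.1, tol=1e-6, max_steps=1000):
    W = np.zeros(X.shape[1])          # start from an arbitrary point
    for _ in range(max_steps):
        dW = grad_fn(W, X, y)         # compute gradient dW
        if np.linalg.norm(dW) < tol:  # "is L minimized?": stop when the gradient is tiny
            break
        W = W - learning_rate * dW    # step opposite to the gradient
    return W

# Usage on a tiny synthetic problem.
X = np.random.randn(100, 3)
true_W = np.array([1.0, -2.0, 0.5])
y = X @ true_W
W_hat = gradient_descent(X, y)
print(loss_fn(W_hat, X, y))           # should be close to 0
```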
Definition
Learning Rate:
A hyperparameter that controls how much to change the model in response to the estimated error each time the model weights are updated.
If the learning rate is too small, training converges slowly. If it is too large, the updates "overstep" the minimum and training can diverge.
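A small illustration of this trade-off (not from the notes) on the 1-D loss $L(w) = w^2$, whose gradient is $2w$:

```python
def run(learning_rate, steps=20, w=1.0):
    # Repeatedly apply the gradient descent update w <- w - lr * dL/dw.
    for _ in range(steps):
        w = w - learning_rate * 2 * w
    return w

print(run(0.01))   # too small: after 20 steps w is still far from 0 (slow)
print(run(0.4))    # reasonable: w is very close to 0
print(run(1.1))    # too large: |w| grows every step, the updates "overstep" and diverge
```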
SGD
If the training set is large, computing the loss $L$ and its gradient over the full dataset at every step is too expensive. Instead, we estimate the gradient on small batches of the training data. This is called mini-batch gradient descent.
Strictly speaking, SGD (stochastic gradient descent) is the extreme case where the batch size is $1$, although the term is often used for mini-batch gradient descent as well.
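A minimal sketch of mini-batch gradient descent, reusing the hypothetical `grad_fn` from the earlier example; setting `batch_size=1` recovers SGD in the strict sense.

```python
import numpy as np

def minibatch_sgd(X, y, learning_rate=0.1, batch_size=32, epochs=10):
    W = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        order = np.random.permutation(n)           # shuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # indices of the current mini-batch
            dW = grad_fn(W, X[idx], y[idx])        # gradient estimated on the batch only
            W = W - learning_rate * dW             # same update rule as full-batch descent
    return W
```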