Introduction to Gradient Decent in Neural Networks – Learn Machine Learning

In this lesson, we would examine the learning process in Neural Networks. Remember that a neural network is a classifier that could learn from a set of training data set.
The learning method or algorithm in Neural Networks is known as gradient-decent and that is what we would consider in this lesson.

Remember that the output of a neural network depends on the weights and biases of the input neurons.
So to begin, this weights and biases could be initialized to random values at the start of the network. If we do this and then give and input to the network, say a handwritten digit (in a grid of 28 by 28 pixels), the output would likely be wrong because the weights and biases are not tuned.
So Gradient Decent is a way of tuning this variables to enable the neural network make better decisions.

How Does it Work
We need to define a cost function. The cost function is a summation of the squares of the differences between each of the neuron activation output with the expected output. This is what is called the cost of a single training example.
This sum is smaller when the neuron output is closer to the expected output but large when the output is much different from the expected output.
The objecting of training a neural network is to minimize this cost function.
To get this done, we need to calculate the cost function for the total network and then minimize it.

In the formula we can also refer to it as error function which depends of the weights of the neurons. To emphasize that this cost function depends on both eh weights and the biases, we can rewrite the cost function with the formular below

C(w,b) to denote the cost function
x is the input vector
y(x) is the expected output
a is the output of the neuron

We can then rewrite the cost function  as shown in the figure below:

The objective of the gradient decent algorithm is to find the weights and biases that would make this cost function as small as possible.
To get this result, we need to take partial derivatives of the function that makes the function minimum. This is done by taking small steps in the negative direction (that is when the derivative is negative) until we get to the minimum value