
coursera_week2_4_Gradient Descent

by raphael3 2017. 10. 4.
We have looked so far at the logistic regression model, its loss function, and its cost function.
Now let's look at gradient descent, the algorithm for training the parameters w and b on the training set.

It is reasonable to look for the w and b that minimize the cost function J, since the cost function itself is defined as a function of w and b.

The cost function J, which is a function of your parameters w and b, is defined as the average, that is, 1 over m times the sum, of the loss function. The loss function measures how well your algorithm's output y-hat(i) on each training example stacks up against, or compares to, the ground-truth label y(i) on that example. (The full formula is expanded out on the right of the slide.) So the cost function measures how well your parameters w and b are doing on the whole training set.
In order to learn the parameters w and b, it seems natural to look for the w and b that make the cost function J(w, b) as small as possible.
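As a concrete reference, the cost described above can be computed like this. This is a minimal NumPy sketch; the function name, variable names, and array shapes are my own choices, following the lecture's notation:

```python
import numpy as np

def cost(w, b, X, Y):
    """Cross-entropy cost J(w, b), averaged over the m training examples.

    X has shape (n_features, m); Y has shape (1, m) with labels 0 or 1.
    """
    m = X.shape[1]
    # y-hat(i) = sigmoid(w^T x(i) + b), computed for all examples at once
    Y_hat = 1.0 / (1.0 + np.exp(-(np.dot(w.T, X) + b)))
    # J = -(1/m) * sum over i of [ y log(y-hat) + (1 - y) log(1 - y-hat) ]
    return float(-np.sum(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat)) / m)
```

For example, with w and b initialized to zero every y-hat is 0.5, so the cost is log 2 ≈ 0.693 regardless of the labels.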

Here's an illustration of gradient descent. In this diagram the horizontal axes represent your parameters w and b. In practice w can be much higher-dimensional, but for the purposes of plotting, let's illustrate both w and b as single real numbers.

The cost function J(w, b) is a surface above these horizontal axes w and b, and the height of the surface represents the value of J(w, b) at a given point. What we want to do is find the values of w and b at this red point (the lowest point on the surface), which corresponds to the minimum of the cost function J.
It turns out that this cost function J is convex: it's just a single big bowl. This is opposed to functions that look like this, which are non-convex and have lots of different local optima. The fact that our cost function J(w, b) as defined here is convex is one of the big reasons why we use this particular cost function for logistic regression.

To find a good value for the parameters, we initialize w and b to some initial value, denoted by that little red dot (the one near the top of the surface). For logistic regression almost any initialization method works; usually you initialize the values to zero. Random initialization also works, but people don't usually do that for logistic regression. Because this function is convex, no matter where you initialize, you should get to the same point or roughly the same point. What gradient descent does is start at that initial point and take a step in the steepest downhill direction.

After one step of gradient descent you might end up there, because it's trying to take a step downhill in the direction of steepest descent or as quickly downhill as possible. That is one iteration of gradient descent. 

And after two iterations of gradient descent you might step there, three iterations and so on. I guess this is now hidden by the back of the plot until eventually, hopefully you converge to this global optimum or get to something close to the global optimum. This picture illustrates the gradient descent algorithm. 

Let's write out a bit more of the details. For the purpose of illustration, let's say that there is some function J(w) that you want to minimize, and maybe that function looks like this. To make this easier to draw, I'm going to ignore b for now, so that this is a one-dimensional plot instead of a high-dimensional plot.

Gradient descent does this: we repeatedly carry out the following update. We take the value of w and update it, using ':=' (colon equals) to denote updating. So we set w := w - alpha * dJ(w)/dw, where the last factor is the derivative of J with respect to w, and we repeat this until the algorithm converges.

A couple of points on the notation: alpha here is the learning rate, and it controls how big a step we take on each iteration of gradient descent. We will talk later about some ways of choosing the learning rate alpha.

And second, this quantity here is a derivative. This is basically the update, or the change, you want to make to the parameter w.

(Optional paragraph) When we start writing code to implement gradient descent, we will use the convention that the variable name dw in our code represents this derivative term. So when you write code you write something like w := w - alpha*dw, where dw is the variable that holds the derivative term.
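The repeated update described above can be sketched in code like this. Here dJ_dw is a hypothetical helper standing in for however you compute the derivative; the fixed iteration count is a simplification of "repeat until convergence":

```python
def gradient_descent_1d(dJ_dw, w0, alpha=0.1, num_iters=100):
    """Repeatedly apply w := w - alpha * dw, starting from w0.

    dJ_dw: a function returning the derivative dJ(w)/dw at a given w.
    alpha: the learning rate, controlling the step size per iteration.
    """
    w = w0
    for _ in range(num_iters):
        dw = dJ_dw(w)       # the derivative term, named "dw" as in the lecture
        w = w - alpha * dw  # the gradient descent update
    return w
```

For instance, minimizing J(w) = (w - 3)^2, whose derivative is 2(w - 3), from a starting point of w = 0 converges to w ≈ 3.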

Now let's make sure that this gradient descent update makes sense. Let's say that w is over here, so you are at this point on the cost function J(w). Remember that the derivative of a function at a point is the slope of the function there: the height divided by the width of a small triangle tangent to J(w) at that point. At this point the derivative is positive. w gets updated as w minus the learning rate times the derivative; since the derivative is positive, you end up subtracting from w, taking a step to the left. So gradient descent will slowly decrease the parameter if you started off with a large value of w.

As another example, if w is over here, then at this point the slope dJ/dw is negative, so in the gradient descent update the 'minus alpha times the derivative' term becomes positive. This ends up slowly increasing w, making w bigger and bigger with successive iterations. So hopefully, whether you initialize on the left or on the right, gradient descent will move the point towards the global minimum.
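The sign intuition above can be checked numerically with a simple convex function such as J(w) = w^2 (my own example, not from the lecture), whose derivative is 2w:

```python
alpha = 0.1

w = 5.0                      # right of the minimum: derivative 2*w is positive
w = w - alpha * (2 * w)      # the update subtracts, so w moves left
assert w < 5.0

w = -5.0                     # left of the minimum: derivative 2*w is negative
w = w - alpha * (2 * w)      # subtracting a negative, so w moves right
assert w > -5.0
```

In both cases the step moves w towards the minimum at w = 0, matching the argument above.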

If you have a deep knowledge of calculus, you may be able to develop deeper intuitions about how neural networks work. The overall intuition for now is that this derivative term represents the slope of the function at the current setting of the parameters, so that we know which direction to step in, taking steps of steepest descent, in order to go downhill on the cost function J.

In logistic regression, your cost function is a function of both w and b. In that case, the inner loop of gradient descent, the thing you have to repeat, becomes the following: you update w as w minus the learning rate times the derivative of J(w, b) with respect to w, and you update b as b minus the learning rate times the derivative of J(w, b) with respect to b.
These two equations are the actual update you implement.
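Putting the two update equations together, a minimal sketch of the full loop for logistic regression might look like this. The expressions for dw and db are the standard derivatives of the cross-entropy cost; the function and variable names are my own:

```python
import numpy as np

def train(X, Y, alpha=0.1, num_iters=1000):
    """Gradient descent on J(w, b) for logistic regression.

    X has shape (n_features, m); Y has shape (1, m) with labels 0 or 1.
    A minimal sketch, not optimized production code.
    """
    n, m = X.shape
    w = np.zeros((n, 1))                 # zero initialization, as in the lecture
    b = 0.0
    for _ in range(num_iters):
        Y_hat = 1.0 / (1.0 + np.exp(-(np.dot(w.T, X) + b)))  # forward pass
        dw = np.dot(X, (Y_hat - Y).T) / m  # dJ/dw
        db = np.sum(Y_hat - Y) / m         # dJ/db
        w = w - alpha * dw                 # w := w - alpha * dJ/dw
        b = b - alpha * db                 # b := b - alpha * dJ/db
    return w, b
```

On a tiny 1-D dataset where negative x values have label 0 and positive x values have label 1, the learned w comes out positive and the model classifies all four points correctly.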



In the next video, we'll give you better intuition about derivatives. 

