Great knowledge learned from the Couresra course – Machine Learning (Stanford) taught by Andrew Ng. Here is a short summary of Week 2 ‘s content.
Perhaps the most important algorithm for linear regression. And it is also very intuitively understandable. Basically speaking the whole idea is similarly like you drop a ball on a sloped curve or surface (cost function) and ball roll down the surface following the gradient so that in the end the ball will be at the lowest point of the surface. The “surface” metaphor is used here for the situation that the cost function has two variables – sometimes called the weight. The “weight” parameters in the lecture represent by , and cost function is labeled as J.
So when the ball reaches the lowest point, the cost function J has the minimum value, which means if we use this group of weights to do prediction / regression, we get the best guess. And sometimes, for example in another Machine Learning course on Coursera (the one from University Washington), the cost function is referred as RSS – Residual Sum of Squares, which is more or less the same thing here as J.
The above cost function J can be calculated, for example in MATLAB code:
function J = computeCostMulti(X, y, theta) %COMPUTECOSTMULTI Compute cost for linear regression with multiple variables % J = COMPUTECOSTMULTI(X, y, theta) computes the cost of using theta as the % parameter for linear regression to fit the data points in X and y % Initialize some useful values m = length(y); % number of training examples J = 1/2/m * sum((X*theta-y).^2) end
And the Gradient Descent algorithm is fairly simple, as described in the picture below, equation at upper right corner.
In MATLAB, it can be written a simple line of code, suppose X, y and theta are all in matrix format.
function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters) %GRADIENTDESCENTMULTI Performs gradient descent to learn theta % theta = GRADIENTDESCENTMULTI(x, y, theta, alpha, num_iters) updates theta by % taking num_iters gradient steps with learning rate alpha % Initialize some useful values m = length(y); % number of training examples J_history = zeros(num_iters, 1); for iter = 1:num_iters theta = theta - alpha/m*(X'*(X*theta - y)); % ============================================================ % Save the cost J in every iteration J_history(iter) = computeCostMulti(X, y, theta); end end
Note that the sum in the equation is accomplished by matrix product operation in the code. Or, more precisely, if sample number is m and feature number is n then:
- X is m by n+1 matrix
- y is m by 1 matrix
- theta is n+1 by 1 matrix
- X*theta is m by 1 matrix
- X’ is n+1 by m matrix
Tips on how to choose (alpha) – Learning Rate
Basically, one can start with 0.001 and make it larger by 3 times each time until the cost function may not converge. Then you may end up finding a fast and also converging alpha.
Here is pop quiz for getting the idea of the effect of different value of learning rate (or step length):
It’s also very important to notice about feature scaling when using gradient decent algorithm, especially when there are multiple features and some of them have very different scales. If those features are not scale or normalized, the algorithms may now work efficiently.
One way to scale the features is called mean normalization, basically see pictures below.
The powerful thing about gradient descent is that it is not just work for linear equations, it can be simple applied for any format of polynomial regression – one simply use higher orders of features as input features, illustrated below:
One thing worth mention is that feature scaling will become fairly important in order to make the polynomial regression work, since a high order of feature will be scaled very differently than the low order feature.
Another Method – Normal Equation
One can also choose to not using gradient decent and instead use normal equation method (called closed form solution on course from University of Washington.
The basic idea is that by setting the derivative of cost function equals zero then directly solve the equation to get values of weight parameters. One thing worth mention is that for normal equation method, feature scaling is not needed at all. However mathematically this method has the complexity of O(n^3), where n stands for feature number. Thus for a problem has many features, one should probably choose gradient descent rather than this normal equation method. More discussion on this below.
Similar topic discussed in University of Washington course, with only two weight parameters – w0 and w1:
I didn’t verify but I guess the two equations from the two different lectures to calculate the weights by setting the derivative of cost function equals zero, are giving identical results, for the situation that the length of weight parameters are two.
Differences between Gradient Descent and Normal Equation
Or when to use Gradient Descent and when to use Normal Equation?
Answer is below: