Machine Learning – Stanford -Week 2 – Linear Regression

by Fuyang

Great knowledge learned from the Couresra course – Machine Learning (Stanford) taught by Andrew Ng. Here is a short summary of Week 2 ‘s content.

Gradient Descent

Perhaps the most important algorithm for linear regression. And it is also very intuitively understandable. Basically speaking the whole idea is similarly like you drop a ball on a sloped curve or surface (cost function) and ball roll down the surface following the gradient so that in the end the ball will be at the lowest point of the surface. The “surface” metaphor is used here for the situation that the cost function has two variables – sometimes called the weight. The “weight” parameters in the lecture represent by \theta, and cost function is labeled as J.

Gradient Descent 1

So when the ball reaches the lowest point, the cost function J has the minimum value, which means if we use this group of weights to do prediction / regression, we get the best guess. And sometimes, for example in another Machine Learning course on Coursera (the one from University Washington), the cost function is referred as RSS – Residual Sum of Squares, which is more or less the same thing here as J.

The above cost function J can be calculated, for example in MATLAB code:


function J = computeCostMulti(X, y, theta)
%COMPUTECOSTMULTI Compute cost for linear regression with multiple variables
% J = COMPUTECOSTMULTI(X, y, theta) computes the cost of using theta as the
% parameter for linear regression to fit the data points in X and y

% Initialize some useful values
m = length(y); % number of training examples

J = 1/2/m * sum((X*theta-y).^2)

end

And the Gradient Descent algorithm is fairly simple, as described in the picture below, equation at upper right corner.

Gradient Descent 2

In MATLAB, it can be written a simple line of code, suppose X, y and theta are all in matrix format.


function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)
%GRADIENTDESCENTMULTI Performs gradient descent to learn theta
% theta = GRADIENTDESCENTMULTI(x, y, theta, alpha, num_iters) updates theta by
% taking num_iters gradient steps with learning rate alpha

% Initialize some useful values
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);

for iter = 1:num_iters

theta = theta - alpha/m*(X'*(X*theta - y));

% ============================================================
% Save the cost J in every iteration
J_history(iter) = computeCostMulti(X, y, theta);

end

end

Note that the sum in the equation is accomplished by matrix product operation in the code. Or, more precisely, if sample number is m and feature number is n then:

  • X is m by n+1 matrix
  • y is m by 1 matrix
  • theta is n+1 by 1 matrix
  • X*theta is m by 1 matrix
  • X’ is n+1 by m matrix

Tips on how to choose \alpha (alpha) – Learning Rate

Basically, one can start with 0.001 and make it larger by 3 times each time until the cost function may not converge. Then you may end up finding a fast and also converging alpha.

choose alpha 2

Here is pop quiz for getting the idea of the effect of different value of learning rate (or step length):

choose alpha

Feature Scaling

It’s also very important to notice about feature scaling when using gradient decent algorithm, especially when there are multiple features and some of them have very different scales. If those features are not scale or normalized, the algorithms may now work efficiently.

Feature scaling

One way to scale the features is called mean normalization, basically see pictures below.Mean normalization

Polynomial Regression

The powerful thing about gradient descent is that it is not just work for linear equations, it can be simple applied for any format of polynomial regression – one simply use higher orders of features as input features, illustrated below:

polynomial regression

One thing worth mention is that feature scaling will become fairly important in order to make the polynomial regression work, since a high order of feature will be scaled very differently than the low order  feature.

Another Method – Normal Equation

One can also choose to not using gradient decent and instead use normal equation method (called closed form solution on course from University of Washington.

The basic idea is that by setting the derivative of cost function equals zero then directly solve the equation to get values of weight parameters. One thing worth mention is that for normal equation method, feature scaling is not needed at all. However mathematically this method has the complexity of O(n^3), where n stands for feature number. Thus for a problem has many features, one should probably choose gradient descent rather than this normal equation method. More discussion on this below.

Normal Equation - Stanford1Normal Equation - Stanford2

Similar topic discussed in University of Washington course, with only two weight parameters – w0 and w1:

Normal Equation - UW

I didn’t verify but I guess the two equations from the two different lectures to calculate the weights by setting the derivative of cost function equals zero, are giving identical results, for the situation that the length of weight parameters are two.

Differences between Gradient Descent and Normal Equation

Or when to use Gradient Descent and when to use Normal Equation?

Answer is below:

Normal Equation vs Gradient Descent