Machine Learning – Stanford – Week 3 – Logistic Regression
Classification by regression
Great things learned again from the Coursera Machine Learning course, taught by Andrew Ng. Here are some of the key notes. Perhaps the first thing to notice is that, even though the title of this article seems to be about regression, it is actually a way of doing classification. So we could simply say that “logistic regression” is a “classification” method.
How we got the idea (some background info)
Following on from the previous linear regression course, a natural question is: can one use linear regression for classification?
The answer is yes, as in the simple example above, where a hypothesis line is fitted and used to classify: if h is greater than or equal to 0.5, predict class A; if h is less than 0.5, predict class B.
However, this approach has serious limitations. An obvious one: if a training sample appears far to the right of the graph (say, a malignant tumor with a very large size), the slope of the fitted line changes, and some of the other predictions become wrong.
Logistic Function (or Sigmoid Function)
To solve this, we can shape the hypothesis function h as a logistic function, defined as follows, so that the value of h always stays between 0 and 1.
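The original formula image is not reproduced here, but for reference, the logistic (sigmoid) hypothesis from the course is:

```latex
h_\theta(x) = g(\theta^T x), \qquad g(z) = \frac{1}{1 + e^{-z}}
```

Since $e^{-z} > 0$ for every $z$, $g(z)$ is always strictly between 0 and 1.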
When using the logistic function as the hypothesis, we also have to change the form of the cost function. If we kept the squared-error cost from linear regression, the nonlinearity of the logistic function would introduce many local minima into the cost, and that would stop gradient descent from reliably finding the global minimum.
Thus, we define the following cost function, which is convex and therefore “gradient descent friendly”.
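The per-example cost from the course (the formula image is not reproduced here) is:

```latex
\mathrm{Cost}(h_\theta(x), y) =
\begin{cases}
  -\log\!\big(h_\theta(x)\big)       & \text{if } y = 1 \\
  -\log\!\big(1 - h_\theta(x)\big)   & \text{if } y = 0
\end{cases}
```

Intuitively, the cost is 0 when the prediction is confidently correct and grows toward infinity as the prediction becomes confidently wrong.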
Luckily (or rather, mathematically), the two cases of the cost above can be combined and re-written as a single expression:
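The combined form, which is what the MATLAB code further down implements, is:

```latex
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}
  \Big[\, y^{(i)} \log h_\theta(x^{(i)})
        + \big(1 - y^{(i)}\big) \log\!\big(1 - h_\theta(x^{(i)})\big) \Big]
```

The trick is that when $y^{(i)} = 1$ the second term vanishes, and when $y^{(i)} = 0$ the first term vanishes, recovering the two-case definition.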
So by now we have fully set up a simple classification problem, with a classification hypothesis and a cost function ready to use.
And here is what the gradient descent algorithm looks like for Logistic Regression.
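Written out (the lecture slide is not reproduced here), the update rule is:

```latex
\theta_j := \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m}
  \Big( h_\theta(x^{(i)}) - y^{(i)} \Big) \, x_j^{(i)}
```

applied simultaneously for all $j$. Notably, this looks identical to the update rule for linear regression; the difference is that $h_\theta$ is now the sigmoid of $\theta^T x$ rather than $\theta^T x$ itself.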
And here is a MATLAB version of the implementation of the algorithm above:
```matlab
function [J, grad] = costFunction(theta, X, y)
%COSTFUNCTION Compute cost and gradient for logistic regression
%   J = COSTFUNCTION(theta, X, y) computes the cost of using theta as the
%   parameter for logistic regression and the gradient of the cost
%   w.r.t. the parameters.

% Initialize some useful values
m = length(y);                                % number of training examples

h = sigmoid(X*theta);                         % m by 1, predicted probabilities
J = 1/m * ((-y')*log(h) - (1-y')*log(1-h));   % 1 by 1, cost
grad = 1/m * X'*(h-y);                        % n by 1, gradient

end
```
And a few more words (or pictures) on the interpretation of the hypothesis function’s output – basically speaking, it outputs a probability measure.
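Concretely, the course interprets the output as:

```latex
h_\theta(x) = P(y = 1 \mid x ; \theta)
```

that is, the estimated probability that $y = 1$ given input $x$, parameterized by $\theta$; accordingly, $P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$.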
Continuing the interpretation above, we define the concept of a “decision boundary” (which is a line in the case of 2 features):
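In symbols: we predict $y = 1$ whenever $h_\theta(x) \geq 0.5$, and since $g(z) \geq 0.5$ exactly when $z \geq 0$, this is the same as predicting $y = 1$ whenever $\theta^T x \geq 0$. The decision boundary is therefore the set of points where

```latex
\theta^T x = 0
```

which, with 2 features (plus the intercept term), is a straight line in the feature plane.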
And it can be non-linear and of higher order as well:
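For example (an illustrative choice of parameters, not necessarily the one from the lecture): with the polynomial feature vector $(1,\, x_1,\, x_2,\, x_1^2,\, x_2^2)$ and $\theta = (-1,\, 0,\, 0,\, 1,\, 1)$, the boundary $\theta^T x = 0$ becomes

```latex
x_1^2 + x_2^2 = 1
```

a circle of radius 1: points outside the circle are predicted $y = 1$, points inside $y = 0$.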
So for the case of more than 2 classes, one can use the so-called “one-vs-all” (or “one-vs-rest”) method: train one binary classifier per class (that class against all the rest), then predict the class whose classifier outputs the highest probability.
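To make the one-vs-all idea concrete, here is a minimal Python/NumPy sketch (the toy data and all function names are my own, not from the course): train one logistic classifier per class with plain gradient descent, then predict by taking the most confident classifier.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y, alpha=0.1, iters=3000):
    """Fit one binary logistic regression classifier with gradient descent."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X @ theta)        # m-vector of predicted probabilities
        grad = X.T @ (h - y) / m      # gradient of the logistic cost
        theta -= alpha * grad
    return theta

def one_vs_all(X, y, num_labels):
    """Train one classifier per class: class k vs. all the rest."""
    return np.array([train_binary(X, (y == k).astype(float))
                     for k in range(num_labels)])

def predict(all_theta, X):
    """Pick the class whose classifier reports the highest probability."""
    return np.argmax(sigmoid(X @ all_theta.T), axis=1)

# Toy 3-class data (hypothetical): each class clusters in its own corner.
X_raw = np.array([[0., 0.], [1., 0.], [5., 0.], [5., 1.], [0., 5.], [1., 5.]])
y = np.array([0, 0, 1, 1, 2, 2])
X = np.hstack([np.ones((len(X_raw), 1)), X_raw])   # add intercept column

all_theta = one_vs_all(X, y, 3)
print(predict(all_theta, X))   # should recover the labels: [0 0 1 1 2 2]
```

Note that the three classifiers are trained completely independently; only the final `argmax` ties them together.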
A few more words about some other optimization algorithms (such as fminunc in Octave):
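The point of fminunc is that you only supply the cost and its gradient, and an off-the-shelf optimizer (with automatic step-size selection) does the rest. As an illustration outside Octave, here is a hedged Python sketch using SciPy's `minimize` (assuming SciPy is available; the toy data is invented for this example) with the same cost/gradient pair as the MATLAB `costFunction` above:

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_grad(theta, X, y):
    """Logistic cost J and its gradient, mirroring the MATLAB costFunction."""
    m = len(y)
    h = sigmoid(X @ theta)
    J = (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m
    grad = X.T @ (h - y) / m
    return J, grad

# Tiny 1-D example (hypothetical data): labels mostly flip around x = 2.5,
# with two overlapping points so the optimum stays finite.
X = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.], [1., 4.], [1., 5.]])
y = np.array([0., 0., 1., 0., 1., 1.])

# jac=True tells SciPy that cost_and_grad returns (cost, gradient) together,
# analogous to passing a cost function with gradient to fminunc.
res = minimize(cost_and_grad, np.zeros(2), args=(X, y),
               jac=True, method='BFGS')
print(res.x)   # learned theta; the boundary theta0 + theta1*x = 0 sits near x = 2.5
```

As with fminunc, no learning rate has to be tuned by hand; BFGS chooses its own steps using the gradient we provide.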