On the way to achieve minimalism and essentialism

# Clustering – K-means algorithm

Continuing with Andrew Ng’s course and finally finished week 8 and delivered homework before deadline. So here again just to note some key points.

## K-means

Basically speaking the K-means method is very simple and strait forward, one only need to repeat 1. assigning all samples to one of the closest centroids; 2. move centroids to the middle of those assigned samples. However the implementation is a little tricky to make correct due to the indexing.

And it is worth to mention that not only well separated clusters, but also a group of connected well (or non-separated clusters) can be also clustered via K-means method. This can be particularly useful, for example, a T-shirt company would like to only make S, M and L size of T-shirt for all people with different height and weight.

And here is a Pop-quiz:

We could imagine that, when k is relatively small (for example k in range of 2 to 10), there is a big chance that the random initialization will put some centroids very close to each other in the beginning and result in some local optima.

The way to solve this issue is also very simple – do clustering (with random initialization) multiple times and then select the centroid group with the lowest cost.

## Choosing the value of K

Another question is how should we choose or decide how many clusters should there be for our data. Andrew suggested two method:

1. Elbow method, which is running the K from 1 to many and if we could see a clear “elbow” shape of cost curve, we can choose the K of that elbow point.
2. Based on practical or business/application specification.

## Okay, so here are some pop quiz for K-means clustering:

And here I would like to add a image compression result made from the homework result – so for this example we used K-means clustering to cluster all pixels of a image with only 8 colors, and the original image is in RGB color of 256 colors. With K=8 we could use the algorithm to cluster all pixels with similar colors into one of the 8 colors. Notably that the value of the 8 colors are not pre-defined, they will gradually become some of the major colors on the picture (while the “moving centroids step” in the k-means algorithm). After the clustering is done, we can simply do the compression by finding the nearest centroid of every pixel and then assign that pixel of that color of centroid. Since we only have 8 colors, a 3 bit storage room for each pixel will be enough, while the 8 centroids are holding a RGB color each as the “cluster center” color 🙂

# Principal Component Analysis – PCA

The motivation of principal component analysis is mainly about data compression – for example, how to casting a high dimension data down to a lower dimension space. This will help saving space of data storage, also improving certain algorithm speed, and also to help visualization since a data more than 3D, 4D will become impossible to be visualized directly.

So the so-called Principal Component Analysis (PCA) is just a method of reduce high dimension data to lower dimension data.

One thing we should notice is that even though it looks similar, but PCA and linear regression are two different algorithms and the goal is also totally not the same. For linear regress we have a value y we would like to predict, however for PCA all data are treated the same, there is no special label value.

## PCA – How to

As Andrew mentioned, the mathematical proof of the following way of doing PCA is very difficult and also beyond the scope of the course, so here we just need to know how to use it instead. It turned out to just do a dimension reduction is actually very simple. Two lines of matlab code would be enough. The key is to utilize some build in “svd” function.

And a summary of the stuff above:

Again, here a pop quiz

Another quiz:

## PCA – What can I use if for?

And some final pop quiz for under standing the topic better:

### Machine Learning – Stanford -Week 5 – Neural Networks: Implementation

Continue with the Machine Learning course by Andrew Ng. This chapter we try to implement a simple Neural Network Classification algorithm.

Below is the definition of the problem.

## Cost function

Cost function is calculated as below.

Using the equation shown above, a sample MATLAB code calculating the cost function of a one layer Neural Network,  can be like this:

```% First use forward propagation calculate the output h
a1 = [ones(m,1) X]; % m x 401
a2 = [ones(m,1) sigmoid(a1 * Theta1')]; % (m x 25) -> m x 26
a3 = sigmoid(a2 * Theta2'); % (m x 26) * (26 x 10) = m by 10

% y is m by 10
h = a3;

for m_= 1:m
a = 1:num_labels; % a is a temp para
Y = (a == y(m_)); % classification label, 1 by 10 matrix
J = J + ((-Y) * log(h(m_,:)') - (1-Y) * log(1-h(m_,:)')) ;
end

J = J/m;

% Plus regularization term
J = J + lambda/(2*m)* ( sum(sum(Theta1(:,2:end).^2)) ...
+ sum(sum(Theta2(:,2:end).^2)));
```

The following graphs are shown to illustrate the method of computing gradient.

Following the previous example code, here we can see am MATLAB implementation of a simple one layer Neural Network algorithm, with the part of code that calculate the gradient:

```D1 = zeros(size(Theta1));
D2 = zeros(size(Theta2));

% Part 2 - back propagation
for t = 1:m

% Step 1: perform forward propagation

a1 = [1 X(t,:)]; % 1 x 401

z2 = a1 * Theta1'; % 1x25
a2 = [1 sigmoid(z2)]; % (1 x 25) -&amp;gt; 1 x 26

z3 = a2 * Theta2'; % (1 x 26) * (26 x 10) = 1 by 10
a3 = sigmoid(z3) ;% 1x10

% Step 2: using y to calculate delta_L

a = 1:num_labels; % a is a temp para
Y = (a == y(t)); % making Y matrix as classification label

d3 = a3 - Y; % 1 by 10

% Step 3: backward propagation to calculate delta_L-1,
% delta_L-2, ... until delta_2. (this example only have one layer,
% so only need to calculate delta_2)

d2 = Theta2' * d3'; % 26 x 1
d2 = d2(2:end); % 25 x 1

% Alternatively:
%d2 = Theta2' * d3' .* a2' .* (1-a2)'; % 26 x 1
%d2 = d2(2:end); % 25 x 1

% Step 4: accumulate Delta value for all m input data sample
% Theta1 has size 25 x 401
% Theta2 has size 10 x 26

D2 = D2 + d3' * a2; % 10 x 26
D1 = D1 + d2 * a1; % 25 x 401

end

% Finally, calculate the gradient for all theta
Theta1_grad = 1/m*D1 + lambda/m*[zeros(size(Theta1,1),1) Theta1(:, 2:end)];
Theta2_grad = 1/m*D2 + lambda/m*[zeros(size(Theta2,1),1) Theta2(:, 2:end)];
```

Some another illustrations to show how forward propagation and backward propagation are working. You might also want to use it to understand better on how to do the algorithm implementation.

A sample of how to use advanced optimization function to find the best Theta solutions (with some vector reshape operation needed):

Here is a pop quiz about it:

A short summary about the learning algorithm procedure.

Since the Neural Network algorithm is quite complicated and might be very buggy, so one good practice is that during implementation process, one can simultaneously calculate gradients by a numerical estimation method, and check if this value close enough to the gradient calculated by the learning algorithm.

This will help to generate the bug free code. And one should remember to turn the gradient checking function off when using the learning algorithm in production environment, since this numerical estimation method is very computational expensive.

## Theta Initialization

Why do we have to use a random theta initialization?

## End of the course. Pop quiz

Congratulations, if you have followed the course so far, you will have a good understanding about Neural Network learning algorithms. If you also have done the course MATLAB exercise, you will be amazed by how “simple” it is that, just with a few lines of code, a learning algorithm can learn by itself to recognize hand written numbers.

I personally think Neural Network Learning is a very powerful tool and in the future it might have great potential to form very intelligent programs that can automatize lots of tedious works for people.

Thank you Andrew Ng for providing such a great course for everyone on the planet who is interested in machine learning. You are wonderful 🙂