On the way to achieve minimalism and essentialism

# Clustering – K-means algorithm

Continuing with Andrew Ng’s course and finally finished week 8 and delivered homework before deadline. So here again just to note some key points.

## K-means

Basically speaking the K-means method is very simple and strait forward, one only need to repeat 1. assigning all samples to one of the closest centroids; 2. move centroids to the middle of those assigned samples. However the implementation is a little tricky to make correct due to the indexing.

And it is worth to mention that not only well separated clusters, but also a group of connected well (or non-separated clusters) can be also clustered via K-means method. This can be particularly useful, for example, a T-shirt company would like to only make S, M and L size of T-shirt for all people with different height and weight.

And here is a Pop-quiz:

We could imagine that, when k is relatively small (for example k in range of 2 to 10), there is a big chance that the random initialization will put some centroids very close to each other in the beginning and result in some local optima.

The way to solve this issue is also very simple – do clustering (with random initialization) multiple times and then select the centroid group with the lowest cost.

## Choosing the value of K

Another question is how should we choose or decide how many clusters should there be for our data. Andrew suggested two method:

1. Elbow method, which is running the K from 1 to many and if we could see a clear “elbow” shape of cost curve, we can choose the K of that elbow point.
2. Based on practical or business/application specification.

## Okay, so here are some pop quiz for K-means clustering:

And here I would like to add a image compression result made from the homework result – so for this example we used K-means clustering to cluster all pixels of a image with only 8 colors, and the original image is in RGB color of 256 colors. With K=8 we could use the algorithm to cluster all pixels with similar colors into one of the 8 colors. Notably that the value of the 8 colors are not pre-defined, they will gradually become some of the major colors on the picture (while the “moving centroids step” in the k-means algorithm). After the clustering is done, we can simply do the compression by finding the nearest centroid of every pixel and then assign that pixel of that color of centroid. Since we only have 8 colors, a 3 bit storage room for each pixel will be enough, while the 8 centroids are holding a RGB color each as the “cluster center” color 🙂

# Principal Component Analysis – PCA

The motivation of principal component analysis is mainly about data compression – for example, how to casting a high dimension data down to a lower dimension space. This will help saving space of data storage, also improving certain algorithm speed, and also to help visualization since a data more than 3D, 4D will become impossible to be visualized directly.

So the so-called Principal Component Analysis (PCA) is just a method of reduce high dimension data to lower dimension data.

One thing we should notice is that even though it looks similar, but PCA and linear regression are two different algorithms and the goal is also totally not the same. For linear regress we have a value y we would like to predict, however for PCA all data are treated the same, there is no special label value.

## PCA – How to

As Andrew mentioned, the mathematical proof of the following way of doing PCA is very difficult and also beyond the scope of the course, so here we just need to know how to use it instead. It turned out to just do a dimension reduction is actually very simple. Two lines of matlab code would be enough. The key is to utilize some build in “svd” function.

And a summary of the stuff above:

Again, here a pop quiz

Another quiz:

## PCA – What can I use if for?

And some final pop quiz for under standing the topic better:

### Machine Learning – Stanford -Week 5 – Neural Networks: Implementation

Continue with the Machine Learning course by Andrew Ng. This chapter we try to implement a simple Neural Network Classification algorithm.

Below is the definition of the problem.

## Cost function

Cost function is calculated as below.

Using the equation shown above, a sample MATLAB code calculating the cost function of a one layer Neural Network,  can be like this:

% First use forward propagation calculate the output h
a1 = [ones(m,1) X]; % m x 401
a2 = [ones(m,1) sigmoid(a1 * Theta1')]; % (m x 25) -> m x 26
a3 = sigmoid(a2 * Theta2'); % (m x 26) * (26 x 10) = m by 10

% y is m by 10
h = a3;

for m_= 1:m
a = 1:num_labels; % a is a temp para
Y = (a == y(m_)); % classification label, 1 by 10 matrix
J = J + ((-Y) * log(h(m_,:)') - (1-Y) * log(1-h(m_,:)')) ;
end

J = J/m;

% Plus regularization term
J = J + lambda/(2*m)* ( sum(sum(Theta1(:,2:end).^2)) ...
+ sum(sum(Theta2(:,2:end).^2)));

The following graphs are shown to illustrate the method of computing gradient.

Following the previous example code, here we can see am MATLAB implementation of a simple one layer Neural Network algorithm, with the part of code that calculate the gradient:

D1 = zeros(size(Theta1));
D2 = zeros(size(Theta2));

% Part 2 - back propagation
for t = 1:m

% Step 1: perform forward propagation

a1 = [1 X(t,:)]; % 1 x 401

z2 = a1 * Theta1'; % 1x25
a2 = [1 sigmoid(z2)]; % (1 x 25) -&amp;gt; 1 x 26

z3 = a2 * Theta2'; % (1 x 26) * (26 x 10) = 1 by 10
a3 = sigmoid(z3) ;% 1x10

% Step 2: using y to calculate delta_L

a = 1:num_labels; % a is a temp para
Y = (a == y(t)); % making Y matrix as classification label

d3 = a3 - Y; % 1 by 10

% Step 3: backward propagation to calculate delta_L-1,
% delta_L-2, ... until delta_2. (this example only have one layer,
% so only need to calculate delta_2)

d2 = Theta2' * d3'; % 26 x 1
d2 = d2(2:end); % 25 x 1

% Alternatively:
%d2 = Theta2' * d3' .* a2' .* (1-a2)'; % 26 x 1
%d2 = d2(2:end); % 25 x 1

% Step 4: accumulate Delta value for all m input data sample
% Theta1 has size 25 x 401
% Theta2 has size 10 x 26

D2 = D2 + d3' * a2; % 10 x 26
D1 = D1 + d2 * a1; % 25 x 401

end

% Finally, calculate the gradient for all theta
Theta1_grad = 1/m*D1 + lambda/m*[zeros(size(Theta1,1),1) Theta1(:, 2:end)];
Theta2_grad = 1/m*D2 + lambda/m*[zeros(size(Theta2,1),1) Theta2(:, 2:end)];

Some another illustrations to show how forward propagation and backward propagation are working. You might also want to use it to understand better on how to do the algorithm implementation.

A sample of how to use advanced optimization function to find the best Theta solutions (with some vector reshape operation needed):

Here is a pop quiz about it:

A short summary about the learning algorithm procedure.

Since the Neural Network algorithm is quite complicated and might be very buggy, so one good practice is that during implementation process, one can simultaneously calculate gradients by a numerical estimation method, and check if this value close enough to the gradient calculated by the learning algorithm.

This will help to generate the bug free code. And one should remember to turn the gradient checking function off when using the learning algorithm in production environment, since this numerical estimation method is very computational expensive.

## Theta Initialization

Why do we have to use a random theta initialization?

## End of the course. Pop quiz

Congratulations, if you have followed the course so far, you will have a good understanding about Neural Network learning algorithms. If you also have done the course MATLAB exercise, you will be amazed by how “simple” it is that, just with a few lines of code, a learning algorithm can learn by itself to recognize hand written numbers.

I personally think Neural Network Learning is a very powerful tool and in the future it might have great potential to form very intelligent programs that can automatize lots of tedious works for people.

Thank you Andrew Ng for providing such a great course for everyone on the planet who is interested in machine learning. You are wonderful 🙂

### Machine Learning – Stanford -Week 2 – Linear Regression

Great knowledge learned from the Couresra course – Machine Learning (Stanford) taught by Andrew Ng. Here is a short summary of Week 2 ‘s content.

Perhaps the most important algorithm for linear regression. And it is also very intuitively understandable. Basically speaking the whole idea is similarly like you drop a ball on a sloped curve or surface (cost function) and ball roll down the surface following the gradient so that in the end the ball will be at the lowest point of the surface. The “surface” metaphor is used here for the situation that the cost function has two variables – sometimes called the weight. The “weight” parameters in the lecture represent by $\theta$, and cost function is labeled as J.

So when the ball reaches the lowest point, the cost function J has the minimum value, which means if we use this group of weights to do prediction / regression, we get the best guess. And sometimes, for example in another Machine Learning course on Coursera (the one from University Washington), the cost function is referred as RSS – Residual Sum of Squares, which is more or less the same thing here as J.

The above cost function J can be calculated, for example in MATLAB code:

function J = computeCostMulti(X, y, theta)
%COMPUTECOSTMULTI Compute cost for linear regression with multiple variables
% J = COMPUTECOSTMULTI(X, y, theta) computes the cost of using theta as the
% parameter for linear regression to fit the data points in X and y

% Initialize some useful values
m = length(y); % number of training examples

J = 1/2/m * sum((X*theta-y).^2)

end

And the Gradient Descent algorithm is fairly simple, as described in the picture below, equation at upper right corner.

In MATLAB, it can be written a simple line of code, suppose X, y and theta are all in matrix format.

function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)
% taking num_iters gradient steps with learning rate alpha

% Initialize some useful values
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);

for iter = 1:num_iters

theta = theta - alpha/m*(X'*(X*theta - y));

% ============================================================
% Save the cost J in every iteration
J_history(iter) = computeCostMulti(X, y, theta);

end

end

Note that the sum in the equation is accomplished by matrix product operation in the code. Or, more precisely, if sample number is m and feature number is n then:

• X is m by n+1 matrix
• y is m by 1 matrix
• theta is n+1 by 1 matrix
• X*theta is m by 1 matrix
• X’ is n+1 by m matrix

## Tips on how to choose $\alpha$ (alpha) – Learning Rate

Basically, one can start with 0.001 and make it larger by 3 times each time until the cost function may not converge. Then you may end up finding a fast and also converging alpha.

Here is pop quiz for getting the idea of the effect of different value of learning rate (or step length):

## Feature Scaling

It’s also very important to notice about feature scaling when using gradient decent algorithm, especially when there are multiple features and some of them have very different scales. If those features are not scale or normalized, the algorithms may now work efficiently.

One way to scale the features is called mean normalization, basically see pictures below.

## Polynomial Regression

The powerful thing about gradient descent is that it is not just work for linear equations, it can be simple applied for any format of polynomial regression – one simply use higher orders of features as input features, illustrated below:

One thing worth mention is that feature scaling will become fairly important in order to make the polynomial regression work, since a high order of feature will be scaled very differently than the low order  feature.

## Another Method – Normal Equation

One can also choose to not using gradient decent and instead use normal equation method (called closed form solution on course from University of Washington.

The basic idea is that by setting the derivative of cost function equals zero then directly solve the equation to get values of weight parameters. One thing worth mention is that for normal equation method, feature scaling is not needed at all. However mathematically this method has the complexity of O(n^3), where n stands for feature number. Thus for a problem has many features, one should probably choose gradient descent rather than this normal equation method. More discussion on this below.

Similar topic discussed in University of Washington course, with only two weight parameters – w0 and w1:

I didn’t verify but I guess the two equations from the two different lectures to calculate the weights by setting the derivative of cost function equals zero, are giving identical results, for the situation that the length of weight parameters are two.

## Differences between Gradient Descent and Normal Equation

Or when to use Gradient Descent and when to use Normal Equation?

### Data Science and Machine Learning Essentials – Pop Quiz 1

The company Internet Movies, Inc. has found wide success in their streaming movie business. After many long and successful years of delivering content, they have decided to use machine learning to make their business even more successful. Luckily, they already possess a huge dataset that has grown over years and years of user activity – but they need your help to make sense of it! Answer the following questions

1. Let’s start with a simple case. Assume user Alice is a particularly good member and she makes sure to rate every movie she ever watches on the website. What machine learning approach would be better for the company to use for determining whether she would be interested in a new specific movie?

• Supervised
• Unsupervised

2. Bob, on the other hand, is not that much into ratings. He does watch a lot of movies, but never takes the time to rate them. For users like Bob, which of the following data can the company use to determine potential interest in a specific movie? Check all that apply.

• Metadata of movies: actors, director, genre,etc.
• Length of the movie
• Popularity of the movie among other users

3. What machine learning approach should the company use for cases like Bob?

• Supervised
• Unsupervised

Now that the company has some idea about how to use the data, it’s time to design a classifier. Our classifier will be very simple: given a movie and a user, it will classify the movie as either “Good” or “Bad” for this user.

4. Assume all the users of the company have a very simple rule in their movie taste: they like it if Tom Cruise has the lead role. Any other data is mostly irrelevant. However, no one in the company knows about this fact. Which of the following clustering models might be able to detect this rule? Check all that apply.

• Supervised (label: rating), with data: Director, language, genre
• Supervised (label: rating), with data: Movie length, lead role, director
• Unsupervised, with data: Lead role, genre, director
• Unsupervised, with data: Number of ratings, lead role

5. Looking at the options they’re given, the board members choose to go with a supervised model with lead role as data. You become outraged. “How can you not include movie length? It’s incredibly important! Who watches a 3 hour long movie –” Your fellow data scientist interrupts you. “Yeah, I agree, but look at these initial results. You see, if we remove movie length, …” What can your colleague (correctly) say to convince you? Check all that apply.

• “we can reduce inter-cluster dissimilarity.”
• “we can reduce intra-cluster dissimilarity.”
• “the model starts to work. It doesn’t work otherwise.”
• “we can consume less memory, and the results look almost the same.”

1. Supervised

2. Choose:

• Metadata of movies: actors, director, genre, etc.
• Popularity of the movie among other users

Explanation:Length of the movie may or may not be considered. We are more likely to consider it if we consider very long movies or very short movies as well.

3. Unsupervised

4. Choose:

• Supervised (label: rating), with data: Movie length, lead role, director
• Unsupervised, with data: Lead role, movie length, rating

Explanation: For supervised learning, we already have ratings for each label. So for the data to differentiate between those with Tom Cruise or not, we must have the leading role as one of our features. For unsupervised learning, we must have the lead role as part of our data. In addition, we must have a way to know whether users like the movie or not, so we must also know the rating.

5. Choose:

• “we can reduce intra-cluster dissimilarity.”
• “we can consume less memory, and the results look almost the same.”

Explanation: Movie lengths are not sufficiently different between most movies in order to help us in making the clusters. Reducing intra-cluster dissimilarity is correct because having less features allows us to differentiate elements within a cluster from each other. Consuming less memory with similar results is correct because we do not need as much storage and will save money. The supervised learning model will work and has no limits on the number of features allowed.

Note: above content is from edX.

# Simple Linear Regression

• When operating on the data in the training and test data-sets, the difference between the known y value and the value calculated by f(x) must be minimized. In other words, for a single entity, y – f(x) should be close to zero. This is controlled by the values of any baseline variables (b) used in the function.
• For various reasons, computers work better with smooth functions than with absolute values, so the values are squared. In other words, (y – f(x))2 should be close to zero.
• To apply this rule to all rows in the training or test data, the squared values are summed over the whole dataset, so ∑(yi – f(xi))2 should be close to zero. This quantity is known as the sum of squares errors (or SSE). The regression algorithm minimizes this by adjusting the values of baseline variables (b) used in the function.
• In statistics, the residual sum of squares (RSS), also known as the sum of squared residuals (SSR) or the sum of squared errors of prediction (SSE).

# Ridge Regression

• Minimizing the SSE for simple linear regression models (where there is only a single x value) generally works well, but when there are multiple x values, it can lead to over-fitting. To avoid this, the you can use ridge regression, in which the regression algorithm adds a regularization term to the SSE. This helps achieve a balance of accuracy and simplicity in the function.

• Note those two terms up there:
1. Minimizing the first term, is just asking the computer to keep predicting the truth based on the training set.
2. Minimizing the second term, is like asking to keep the model “simple” – Principle of Occam’s Razor.
• Support vector machine regression is an alternative to ridge regression that uses a similar technique to minimize the difference between the predicted f(x) values and the known y values.

# Nested Cross-Validation

• To determine the best algorithm for a specific set of data, you can use cross-validation (CV), in which the data is divided into folds, with each fold in turn being used to test algorithms that are trained on the other folds.
• To compare algorithms, the algorithms that performed the best was the one with the best average out-of-sample performance across the 10 test folds. Also one can check the standard deviation of each folds.
• To determine the optimal value for a parameter (k) for an algorithm, you can use nested cross-validation, in which one of the folds in the training set is used to validate all possible k values. This process is repeated so that each fold in the training set is used as a validation fold. The best result is then tested with the test fold, and the whole process is repeated again until every fold has been used as the test fold.

# Classification and Regression

• Classification and regression use data with known values to train a machine learning model so that it can identify unknown values for other data entities with similar attributes.
• Classification is used to identify Boolean (True/False) values. Regression is used to identify real numeric values. So a question like “In this a chair?” is a classification problem, while “How much does this person earn?” is a regression problem.
• Both classification and regression are examples of supervised learning, in which a machine learning model is trained using a set of existing, known data values. The basic principle is as follows:
1. Define data entities based on a collection (or vector) of numeric variables (which we call features) and a single predictable value (which we call a label). In classification, the label has a value of -1 for False and +1 for True.
2. Assemble a training set of data that contains many entities with known feature and label values – we call the set of feature values x and the label value y.
3. Use the training set and a classification or regression algorithm to train a machine learning model to determine a function (which we call f) that operates on x to produce y.
4. The trained model can now use the function f(x)=y to calculate the label (y) for new entities based on their feature values (x). In classification, if y is positive, the label is True; if y is negative, the label is False. The function can be visualized as a line on a chart, showing the predicted y value for a given value. The predicted values should be close to the actual known values for the training set, as shown in figure 1 below.
5. You can evaluate the model by applying it to a set of test data with known label (y) values. The accuracy of the model can be validated by comparing the value calculated by f(x) with the known value for y, as shown in figure 2.

Figure 1: A trained model

Figure 2: A validated model

• In a supervised learning model, the goal is to produce a function that accurately calculates the known label values for the training set, but which is also generalized enough to predict accurately for known values in a test set (and for unknown values in production data). Functions that only work accurately with the training data are referred to as “over-fitted”, and functions that are too general and don’t match the training data closely enough are referred to as “under-fitted”. In general, the best functions are ones that are complex enough to accurately reflect the overall trend of the training data, but which are simple enough to calculate accurate values for unknown labels.

Figure 3: An over-fitted model

Figure 4: An under-fitted model

# Clustering

• Clustering is an unsupervised learning technique in which machine learning is used to group (or cluster) data entities based on similar features.
• It is difficult to evaluate clustering models, because there is no training set with known values that you can use to train the model and compare its results.

# Recommendation

• Recommender systems are machine learning solutions that match individuals to items based on the preferences of other similar individuals, or other similar items that the individual is already known to like.
• Recommendation is one of the most commonly used forms of machine learning.