Fuyang's Blog

On the way to achieve minimalism and essentialism

Tag: Data Science

Machine Learning – Stanford -Week 8 – Clustering and PCA

Clustering – K-means algorithm

Continuing with Andrew Ng’s course and finally finished week 8 and delivered homework before deadline. So here again just to note some key points.


Basically speaking the K-means method is very simple and strait forward, one only need to repeat 1. assigning all samples to one of the closest centroids; 2. move centroids to the middle of those assigned samples. However the implementation is a little tricky to make correct due to the indexing.


And it is worth to mention that not only well separated clusters, but also a group of connected well (or non-separated clusters) can be also clustered via K-means method. This can be particularly useful, for example, a T-shirt company would like to only make S, M and L size of T-shirt for all people with different height and weight.1.Unsupervised.1.K-Means.2



And here is a Pop-quiz:1.Unsupervised.1.K-Means.4.OptimizationObjective.q11.Unsupervised.1.K-Means.4.OptimizationObjective.q2


More about random initialization


We could imagine that, when k is relatively small (for example k in range of 2 to 10), there is a big chance that the random initialization will put some centroids very close to each other in the beginning and result in some local optima.

The way to solve this issue is also very simple – do clustering (with random initialization) multiple times and then select the centroid group with the lowest cost.


Choosing the value of K

Another question is how should we choose or decide how many clusters should there be for our data. Andrew suggested two method:

  1. Elbow method, which is running the K from 1 to many and if we could see a clear “elbow” shape of cost curve, we can choose the K of that elbow point.
  2. Based on practical or business/application specification.



Okay, so here are some pop quiz for K-means clustering:



And here I would like to add a image compression result made from the homework result – so for this example we used K-means clustering to cluster all pixels of a image with only 8 colors, and the original image is in RGB color of 256 colors. With K=8 we could use the algorithm to cluster all pixels with similar colors into one of the 8 colors. Notably that the value of the 8 colors are not pre-defined, they will gradually become some of the major colors on the picture (while the “moving centroids step” in the k-means algorithm). After the clustering is done, we can simply do the compression by finding the nearest centroid of every pixel and then assign that pixel of that color of centroid. Since we only have 8 colors, a 3 bit storage room for each pixel will be enough, while the 8 centroids are holding a RGB color each as the “cluster center” color 🙂result


Principal Component Analysis – PCA

The motivation of principal component analysis is mainly about data compression – for example, how to casting a high dimension data down to a lower dimension space. This will help saving space of data storage, also improving certain algorithm speed, and also to help visualization since a data more than 3D, 4D will become impossible to be visualized directly.

So the so-called Principal Component Analysis (PCA) is just a method of reduce high dimension data to lower dimension data.2.Unsupervised.2.PrincipalComponnentAnalysis(PCA).1

One thing we should notice is that even though it looks similar, but PCA and linear regression are two different algorithms and the goal is also totally not the same. For linear regress we have a value y we would like to predict, however for PCA all data are treated the same, there is no special label value.2.Unsupervised.2.PrincipalComponnentAnalysis(PCA).2

PCA – How to

As Andrew mentioned, the mathematical proof of the following way of doing PCA is very difficult and also beyond the scope of the course, so here we just need to know how to use it instead. It turned out to just do a dimension reduction is actually very simple. Two lines of matlab code would be enough. The key is to utilize some build in “svd” function.


And a summary of the stuff above:2.Unsupervised.2.PrincipalComponnentAnalysis(PCA).5


Again, here a pop quiz






How to choose which level of dimension k we can reduce to?


Another quiz:2.Unsupervised.2.PrincipalComponnentAnalysis(PCA).7.chooseK.4



PCA – What can I use if for?



And some final pop quiz for under standing the topic better:


Data Science and Machine Learning Essentials – Pop Quiz 1

The company Internet Movies, Inc. has found wide success in their streaming movie business. After many long and successful years of delivering content, they have decided to use machine learning to make their business even more successful. Luckily, they already possess a huge dataset that has grown over years and years of user activity – but they need your help to make sense of it! Answer the following questions

1. Let’s start with a simple case. Assume user Alice is a particularly good member and she makes sure to rate every movie she ever watches on the website. What machine learning approach would be better for the company to use for determining whether she would be interested in a new specific movie?

  • Supervised
  • Unsupervised


2. Bob, on the other hand, is not that much into ratings. He does watch a lot of movies, but never takes the time to rate them. For users like Bob, which of the following data can the company use to determine potential interest in a specific movie? Check all that apply.

  • Metadata of movies: actors, director, genre,etc.
  • Length of the movie
  • Popularity of the movie among other users
  • User login patterns


3. What machine learning approach should the company use for cases like Bob?

  • Supervised
  • Unsupervised


Now that the company has some idea about how to use the data, it’s time to design a classifier. Our classifier will be very simple: given a movie and a user, it will classify the movie as either “Good” or “Bad” for this user.

4. Assume all the users of the company have a very simple rule in their movie taste: they like it if Tom Cruise has the lead role. Any other data is mostly irrelevant. However, no one in the company knows about this fact. Which of the following clustering models might be able to detect this rule? Check all that apply.

  • Supervised (label: rating), with data: Director, language, genre
  • Supervised (label: rating), with data: Movie length, lead role, director
  • Unsupervised, with data: Lead role, genre, director
  • Unsupervised, with data: Number of ratings, lead role


5. Looking at the options they’re given, the board members choose to go with a supervised model with lead role as data. You become outraged. “How can you not include movie length? It’s incredibly important! Who watches a 3 hour long movie –” Your fellow data scientist interrupts you. “Yeah, I agree, but look at these initial results. You see, if we remove movie length, …” What can your colleague (correctly) say to convince you? Check all that apply.

  • “we can reduce inter-cluster dissimilarity.”
  • “we can reduce intra-cluster dissimilarity.”
  • “the model starts to work. It doesn’t work otherwise.”
  • “we can consume less memory, and the results look almost the same.”


1. Supervised

2. Choose:

  • Metadata of movies: actors, director, genre, etc.
  • Popularity of the movie among other users

Explanation:Length of the movie may or may not be considered. We are more likely to consider it if we consider very long movies or very short movies as well.

3. Unsupervised

4. Choose:

  • Supervised (label: rating), with data: Movie length, lead role, director
  • Unsupervised, with data: Lead role, movie length, rating

Explanation: For supervised learning, we already have ratings for each label. So for the data to differentiate between those with Tom Cruise or not, we must have the leading role as one of our features. For unsupervised learning, we must have the lead role as part of our data. In addition, we must have a way to know whether users like the movie or not, so we must also know the rating.

5. Choose:

  • “we can reduce intra-cluster dissimilarity.”
  • “we can consume less memory, and the results look almost the same.”

Explanation: Movie lengths are not sufficiently different between most movies in order to help us in making the clusters. Reducing intra-cluster dissimilarity is correct because having less features allows us to differentiate elements within a cluster from each other. Consuming less memory with similar results is correct because we do not need as much storage and will save money. The supervised learning model will work and has no limits on the number of features allowed.

Note: above content is from edX.


Data Science and Machine Learning Essentials – Module1- Clustering


  • Clustering groups data entities based on their feature values. Each data entity is described as a vector of one or more numeric features (x), which can be used to plot the entity as a specific point.
  • K-Means clustering is a technique in which the algorithm groups the data entities into a specified number of clusters (k). To begin, the algorithm selects k random centroid points, and assigns each data entity to a cluster based on the shortest distance to a centroid. Each centroid is then moved to the central point (i.e. the mean) of its cluster, and the distances between the data entities and the new centroids are evaluated, with entities reassigned to another cluster if it is closer. This process continues until every data entity is closer to the mean centroid of its cluster than to the centroid of any other cluster. The following image shows the result of K=Means clustering with a k value of 3:KMeans
  • Hierarchical Agglomerative Clustering is an alternative clustering technique in which the point representing each data entity begins as its own cluster. Each cluster is then merged with the cluster closest to it, and this merging process continues iteratively. The following image illustrates the process of Hierarchical Agglomerative Clustering.HierarchicalCluster
  • The distance metric used to determine how “close” points are to one another is an important aspect of clustering. The simplest way to conceptualize the entities is as points in Euclidean space (multidimensional coordinates), and measure the simple distance between the points; but other distance metrics can also be used.

Data Science and Machine Learning Essentials – Module1- Classification

Classification and Loss Function

  • For a classification function to work accurately, when operating on the data in the training and test datasets, the number of times that the sign of f(x) does not equal y must be minimized. In other words, for a single entity, if y is positive, f(x) should be positive; and if y is negative, f(x) should be negative. Formally, we need to minimize cases where y ≠ sign(f(x)) to let the classification function to work accurately.loss function
  • Because same-signed numbers when multiplied together always produce a positive, and numbers of different signs multiplied together always produce a negative, we can simplify our goal to minimize cases where yf(x) < 0; or for the whole data set iyif(xi) < 0. This general approach is known as a loss function.
  • As with regression algorithms, some classification algorithms add a regularization term to avoid over-fitting so that the function achieves a balance of accuracy and simplicity (Occam’s Razor again).loss function 2
  • Each classification algorithm (for example AdaBoost, Support Vector Machines, and Logistic Regression) uses a specific loss function implementation, and it’s this that distinguishes classification algorithms from one another.

Decision Trees and Multi-Class ClassificationDecision Trees

  • Decision Trees are classification algorithms that define a sequence of branches. At each branch intersection, the feature value (x) is compared to a specific function, and the result determines which branch the algorithm follows. All branches eventually lead to a predicted value (-1 or +1). Most decision trees algorithms have been around for a while, and many produce low accuracy. However, boosted decision trees (AdaBoost applied to a decision tree) can be very effective.
  • You can use a “one vs. all” technique to extend binary classification (which predicts a Boolean value) so that it can be used in multi-class classification. This approach involves applying multiple binary classifications (for example, “is this a chair?”, “is this a bird?”, and so on) and reviewing the results produced by f(x) for each test. Since f(x) produces a numeric result, the predicted value is a measure of confidence in the prediction (so for example, a high positive result for “is this a chair?” combined with a low positive result for “is this a bird?” and a high negative result for “is this an elephant?” indicates a high degree of confidence that the object is more likely to be a chair than a bird, and very unlikely to be an elephant.)

Imbalanced Data

  • When the training data is imbalanced (so a high proportion of the data has the same True/False value for y) , the accuracy of the classification algorithm can be compromised. To overcome this problem, you can “over-sample” or “weight” data with the minority y value to balance the algorithm.ROC

ROC – Receiver Operator Characteristic

  • The quality of a classification model can be assessed by plotting the True Positive Rate (predicted positives / actual positives) against the False Positive Rate (predicted negatives / actual negatives) for various parameter values on a chart to create a receiver operator characteristic (ROC) curve. The quality of the model is reflected in the area under the curve (AOC). The larger this area is than the area under a straight diagonal line (representing a 50% accuracy rate that can be achieved purely by guessing), the better the model; as shown below:


Data Science and Machine Learning Essentials – Module1- Regression

Simple Linear Regression

  • When operating on the data in the training and test data-sets, the difference between the known y value and the value calculated by f(x) must be minimized. In other words, for a single entity, y – f(x) should be close to zero. This is controlled by the values of any baseline variables (b) used in the function.
  • For various reasons, computers work better with smooth functions than with absolute values, so the values are squared. In other words, (y – f(x))2 should be close to zero.
  • To apply this rule to all rows in the training or test data, the squared values are summed over the whole dataset, so ∑(yi – f(xi))2 should be close to zero. This quantity is known as the sum of squares errors (or SSE). The regression algorithm minimizes this by adjusting the values of baseline variables (b) used in the function.
  • In statistics, the residual sum of squares (RSS), also known as the sum of squared residuals (SSR) or the sum of squared errors of prediction (SSE).


Ridge Regression

  • Minimizing the SSE for simple linear regression models (where there is only a single x value) generally works well, but when there are multiple x values, it can lead to over-fitting. To avoid this, the you can use ridge regression, in which the regression algorithm adds a regularization term to the SSE. This helps achieve a balance of accuracy and simplicity in the function.

Ridge Regression

  • Note those two terms up there:
    1. Minimizing the first term, is just asking the computer to keep predicting the truth based on the training set.
    2. Minimizing the second term, is like asking to keep the model “simple” – Principle of Occam’s Razor.
  • Support vector machine regression is an alternative to ridge regression that uses a similar technique to minimize the difference between the predicted f(x) values and the known y values.

Nested Cross-Validation

  • To determine the best algorithm for a specific set of data, you can use cross-validation (CV), in which the data is divided into folds, with each fold in turn being used to test algorithms that are trained on the other folds.
  • To compare algorithms, the algorithms that performed the best was the one with the best average out-of-sample performance across the 10 test folds. Also one can check the standard deviation of each folds.
  • To determine the optimal value for a parameter (k) for an algorithm, you can use nested cross-validation, in which one of the folds in the training set is used to validate all possible k values. This process is repeated so that each fold in the training set is used as a validation fold. The best result is then tested with the test fold, and the whole process is repeated again until every fold has been used as the test fold.

Nested cross-validation

Data Science and Machine Learning Essentials – Module1- Classification, Regression, Clustering and Recommendation

Classification and Regression

  • Classification and regression use data with known values to train a machine learning model so that it can identify unknown values for other data entities with similar attributes.
  • Classification is used to identify Boolean (True/False) values. Regression is used to identify real numeric values. So a question like “In this a chair?” is a classification problem, while “How much does this person earn?” is a regression problem.
  • Both classification and regression are examples of supervised learning, in which a machine learning model is trained using a set of existing, known data values. The basic principle is as follows:
  1. Define data entities based on a collection (or vector) of numeric variables (which we call features) and a single predictable value (which we call a label). In classification, the label has a value of -1 for False and +1 for True.
  2. Assemble a training set of data that contains many entities with known feature and label values – we call the set of feature values x and the label value y.
  3. Use the training set and a classification or regression algorithm to train a machine learning model to determine a function (which we call f) that operates on x to produce y.
  4. The trained model can now use the function f(x)=y to calculate the label (y) for new entities based on their feature values (x). In classification, if y is positive, the label is True; if y is negative, the label is False. The function can be visualized as a line on a chart, showing the predicted y value for a given value. The predicted values should be close to the actual known values for the training set, as shown in figure 1 below.
  5. You can evaluate the model by applying it to a set of test data with known label (y) values. The accuracy of the model can be validated by comparing the value calculated by f(x) with the known value for y, as shown in figure 2.

Figure 1: A trained model


Figure 2: A validated model

  • In a supervised learning model, the goal is to produce a function that accurately calculates the known label values for the training set, but which is also generalized enough to predict accurately for known values in a test set (and for unknown values in production data). Functions that only work accurately with the training data are referred to as “over-fitted”, and functions that are too general and don’t match the training data closely enough are referred to as “under-fitted”. In general, the best functions are ones that are complex enough to accurately reflect the overall trend of the training data, but which are simple enough to calculate accurate values for unknown labels.

Figure 3: An over-fitted model


Figure 4: An under-fitted model


  • Clustering is an unsupervised learning technique in which machine learning is used to group (or cluster) data entities based on similar features.
  • It is difficult to evaluate clustering models, because there is no training set with known values that you can use to train the model and compare its results.


  • Recommender systems are machine learning solutions that match individuals to items based on the preferences of other similar individuals, or other similar items that the individual is already known to like.
  • Recommendation is one of the most commonly used forms of machine learning.