### Data Science and Machine Learning Essentials – Module 1 – Classification

#### by Fuyang

**Classification and Loss Function**

- For a classification function **f(x)** to work accurately when operating on the data in the training and test datasets, the number of times that **y** does not equal the sign of **f(x)** must be minimized. In other words, for a single entity, if **y** is positive, **f(x)** should be positive; and if **f(x)** is negative, **y** should be negative. Formally, to let the classification function work accurately, we need to minimize cases where **y ≠ sign(f(x))**.
- Because same-signed numbers multiplied together always produce a positive, and numbers of different signs multiplied together always produce a negative, we can simplify our goal to minimizing cases where **y·f(x) < 0** for a single entity, or **∑_{i} [y_{i}f(x_{i}) < 0]** for the whole data set. This general approach is known as a **loss function**.
- As with regression algorithms, some classification algorithms add a regularization term to avoid over-fitting so that the function achieves a balance of accuracy and simplicity (Occam's Razor again).
- Each classification algorithm (for example AdaBoost, Support Vector Machines, and Logistic Regression) uses a specific loss function implementation, and it’s this that distinguishes classification algorithms from one another.
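The loss being minimized above can be sketched in a few lines of NumPy. The labels and scores here are invented for illustration; the point is that an entity is misclassified exactly when **y·f(x) < 0**:

```python
import numpy as np

# Hypothetical labels y in {-1, +1} and raw classifier scores f(x).
y = np.array([1, -1, 1, 1, -1])
fx = np.array([0.8, -0.3, -0.2, 1.5, 0.4])

# An entity is misclassified exactly when y * f(x) < 0
# (y and f(x) have different signs).
misclassified = y * fx < 0
zero_one_loss = int(np.sum(misclassified))

print(zero_one_loss)  # number of cases where y != sign(f(x)) -> 2
```

Different algorithms replace this hard 0/1 count with a smooth surrogate (exponential loss for AdaBoost, hinge loss for SVMs, log loss for logistic regression), which is what makes the loss differentiable and optimizable.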

**Decision Trees and Multi-Class Classification**

**Decision Trees** are classification algorithms that define a sequence of branches. At each branch intersection, the feature value (**x**) is compared to a specific function, and the result determines which branch the algorithm follows. All branches eventually lead to a predicted value (-1 or +1). Most decision tree algorithms have been around for a while, and many produce low accuracy. However, boosted decision trees (AdaBoost applied to a decision tree) can be very effective.

- You can use a "one vs. all" technique to extend binary classification (which predicts a Boolean value) so that it can be used in **multi-class classification**. This approach involves applying multiple binary classifications (for example, "is this a chair?", "is this a bird?", and so on) and reviewing the results produced by **f(x)** for each test. Since **f(x)** produces a numeric result, the predicted value is a measure of confidence in the prediction (so, for example, a high positive result for "is this a chair?" combined with a low positive result for "is this a bird?" and a high negative result for "is this an elephant?" indicates a high degree of confidence that the object is more likely to be a chair than a bird, and very unlikely to be an elephant).
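A minimal "one vs. all" sketch in plain Python (the class names and confidence scores are invented to mirror the chair/bird/elephant example): each binary classifier produces an **f(x)** score, and the class whose classifier is most confident wins.

```python
# Hypothetical confidence scores f(x) from three binary classifiers,
# one per class ("is this a chair?", "is this a bird?", ...).
scores = {
    "chair": 2.7,      # high positive: strong confidence it is a chair
    "bird": 0.3,       # low positive: weak evidence for bird
    "elephant": -3.1,  # high negative: very unlikely to be an elephant
}

# One-vs-all prediction: pick the class whose classifier is most confident.
predicted_class = max(scores, key=scores.get)
print(predicted_class)  # -> chair
```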

**Imbalanced Data**

- When the training **data is imbalanced** (so a high proportion of the data has the same True/False value for **y**), the accuracy of the classification algorithm can be compromised. To overcome this problem, you can "over-sample" or "weight" data with the minority **y** value to balance the algorithm.
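The over-sampling idea can be sketched with NumPy (the tiny dataset is invented): rows with the minority **y** value are re-drawn with replacement until both classes occur equally often.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced labels: 6 negatives, 2 positives.
y = np.array([-1, -1, -1, -1, -1, -1, 1, 1])
X = np.arange(len(y)).reshape(-1, 1)  # stand-in feature matrix

# Identify the minority class and how many extra rows it needs.
minority = 1 if np.sum(y == 1) < np.sum(y == -1) else -1
minority_idx = np.where(y == minority)[0]
deficit = abs(int(np.sum(y == 1)) - int(np.sum(y == -1)))

# Over-sample: draw the missing minority rows with replacement.
extra = rng.choice(minority_idx, size=deficit, replace=True)
X_balanced = np.vstack([X, X[extra]])
y_balanced = np.concatenate([y, y[extra]])

print(int(np.sum(y_balanced == 1)), int(np.sum(y_balanced == -1)))  # -> 6 6
```

Weighting achieves the same effect without duplicating rows: each minority example is given a proportionally larger weight in the loss function.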

**ROC – Receiver Operator Characteristic**

- The quality of a classification model can be assessed by plotting the *True Positive Rate* (true positives / actual positives) against the *False Positive Rate* (false positives / actual negatives) for various threshold values on a chart to create a **receiver operator characteristic (ROC) curve**. The quality of the model is reflected in the **area under the curve (AUC)**. The larger this area is than the area under a straight diagonal line (representing the 50% accuracy rate that can be achieved purely by guessing), the better the model.
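The ROC construction can be sketched as follows (labels and scores are invented): the decision threshold is swept over every score, a (FPR, TPR) point is recorded at each step, and the area under the resulting curve is computed with the trapezoidal rule.

```python
import numpy as np

# Hypothetical labels (1 = positive, 0 = negative) and classifier scores.
y = np.array([1, 1, 0, 1, 0, 0])
scores = np.array([0.9, 0.7, 0.6, 0.4, 0.3, 0.1])

# Sweep the decision threshold from "predict nothing" down to "predict all".
thresholds = np.concatenate([[np.inf], np.sort(scores)[::-1]])
tpr, fpr = [], []
for t in thresholds:
    pred_pos = scores >= t
    tpr.append(np.sum(pred_pos & (y == 1)) / np.sum(y == 1))  # TP / actual positives
    fpr.append(np.sum(pred_pos & (y == 0)) / np.sum(y == 0))  # FP / actual negatives

# Area under the ROC curve via the trapezoidal rule.
auc = sum((fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2
          for i in range(len(fpr) - 1))
print(round(auc, 3))  # -> 0.889 (well above the 0.5 of random guessing)
```

A perfect classifier would hug the top-left corner (AUC = 1.0); a coin flip traces the diagonal (AUC = 0.5).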