Comprehensive Notes on Andrew Ng’s Machine Learning Course
Overview
Based on Andrew Ng's Machine Learning course.
C1-Supervised Machine Learning - Regression and Classification
- Regression: fitting a function that predicts a continuous (numeric) output from input variables. E.g. linear regression, polynomial regression.
P25 Feature scaling
Features may have different units and magnitudes; many algorithms perform better when features are on a comparable scale.
- Prevents domination by large-scale features: algorithms like gradient descent, KNN, and clustering can be biased toward features with larger numerical values.
- Faster convergence in optimization.
Common methods of feature scaling:
- Min-max normalization: keeps data within a fixed range (e.g. [0, 1]); it is sensitive to outliers.
- Standardization (Z-score normalization): centers data at mean 0 and scales it to standard deviation 1. Works better than min-max when the data has outliers.
- Robust scaling (using median and IQR), good for datasets with many outliers.
Mean normalization: $x_i:=\dfrac{x_i - \mu_i}{max-min}$
Z-score Normalization $x_i:=\dfrac{x_i - \mu_i}{\sigma_i}$
After Z-score normalization, all features have a mean of 0 and a standard deviation of 1; $\sigma_i$ is the standard deviation of feature $i$.
With scaled features, gradient descent converges to an accurate result much, much faster.
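A minimal NumPy sketch of the two normalization formulas above, applied per feature (column); the example values are made up for illustration:

```python
import numpy as np

X = np.array([[2104, 5], [1416, 3], [852, 2]], dtype=float)  # features with very different scales

# Z-score normalization: (x - mean) / std, computed per feature (column)
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_zscore = (X - mu) / sigma

# Mean normalization: (x - mean) / (max - min), also per feature
X_mean_norm = (X - mu) / (X.max(axis=0) - X.min(axis=0))

print(X_zscore.mean(axis=0))  # approximately 0 for every feature
print(X_zscore.std(axis=0))   # approximately 1 for every feature
```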
P29 Feature Engineering, P30 Multiple linear regression
Feature Engineering: using intuition to design new features by transforming or combining original features, so the machine learning model can capture the data better and perform more accurately.
It turns raw sensory signals into meaningful information that makes patterns easier for a model to learn.
Vocabulary: quadratic function (second-degree), cubic function (third-degree).
When doing feature engineering (e.g. adding polynomial features), feature scaling becomes increasingly important, since engineered features can have very different ranges.
Gradient descent effectively picks the 'correct' features for us by learning a larger parameter for the features that matter.
P31 Motivation
Binary classification, negative/positive class
P32 Logistic regression
Logistic regression is a supervised learning algorithm used for binary classification. Instead of predicting a continuous value like in linear regression, it predicts the probability that a sample belongs to a class.
- Logistic regression assumes that input features have a linear relationship with some underlying score.
- But probabilities must be between 0 and 1, so we pass that score through a sigmoid function.
Vocabulary: tumor; malignant.
The logistic function is the sigmoid function:
$g(z)=\dfrac{1}{1+e^{(-z)}}$
This maps any real number $z$ into the range (0, 1), i.e. it maps values to probabilities.
- This S-shape curve is called the logistic curve, originally used in population growth modeling in the 19th century.
Logistic regression model:
$f_{w,b}(x)=g(w\cdot x+b)=\dfrac{1}{1+e^{-(w\cdot x + b)}}$
$f_{w,b}(x)=P(y=1|x;w,b)$ , That means: Probability that $y$ is 1, given input $x$, parameters $w,b$
Even though logistic regression deals with classification, not continuous values, it is still called regression because it models a continuous probability, then we threshold that probability. So logistic regression is regression in form, classification in purpose.
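A small NumPy sketch of the model above; the weights and input are made up just to show the shape of the computation:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Example parameters (made up for illustration)
w = np.array([1.5, -0.8])
b = -0.3

x = np.array([0.9, 0.2])           # one input example with two features
p = sigmoid(np.dot(w, x) + b)      # P(y = 1 | x; w, b)
y_hat = 1 if p >= 0.5 else 0       # threshold the probability to get a class
print(p, y_hat)
```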
P33 Decision Boundary
A decision boundary is a line, plane, or hypersurface that separates different classes predicted by a classifier.
Decision boundary: $z=w\cdot x + b=0$
Non-linear decision boundaries can be obtained by using polynomial (or other non-linear) features inside $z$.
P34 Cost Function for Logistic Regression
Loss function:
$L(f_{w,b}(x),y) = \begin{cases} -\log(f_{w,b}(x)) & \text{if } y = 1 \\ -\log(1-f_{w,b}(x)) & \text{if } y = 0 \end{cases}$
P35 Simplified Loss function
Loss function measures the error for a single training example:
$L(f_{w,b}(x),y)=-y\log(f_{w,b}(x))-(1-y)\log(1-f_{w,b}(x))$
Cost function measures the average error across the entire training set:
$J(w,b)=\dfrac{1}{m}\sum L(f_{w,b}(x),y)=-\dfrac{1}{m}\sum\left[y\log(f_{w,b}(x))+(1-y)\log(1-f_{w,b}(x))\right]$
Why choose this as the cost function? It is derived from maximum likelihood estimation, a statistical principle for efficiently finding the parameters of a model.
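A hedged NumPy version of the cost function above, vectorized over the whole training set (the clipping is a common numerical safeguard, not part of the lecture; the arrays are toy values):

```python
import numpy as np

def logistic_cost(y, y_hat, eps=1e-12):
    """Average binary cross-entropy over m examples."""
    y_hat = np.clip(y_hat, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y     = np.array([1, 0, 1, 1])
y_hat = np.array([0.9, 0.2, 0.7, 0.4])     # model probabilities f_wb(x)
print(logistic_cost(y, y_hat))
```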
P37 Overfitting
Overfitting happens when a model learns the training data too well, including its noise, outliers, and random fluctuations, instead of just the underlying pattern.
Underfit, high bias; Overfit, high variance
Causes of overfitting:
- Too complex model
- Too few training samples
- Noisy data
- Too many training epochs without regularization
P38 Addressing Overfitting
- More training data, collect more data.
- Regularization (L1, L2 penalties)
- Dropout in neural networks
- Early stopping, stop training before overfitting.
- Cross validation to check generalization
- Simplify the model, reduce depth, features, parameters.
- Feature selection: drop redundant or irrelevant features, using domain knowledge or statistical methods. All features + insufficient data = overfitting; this is the curse of dimensionality.
P39 Regularization
Regularization discourages a model from becoming too complex.
- A model with very large weights $w$ tends to fit the training data too perfectly, including noise.
- Regularization adds a penalty to the cost function, so the model prefers simpler weights (smaller values).
$J(w,b)=\dfrac{1}{2m}\sum(f(x)-y)^2 + \dfrac{\lambda}{2m}\sum w^2, \lambda > 0$
L2 regularization (Ridge):
- Adds the sum of squared weights to the cost function.
- Shrinks weights smoothly, keeps all features but reduces influence.
- Decision boundary becomes smoother.
$J(w,b)=\dfrac{1}{2m}\sum(f(x)-y)^2 + \dfrac{\lambda}{m}\sum |w|, \lambda > 0$
L1 regularization (Lasso)
- Adds the sum of absolute weights to the cost function.
- Some weights shrink to zero, automatic feature selection.
- Creates sparse models.
P40 Regularization for Linear Regression
In linear regression, if we have too many features or correlated features, weights can become very large. Applying regularization prevents overfitting:
- L2: small weights smooth solution.
- L1: sparse weights, feature selection.
- Elastic Net: mix of both
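A small sketch of how the three penalty terms above differ, assuming a weight vector w and regularization strength lambda (the values and the 50/50 Elastic Net mix are illustrative only):

```python
import numpy as np

w = np.array([3.0, -0.5, 0.0, 2.2])
lam, m = 0.1, 100          # regularization strength and number of training examples

l2_penalty = (lam / (2 * m)) * np.sum(w ** 2)        # Ridge: shrinks all weights smoothly
l1_penalty = (lam / (2 * m)) * np.sum(np.abs(w))     # Lasso: can drive some weights to exactly 0
elastic    = 0.5 * l1_penalty + 0.5 * l2_penalty     # Elastic Net: weighted mix of both
print(l2_penalty, l1_penalty, elastic)
```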
C2-Advanced Learning Algorithms
P45 Demand prediction
Inference (prediction); activation; activation values;
$a = f(x) = \dfrac{1}{1+e^{-(w\cdot x+b)}}$
$a^{[1]}$ denotes the output (activation value) of layer 1. input layer is layer 0.
$a_j^{[l]}=g(w_j^{[l]}\cdot a^{[l-1]}+b_j^{[l]})$
$g$ is the activation function
P51
A 2×3 matrix in NumPy: x = np.array([[1,2,3], [4,5,6]]). A matrix is a 2D array.
A vector is a 1D array, e.g. x = np.array([200, 17]); it is just a list of numbers.
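A minimal NumPy sketch of one dense-layer forward pass following the formula $a^{[l]}=g(W a^{[l-1]}+b)$ above; the weights here are random placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def dense(a_prev, W, b, g=sigmoid):
    """Compute a^[l] = g(W a^[l-1] + b) for one layer.
    W has shape (units in this layer, units in previous layer)."""
    return g(W @ a_prev + b)

a0 = np.array([200.0, 17.0])             # input layer (layer 0)
W1 = np.random.randn(3, 2) * 0.1         # layer 1: 3 units, 2 inputs
b1 = np.zeros(3)
a1 = dense(a0, W1, b1)                   # activations of layer 1
print(a1)
```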
P61
In a binary classification problem, the logistic loss function, also known as binary cross-entropy, is used to measure performance:
$ L(y, \hat{y}) = -\left[ y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right] $
- Penalizes confident but wrong predictions heavily.
- Rewards predictions that are close to the true class.
- Ensures smooth and differentiable optimization for training neural networks or logistic regression models.
Compute derivatives for gradient descent using back propagation.
P62 Activation function
The purpose of activation function is to introduce nonlinearity, allowing the network to learn complex relationships between inputs and outputs. Without activation function, the entire neural network would just be a linear function, no matter how many layers it had.
Common activation functions:
Function | Formula | Output Range | Key Characteristics |
---|---|---|---|
Sigmoid (Logistic) | $\sigma(z) = \frac{1}{1 + e^{-z}}$ | (0, 1) | Smooth “S” curve, good for probabilities (used in logistic regression) |
Tanh | $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$ | (-1, 1) | Centered around 0, stronger gradients than sigmoid |
ReLU (Rectified Linear Unit) | $f(z) = \max(0, z)$ | [0, ∞) | Very popular — fast, sparse, reduces vanishing gradients |
Leaky ReLU | $f(z) = \max(0.01z, z)$ | (-∞, ∞) | Fixes ReLU’s “dead neuron” problem |
Softmax | $f_i(z) = \frac{e^{z_i}}{\sum_j e^{z_j}}$ | (0, 1), sum=1 | Converts vector to probability distribution for multi-class classification |
The most commonly used one is ReLU (rectified linear unit): $g(z)=\max(0,z)$.
- In backbones, the typical choice is ReLU or Leaky ReLU; they are simple and computationally cheap (important for real-time inference), and Leaky ReLU avoids dead neurons when inputs are negative.
- In necks, ReLU is used; smooth gradient flow is critical for training stability, especially during feature fusion.
- In the detection head:
- For multi-label classification, use Sigmoid. Each class is treated independently, can detect overlapping categories.
- For single label classification, use Softmax, ensures probabilities sum to 1, only one class per proposal.
P63
In classification problems, the sigmoid function (binary) or softmax (multi-class) is a natural choice for the output layer since they produce probabilities.
In regression, a linear activation is used for unrestricted outputs, or ReLU/Softplus if outputs must be non-negative.
For hidden layers, ReLU (or its variants) is most commonly used today because it’s fast, simple, and avoids vanishing gradients, while sigmoid is rarely used due to its two flat regions that slow down gradient descent.
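The common activation functions from the table above, written as one small NumPy sketch:

```python
import numpy as np

def sigmoid(z):    return 1 / (1 + np.exp(-z))
def tanh(z):       return np.tanh(z)
def relu(z):       return np.maximum(0, z)
def leaky_relu(z): return np.maximum(0.01 * z, z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
for g in (sigmoid, tanh, relu, leaky_relu):
    print(g.__name__, g(z))
```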
P66 Multiclass, Softmax
Softmax is used for multiclass classification problems.
Softmax regression, N possible outputs
$z_j = w_j \cdot x + b_j, j=1,...,N$
Activation function: $a_j=\dfrac{e^{z_j}}{\sum_{k=1}^{N}e^{z_k}}=\mathbf{P}(y=j|x)$
$a_1 + ... + a_N=1$
If softmax is used for binary classification ($N=2$), it is equivalent to the sigmoid activation.
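A NumPy softmax matching the formula above; subtracting the max is a standard numerical-stability trick, not part of the formula itself:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)              # improves numerical stability, does not change the result
    e = np.exp(z)
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])
a = softmax(z)
print(a, a.sum())                  # probabilities that sum to 1
```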
P67, P68
For the digit recognition problem, use softmax as the output layer, with 10 neurons forming the output layer.
The loss function name in TF is SparseCategoricalCrossentropy.
In TF, setting from_logits=True makes the computation numerically more accurate (less round-off error) with softmax: the output layer stays linear, and softmax is applied inside the loss.
model.compile(loss=SparseCategoricalCrossentropy(from_logits=True))
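A hedged Keras sketch of this setup: a linear output layer producing logits, with softmax applied inside the loss via from_logits=True (the hidden-layer sizes are illustrative):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation="relu"),
    tf.keras.layers.Dense(15, activation="relu"),
    tf.keras.layers.Dense(10)   # linear output: raw logits, no softmax here
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
# After training, apply softmax explicitly to turn logits into probabilities:
# probs = tf.nn.softmax(model(X_new))
```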
P69 Multi-label Classification
Use the sigmoid activation function for the output layer, with one unit per label, so each label gets an independent probability.
P79 Advanced Optimization Algorithm (Adam)
Adam (Adaptive Moment Estimation) is one of the most popular optimization algorithms. It automatically adjusts the learning rate $\alpha$; there is not just one $\alpha$, but effectively a separate learning rate for each parameter.
It combines the advantages of two other methods:
- AdaGrad: adapts learning rates for each parameter individually
- RMSProp: smooths the learning rate using a moving average of squared gradients.
So Adam is essentially adaptive learning rates + momentum. It keeps track of both the average gradient and the average squared gradient, making training efficient and robust.
In gradient descent, updates are simple: $\theta_t = \theta_{t-1} - \alpha \nabla_{\theta} J(\theta)$, but this can be slow or unstable because:
- Some parameters may need smaller steps (steep gradients)
- Others need larger steps (flat regions)
- Gradients may oscillate in certain directions
Adam automatically adjusts the learning rate for each parameter based on gradient history, making training faster and more stable.
If a parameter $w_j$ or $b$ keeps moving in the same direction, increase its learning rate; if it keeps oscillating, reduce its learning rate.
In code, select Adam as the optimizer in model.compile(); it takes a default initial learning rate.
It typically converges faster and more reliably than plain gradient descent.
Algorithm Steps:
- For parameter $\theta$, compute the gradient: $g_t = \nabla_\theta J_t(\theta_{t-1})$
- Update moment estimates:
- First moment (mean of gradients): $m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$
- Second moment (uncentered variance): $v_t = \beta_2 v_{t-1} + (1-\beta_2)g^2_t$
- Bias correction (for initialization bias):
- $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$
- $\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$
- Parameter update: $\theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$
Symbol | Meaning | Default |
---|---|---|
$\alpha$ | Learning rate | 0.001 |
$\beta_1$ | Decay for first moment | 0.9 |
$\beta_2$ | Decay for second moment | 0.999 |
$\epsilon$ | Small constant for stability | 1e-8 |
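A minimal NumPy sketch of the update steps above for a single parameter vector; the toy quadratic cost and the larger learning rate are only for illustration:

```python
import numpy as np

def adam_step(theta, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; g is the gradient of the cost w.r.t. theta at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g          # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * g ** 2     # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.zeros(2)
m = v = np.zeros(2)
for t in range(1, 501):
    g = 2 * (theta - np.array([3.0, -1.0]))       # gradient of a toy quadratic cost
    theta, m, v = adam_step(theta, g, m, v, t, alpha=0.1)
print(theta)                                      # converges toward the minimum [3, -1]
```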
Pros:
- Fast convergence
- Works well in practice
- Handles sparse gradients
- Combines benefits of Momentum and RMSProp
Cons:
- Can overfit if learning rate isn't tuned
- Sometimes converges to slightly worse minima than SGD with momentum
- Requires more memory
Algorithm | Type | What it does |
---|---|---|
SGD (Stochastic Gradient Descent) | Optimization algorithm | Updates weights in the opposite direction of gradient using random mini-batches |
Adam (Adaptive Moment Estimation) | Optimization algorithm | Uses momentum + adaptive learning rate for each parameter |
RMSProp, AdaGrad, AdamW, AdamP, etc. | Optimization variants | Improve convergence speed, stability, or generalization |
P71 Convolutional Layer
A convolutional layer is a special type of neural network layer designed to automatically learn spatial patterns in data. Instead of connecting every neuron to every input pixel like a fully connected layer, a convolutional layer only looks at a small local region (receptive field) and slides that filter across the entire image.
- Faster computation
- Need less training data, less prone to overfitting.
- Learns local spatial features
- Weight sharing, same filter slides over the whole image, fewer parameters than fully connected layers.
- Translation invariance: detects the same pattern anywhere in the image.
Key Hyperparameters
Parameter | Meaning | Example |
---|---|---|
Filter size (kernel) | size of sliding window | 3×3, 5×5 |
Stride | how far the filter moves each step | 1, 2 |
Padding | add zeros around input borders to preserve size | "same" or "valid" |
Number of filters | number of output feature maps | 16, 32, 64, ... |
Structure of a Conv Layer: input layer -> Conv2D -> activation -> Pooling -> Next layer
- Conv2D: performs convolution (learns filters)
- Activation (e.g., ReLU): introduces non-linearity
- Pooling: reduces spatial size (downsampling)
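A hedged Keras sketch of the Conv2D → activation → pooling structure above; the input shape and filter counts are illustrative:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),                     # e.g. a small grayscale image
    tf.keras.layers.Conv2D(32, kernel_size=3, strides=1,
                           padding="same", activation="relu"),    # 32 filters of size 3x3
    tf.keras.layers.MaxPooling2D(pool_size=2),                    # downsample spatial size by 2
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10)                                     # output logits for 10 classes
])
model.summary()
```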
P75 Diagnostic
Diagnostic: A test that you run to gain insight into what is/isn't working with a learning algorithm, to gain guidance into improving its performance.
Common Diagnostics Examples:
Diagnostic | What It Tells You | How It Helps |
---|---|---|
Training vs. validation error | Whether your model has high bias or high variance | Guides whether to add data, regularization, or model capacity |
Learning curve | How error changes with more training data | Shows if adding data would help |
Error analysis (confusion matrix, misclassified samples) | Which classes or examples cause mistakes | Helps in data cleaning or model fine-tuning |
Gradient checking | Whether your implementation of backprop is correct | Detects bugs in training code |
Training loss vs. time | Whether learning is progressing or diverging | Helps tune learning rate or optimizer |
Feature importance / ablation test | Which features affect predictions most | Guides feature engineering or pruning |
P77 Model Selection
Training set, cross validation data, test set.
Cross-validation (CV) is a model evaluation technique used to test how well your machine learning model generalizes to unseen data.
Evaluate a model using cross validation data during training, and use testing set to estimate generalization error.
- Cross validation reduces randomness
- Gives a more robust estimate of the model's expected performance on new data.
$k$-Fold Cross-Validation:
- Split dataset into $k$ equal parts
- For each fold $i$:
- Train on $k-1$ folds
- Test on the remaining fold
- Compute the performance (accuracy, F1, loss, etc.) for each fold
- Average the results and get overall performance metric.
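A hedged scikit-learn sketch of the k-fold procedure above (scikit-learn is not used elsewhere in these notes; the dataset is random and only illustrates the mechanics):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # toy data: 100 examples, 5 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # toy binary labels

model = LogisticRegression()
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)   # accuracy on each of the 5 held-out folds
print(scores, scores.mean())
```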
Variants of Cross-Validation
Type | Description | Use Case |
---|---|---|
k-Fold | Split into k equal parts | Most common (e.g., k = 5 or 10) |
Stratified k-Fold | Keeps class ratios the same in each fold | Classification with imbalanced classes |
Leave-One-Out (LOOCV) | Each data point is a fold (k = N) | Very small datasets |
Time-Series CV | Uses only past data to predict future data | Sequential/time-dependent data |
P78 Bias/Variance
$\text{Expected Error}=\text{Bias}^2+\text{Variance}+\text{Irreducible Noise}$
- Bias is error from wrong assumptions (underfitting)
- Variance is error from sensitivity to training data (overfitting)
High bias (underfit): $J_{train}$ is high; $J_{cv}$ is high.
High variance (overfit): $J_{train}$ is low; $J_{cv}$ is high.
Bias–Variance Tradeoff:
Model Behavior | Bias | Variance | Result |
---|---|---|---|
Underfitting | High | Low | Model too simple (misses patterns) |
Good fit | Low | Low | Balanced — best generalization |
Overfitting | Low | High | Model too complex (memorizes noise) |
P80 Establishing a Baseline
- High bias model (underfitting)
- Model is too simple, cannot capture the true pattern.
- Adding more data usually doesn't help much, because the model lacks the capacity to learn the pattern no matter how much data you feed it.
- Use more complex model or add better features
- High variance model (overfitting)
- Model is too complex, it fits the training data too well and generalizes poorly.
- Adding more data often helps, because it gives the model more examples and reduces overfitting by averaging out noise.
- More data, stronger regularization, or a simpler model.
P82 Variance and Bias
- Get more training examples - fixes high variance
- Try smaller sets of features - fixes high variance
- Reduce flexibility of the model
- Try getting additional features - fixes high bias
- Try adding polynomial features - fixes high bias
- Try decreasing lambda - fixes high bias
- Try increasing lambda - fixes high variance
Lambda $\lambda$ controls how strongly the model penalizes large weights, in regularization terms:
- L2: $J(\theta) = \text{Loss} + \lambda \sum_{i} \theta_{i}^{2}$
- L1: $J(\theta) = \text{Loss} + \lambda \sum_{i} |\theta_{i}|$
P83 Trade Off
Simple model → high bias; complex model → high variance.
Tradeoff between high bias and high variance.
Large neural networks are low-bias machines: they can fit very complicated functions well, so when training a neural network that is large enough, we usually face problems other than high bias.
- Does it do well on the training set? If not, use a bigger network.
- Does it do well on the cross-validation set? If not, use more training data.
- A large neural network will usually do as well as or better than a smaller one, so long as regularization is chosen appropriately.
P84 ML Development
Interactive loop of ML development
Design → Train → Diagnose → Improve → Repeat
- Choose architecture (model, data, etc.)
- Train model. Monitor loss curves, convergence, training time, etc.
- Diagnostics (bias, variance, error analysis)
- Choose architecture and so on.
Text classification example. Features: use the top 10,000 words to compute the input variables.
Logistic model, or neural network model.
P85 Error analysis
Group the misclassified examples in the cross-validation set based on common traits/features.
These groups are not mutually exclusive.
If the dataset is large, randomly sample from the misclassified examples for analysis.
How to try to reduce your spam classifier's error?
- Collect more data
- Develop sophisticated features based on email routing, from email header
- Define sophisticated features from the email body, e.g. treating certain word variants as the same word.
- Design algorithms to detect misspellings.
P89 Data Augmentation
Data augmentation: Modifying an existing training example to create a new training example.
Examples:
Image recognition: distort an image, e.g. mirror, rotate, enlarge, shrink, change contrast.
Speech recognition: add different noisy backgrounds, simulate a bad cellphone connection.
It usually does not help to add purely random/meaningless noise to your data; the distortions should be representative of the type of noise/distortions in the test set.
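A tiny NumPy sketch of image-style augmentation from one training example; the transforms and values are purely illustrative:

```python
import numpy as np

img = np.arange(16, dtype=float).reshape(4, 4)   # stand-in for a small grayscale image

mirrored  = np.fliplr(img)                       # horizontal mirror
rotated   = np.rot90(img)                        # 90-degree rotation
brighter  = np.clip(img * 1.2, 0, 255)           # brightness/contrast change
augmented = [mirrored, rotated, brighter]        # each becomes a new labeled training example
print(len(augmented))
```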
Data synthesis
Photo OCR example, generate synthesis data to train.
Conventional model-centric approach: AI = Code + Data; Works more on Code (algorithm/model)
Data-centric approach: AI = Code + Data; Focus on Data engineering.
P90 Transfer learning
Remove the last (output) layer of the original pretrained model, and replace it with the layer needed for the new task.
Reuse the parameters of the earlier layers from the original model.
- Option 1: train only the output layer's parameters.
- Option 2: train all parameters, with the earlier layers initialized from the original model.
Two steps: Supervised pretraining, Fine tuning.
Convolutional NN: the first layer detects edges, later layers detect corners, then curves/basic shapes.
Summary: 1. Download NN parameters pretrained. 2. Further train (fine tune) the network on your own data.
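A hedged Keras transfer-learning sketch; the pretrained base (MobileNetV2), input size, and the 5-class downstream task are assumptions for illustration:

```python
import tensorflow as tf

# Pretrained base without its original output layer
base = tf.keras.applications.MobileNetV2(include_top=False, pooling="avg",
                                         input_shape=(160, 160, 3), weights="imagenet")
base.trainable = False  # Option 1: freeze pretrained layers, train only the new output layer

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(5)  # new output layer for the new task (logits)
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
# Option 2 (fine-tuning): set base.trainable = True and recompile with a smaller learning rate.
```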
P88 MLOps
Define project: Scope the project.
Collect data: define and collect data.
Train model: Training, error analysis, iterative improvement.
Deploy in production: Deploy, monitor and maintain system.
Deploy the ML model to an inference server; a mobile app makes API calls to the server, and the server returns inference results to the app.
Software engineering may be needed for:
- Ensure reliable and efficient predictions
- Scaling
- Logging
- System monitoring
- Model updates
MLOps: ML operations: Practice how to systematically build and deploy, maintain ML model.
Stages:
- Model Development: Data collection, preprocessing, feature engineering, model training, validation
- Model Deployment: Packaging the model e.g. Docker, ONNX, deploying it to production servers or edge devices (API, ROS node, etc.)
- Monitoring & Maintenance: Tracking model performance, detecting data drift, retraining with new data, version control and roll back.
P89 Ethics: fairness, bias, and other ethical considerations.
P90 Skewed Datasets
In skewed or imbalanced datasets, the distribution of classes is uneven: some classes have many more samples than others.
- Model bias: Model learns to predict the majority class most of the time because it minimizes overall error.
- Misleading accuracy: Accuracy can look high even if the model ignores minority cases.
- Poor generalization: fails in critical minority situations, e.g. fraud detection, medical diagnosis.
How to handle skewed datasets?
- Data-level approaches:
- Oversampling minority class
- Undersampling majority class
- Data augmentation
- Algorithm-level approaches:
- Class weighting: Give higher penalty to misclassifying the minority class.
- Custom loss function: use weighted cross-entropy, focal loss, etc.
- Evaluation-level approaches:
- Use metrics that reflect imbalance better: precision, recall, f1-score, confusion matrix.
We cannot identify the best model based on accuracy alone, because the dataset might be skewed.
The confusion matrix with precision/recall is a better set of metrics:
- Precision: $TP / \text{predicted positive} = TP/(TP + FP)$
- Recall: $TP / \text{actual positive} = TP/(TP + FN)$
Metric | Meaning |
---|---|
Precision | Of all items predicted as positive, how many are actually correct |
Recall (Sensitivity) | Of all actual positive items, how many did we correctly identify |
P91 Trading off Precision and Recall ⭐
- Raising the logistic regression threshold leads to higher precision, lower recall.
- Lowering the threshold results in lower precision, higher recall.
Threshold | Precision | Recall | Meaning |
---|---|---|---|
Low (0.3) | ↓ Low | ↑ High | Catch more positives (many false alarms) |
High (0.9) | ↑ High | ↓ Low | Only predict positives when very sure |
F1 score: $F_1 = \dfrac{1}{\frac{1}{2}\left(\frac{1}{P}+\frac{1}{R}\right)} = \dfrac{2PR}{P+R}$; the harmonic mean of precision and recall.
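A small NumPy sketch computing precision, recall, and F1 from predicted and true labels; the arrays are toy values for illustration:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives

precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```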
P91 Decision Trees
Cat classification example
Input are categorical values (discrete)
Nodes: the topmost node is the root node; decision nodes are in the middle of the tree; the bottom nodes are leaf nodes, which make the prediction.
P92
Decision 1: how to choose what feature to split on at each node?
Maximize purity (or minimize impurity). (Use the most important feature)
Decision 2: When do you stop splitting?
- When a node is 100% one class
- When splitting a node will result in the tree exceeding a maximum depth.
- When improvements in purity score are below a threshold.
- When number of examples in a node is below a threshold.
P93 Measuring purity
Entropy as a measure of impurity
Entropy ranges from 0 to 1 (for two classes); the lower the entropy, the higher the purity.
$p_0 = 1 - p_1$
$H(p_1)=-p_1\log_2(p_1)-p_0\log_2(p_0)$
Note: $0\log(0) = 0$
With log base 2, the peak of the function is 1 (at $p_1 = 0.5$).
P94: Decision Tree learning, choose a split: information gain
Choose the feature whose split gives the lowest weighted average entropy; equivalently, the one with the highest information gain.
Information gain = entropy at the root node minus the weighted average entropy of the branches after the split.
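A short NumPy sketch of the entropy formula and the split criterion above; the example split at the end is made up:

```python
import numpy as np

def entropy(p1):
    """H(p1) for a binary node; 0*log(0) is treated as 0."""
    if p1 == 0 or p1 == 1:
        return 0.0
    p0 = 1 - p1
    return -p1 * np.log2(p1) - p0 * np.log2(p0)

def information_gain(p1_root, p1_left, p1_right, w_left):
    """Root entropy minus the weighted average entropy of the two branches.
    w_left is the fraction of examples that go to the left branch."""
    w_right = 1 - w_left
    return entropy(p1_root) - (w_left * entropy(p1_left) + w_right * entropy(p1_right))

# Example: 10 examples (5 cats) split into a left branch of 4 (3 cats) and a right branch of 6 (2 cats)
print(information_gain(p1_root=0.5, p1_left=3/4, p1_right=2/6, w_left=0.4))
```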
P96: Putting it together
- Start with all examples at the root node
- Calculate information gain for all possible features, and pick the one with the highest information gain
- Split dataset according to selected feature, and create left and right branches of the tree
- Keep repeating splitting process until stopping criteria is met
- When a node is 100% one class
- When splitting a node will result in the tree exceeding a maximum depth
- Information gain from additional splits is less than threshold
- When number of examples in a node is below a threshold.
P97: Using one-hot encoding of categorical features
If a categorical feature can take on k values, create k binary features.
P98: Continuous features
P99: Regression tree
Choose the feature whose split leaves the data with the lowest weighted variance.
Calculate the weighted variance of a feature to split on.
Use variance reduction as a measurement. Use the feature with largest variance reduction to split data.
P100: Tree ensembles
Trees are highly sensitive to small changes of the data. Using multiple trees and vote for the final result makes prediction more robust.
P101: Sampling with replacement
To construct a new training set that is similar to, but also noticeably different from, the original training set.
P102: Random forest algorithm
A powerful tree-ensemble algorithm (bagged decision trees).
Given training set of size $m$
For $b=1$ to $B$:
Use sampling with replacement to create a new training set of size $m$; Train a decision tree on the new dataset.
Randomizing the feature choice: at each node, when choosing a feature to split on, if $n$ features are available, pick a random subset of $k<n$ features and allow the algorithm to choose only from that subset (e.g. $k=\sqrt{n}$).
P103: XGBoost decision tree
Vocabulary: ensemble /ˌɑːnˈsɑːm.bəl/
It runs quickly, and open-source implementations are easy to use.
Given training set of size $m$
For $b=1$ to $B$:
Use sampling with replacement to create a new training set of size $m$ But instead of picking from all examples with equal (1/m) probability, make it more likely to pick examples that the previously trained trees misclassify.
Train a decision tree on the new dataset.
XGBoost (extreme gradient boosting)
- Open source implementation of boosted trees
- Fast efficient implementation
- Good choice of default splitting criteria and criteria for when to stop splitting
- Built in regularization to prevent overfitting
- Highly competitive algorithm for machine learning competitions e.g. Kaggle
Sample for classification:

    from xgboost import XGBClassifier

    model = XGBClassifier()
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
Regression:

    from xgboost import XGBRegressor

    model = XGBRegressor()
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
P104 Decision Trees vs Neural Networks
Decision Trees and Tree ensembles
- Work well on tabular (structured) data
- Not recommended for unstructured data, images, audio, text
- Fast
- Small decision trees may be human interpretable
Neural network
- Works well on all types of data, including tabular and unstructured data.
- May be slower than a decision tree
- Works with transfer learning
- When building a system of multiple models working together, it might be easier to string together multiple neural networks.
C3-Unsupervised Learning
P107 Clustering
Applications: Grouping similar news, Market segmentation, DNA analysis, Astronomical data analysis.
P108 K-means clustering
Cluster centroid.
Repeat until converged:
Step 1: Assign each point to its closest centroid.
Step 2: Recompute the centroids.
P109 K-means algorithms
Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, \dots, \mu_K$.
Repeat{
// Assign points to cluster centroids
for $i=1$ to $m$ training examples
$c^{(i)}:=$index (from 1 to K) of cluster centroid closest to $x^{(i)}$
// Move cluster centroids
for $k=1$ to K,
$\mu_k:=$ average of points assigned to cluster $k$
}
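A compact NumPy sketch of the two repeated steps above; the data is random, and empty clusters are not handled (a real implementation would need to, e.g. by re-initializing the centroid):

```python
import numpy as np

def kmeans(X, K, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]   # init: pick K random training examples
    for _ in range(iters):
        # Step 1: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        c = np.argmin(dists, axis=1)
        # Step 2: move each centroid to the mean of its assigned points
        centroids = np.array([X[c == k].mean(axis=0) for k in range(K)])
    return centroids, c

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])  # two toy blobs
centroids, assignments = kmeans(X, K=2)
print(centroids)
```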
P110 Clustering, Optimization objective
$c^{(i)}=$ index of the cluster (from 1 to $K$) to which example $x^{(i)}$ is currently assigned
$\mu_k=$ cluster centroid $k$
$\mu_{c^{(i)}}=$ cluster centroid of the cluster to which example $x^{(i)}$ has been assigned
Cost function (distortion function):
$$J(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K) = \frac{1}{m}\sum_{i=1}^{m}\lVert x^{(i)}-\mu_{c^{(i)}}\rVert^2$$
P111 Initializing K-means
Random initialization:
Choose $K<m$: the number of clusters should be smaller than the number of training examples.
Randomly pick K training examples, set centroids equal to these K examples.
To avoid local optima, can run K-means multiple times, and pick the clustering result that gave the lowest cost.
P112 Choosing the number of clusters
Elbow method: try different values of $K$, plot the cost function value, and pick the point that looks like the elbow of the curve.
More common method: Evaluate K-means based on a metric for how well it performs for that later purpose.
P113 Anomaly detection
Density estimation
Example:
Fraud detection: model features of users' activities from data, then identify unusual users by checking which ones have low probability.
P114 Gaussian Distribution
$x$ is distributed Gaussian with mean $\mu$ and variance $\sigma^2$; $\sigma$ is the standard deviation.
$p(x)=\frac{1}{\sqrt{2\pi}\sigma} e^{\frac{-(x-\mu)^2}{2\sigma^2}}$
$\mu=\frac{1}{m}\sum x_i$
$\sigma^2=\frac{1}{m}\sum(x_i-\mu)^2$
P115 Algorithm
Training set: $\{x^{(1)},x^{(2)},\dots,x^{(m)}\}$. Each example $x^{(i)}$ has $n$ features.
$p(x)=\prod_{j=1}^{n} p(x_j;\mu_j,\sigma_j^2)$
- Choose $n$ features $x_j$ that you think might be indicative of anomalous examples.
- Fit parameters $\mu_1,\dots,\mu_n,\sigma_1^2,\dots,\sigma_n^2$:
  $\mu_j=\frac{1}{m}\sum_i x_j^{(i)}$
  $\sigma_j^2=\frac{1}{m}\sum_i(x_j^{(i)}-\mu_j)^2$
- Given a new example $x$, compute
  $p(x)=\prod_{j=1}^n p(x_j;\mu_j,\sigma_j^2)$
  and flag an anomaly if $p(x)<\epsilon$.
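A NumPy sketch of the algorithm above: fit per-feature Gaussians on (assumed normal) training data, then flag new examples whose density falls below epsilon (the data and the epsilon value are arbitrary; epsilon would be tuned on the cross-validation set):

```python
import numpy as np

def fit_gaussian(X):
    mu = X.mean(axis=0)
    var = X.var(axis=0)                       # sigma_j^2 per feature
    return mu, var

def density(x, mu, var):
    """p(x) = product over features of the 1-D Gaussian densities."""
    return np.prod(np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var))

X_train = np.random.randn(1000, 2)            # assumed-normal training examples
mu, var = fit_gaussian(X_train)

epsilon = 1e-3                                # arbitrary here; tune on the CV set
x_new = np.array([4.0, -4.5])
print("anomaly" if density(x_new, mu, var) < epsilon else "normal")
```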
P116 Developing and evaluating an anomaly detection system
Have some labeled data.
The training set is an unlabeled dataset; assume the examples are normal (it is fine if a few anomalous examples slip in).
Cross validation set.
Test set.
Cross validation and test sets include a few anomalous examples.
Aircraft engines monitoring example
10000 good engines, 20 flawed engines
Training set: 6000 good engines.
CV: 2000 good engines, 10 anomalous
Test: 2000 good engines, 10 anomalous
Train the algorithm on the training set, verify the anomaly detection performance on the CV set, and tune $\epsilon$ on the CV set. Report the final result on the test set.
Alternative:
Training set: 6000 good engines; CV: 4000 good engines, 20 anomalous; No test set.
Anomaly Detection VS. Supervised Learning
Anomaly detection
- Very small number of positive examples (0-20 is common); large number of negative examples.
- Many different types of anomalies: it is hard for any algorithm to learn from positive examples what the anomalies look like, and future anomalies may look nothing like any of the anomalous examples seen so far.
- Fraud detection.
- Manufacturing: finding new, previously unseen defects.
- Monitoring machines in a data center.
Supervised learning
- Large number of positive and negative examples.
- Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to the ones in the training set.
- Email spam classification.
- Manufacturing: finding known, previously seen defects.
- Weather prediction.
- Disease classification.
P118 Choosing what features to use
Non-Gaussian features:
Transform the feature so its distribution looks more Gaussian, e.g. log, square, square root, etc.
Error analysis for anomaly detection
Choose features that might take on unusually large or small values in the event of an anomaly.
P120 Recommender Systems