Notes on Andrew Ng's Machine Learning Course

Overview

Based on Andrew Ng's Machine Learning course (work in progress).

C1-Supervised Machine Learning - Regression and Classification

P25 Feature scaling

Mean normalization: $x_i:=\dfrac{x_i - \mu_i}{\max - \min}$

z-score normalization: $x_i:=\dfrac{x_i - \mu_i}{\sigma_i}$, where $\sigma_i$ is the standard deviation of feature $i$.

After z-score normalization, all features have a mean of 0 and a standard deviation of 1.

With scaled features, gradient descent converges much, much faster.
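A minimal NumPy sketch of z-score normalization (the function and variable names are my own, not from the course):

```python
import numpy as np

def zscore_normalize(X):
    """Scale each feature (column) of X to mean 0 and standard deviation 1."""
    mu = X.mean(axis=0)       # per-feature mean
    sigma = X.std(axis=0)     # per-feature standard deviation
    return (X - mu) / sigma, mu, sigma

X = np.array([[2104.0, 5.0], [1416.0, 3.0], [852.0, 2.0]])
X_norm, mu, sigma = zscore_normalize(X)   # reuse mu, sigma to scale new inputs
```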

P29 Feature engineering
P30 Multiple linear regression

Feature engineering: Using intuition to design new features, by transforming or combining original features.

Vocabulary: quadratic function; cubic function.

When doing feature engineering, feature scaling becomes increasingly important.

Gradient descent effectively picks the 'correct' features for us by growing the parameters associated with the useful ones.
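For example, a hedged sketch of engineering polynomial features from a single input $x$ (the names are illustrative):

```python
import numpy as np

x = np.arange(1.0, 6.0)           # original feature: 1..5
X_poly = np.c_[x, x**2, x**3]     # engineered features: x, x^2, x^3
# x^3 spans a much wider range than x, so feature scaling matters here.
```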

P31 Motivation

Binary classification: negative/positive class.

P32 Logistic regression

Logistic regression is used for classification rather than regression, even though its name contains the word "regression".

Vocabulary: tumor; malignant.

The logistic function is also known as the sigmoid function:

$g(z)=\dfrac{1}{1+e^{-z}}$, output between 0 and 1

Logistic regression model: $f_{w,b}(x)=g(w\cdot x+b)=\dfrac{1}{1+e^{-(w\cdot x + b)}}$

$f_{w,b}(x)=P(y=1|x;w,b)$ means: Probability that y is 1, given input x, parameters w,b
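A minimal NumPy sketch of the model (`sigmoid` and `predict_proba` are my own names):

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z})"""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    """f_wb(x) = g(w . x + b): probability that y = 1 given x."""
    return sigmoid(np.dot(w, x) + b)
```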

P33 Decision Boundary

Decision boundary: $z=w\cdot x + b=0$

Non-linear decision boundaries are possible when using polynomial features.

P34 Cost function for logistic regression

Loss $L(f_{w,b}(x),y)$:

$L(f_{w,b}(x),y)=\begin{cases}-\log(f_{w,b}(x)) & \text{if } y=1\\-\log(1-f_{w,b}(x)) & \text{if } y=0\end{cases}$

P35 Simplified loss function

Loss function:

$L(f_{w,b}(x),y)=-y\log(f_{w,b}(x))-(1-y)\log(1-f_{w,b}(x))$

Cost function:

$J(w,b)=\dfrac{1}{m}\sum_{i=1}^{m} L(f_{w,b}(x^{(i)}),y^{(i)})=-\dfrac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(f_{w,b}(x^{(i)}))+(1-y^{(i)})\log(1-f_{w,b}(x^{(i)}))\right]$

Loss is the cost for a single data point.

Why choose this as the cost function? It is derived using maximum likelihood estimation, an idea from statistics on how to efficiently find the parameters of a model.
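A short sketch of the cost computed from predictions `f` in (0, 1) and labels `y` in {0, 1} (assuming both are NumPy arrays):

```python
import numpy as np

def logistic_cost(f, y):
    """Average logistic loss (binary cross-entropy) over all examples."""
    return -np.mean(y * np.log(f) + (1 - y) * np.log(1 - f))
```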

P37 Overfitting

Underfit, high bias

Overfit, high variance

P38 Addressing Overfitting

Option 1: Collect more training data.

Option 2: Feature selection. Reduce the number of features; select the most relevant ones.

All features + insufficient data = overfit

Option 3: Regularization

Keep all the features, but shrink the parameter values $w_j$ to reduce the effect of each feature (reduce the size of the parameters).

P39 Regularization

Add regularization term for cost function:

$J(w,b)=\dfrac{1}{2m}\sum_{i=1}^{m}(f(x^{(i)})-y^{(i)})^2 + \dfrac{\lambda}{2m}\sum_{j=1}^{n} w_j^2,\quad \lambda > 0$
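A hedged sketch of this regularized cost for linear regression (note that $b$ is not regularized; names are my own):

```python
import numpy as np

def regularized_cost(X, y, w, b, lam):
    """Squared-error cost plus L2 penalty on w; lam is lambda."""
    m = X.shape[0]
    err = X @ w + b - y
    return (err @ err) / (2 * m) + (lam / (2 * m)) * (w @ w)
```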

P40 Regularization for linear regression

C2-Advanced Learning Algorithms

P45 Demand prediction

Inference (prediction)

activation; activation values;

$a = f(x) = \dfrac{1}{1+e^{-(w\cdot x+b)}}$

$a^{[1]}$ denotes the output (activation value) of layer 1. input layer is layer 0.

P48

$a_j^{[l]}=g(w_j^{[l]}\cdot a^{[l-1]}+b_j^{[l]})$

g is the activation function

P51

A 2×3 matrix in NumPy is a 2D array: `x = np.array([[1, 2, 3], [4, 5, 6]])`.

A vector is a 1D array: `x = np.array([200, 17])`, just a flat list of numbers.

P53 Forward prop in a single layer
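A minimal NumPy sketch of forward prop through one dense layer, assuming a sigmoid activation and `W` laid out with one column per neuron:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense(a_in, W, b):
    """One layer: a_out[j] = g(w_j . a_in + b_j) for each neuron j."""
    return sigmoid(a_in @ W + b)   # W: (n_in, n_units), b: (n_units,)
```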

P61

For binary classification problems, the logistic loss function is also known as binary cross-entropy.

Compute derivatives for gradient descent using back propagation.

P62 Activation function

Sigmoid

The most commonly used one is ReLU (rectified linear unit): $g(z)=\max(0,z)$.

P63

For classification problems, the sigmoid function is the natural choice for the output layer.

Regression: Linear activation function for output layer

Or ReLU if the output values are non-negative.

For hidden layers, ReLU is the most commonly used activation; sigmoid is rarely used today. ReLU is faster to compute, and it goes flat only on the left, whereas sigmoid has two flat zones; fewer flat regions make gradient descent faster.

P66 Multiclass, Softmax

Softmax is used for multiclass classification problems.

Softmax regression, N possible outputs

$z_j = w_j \cdot x + b_j, j=1,...,N$

parameters w and b

activation function: $a_j=\dfrac{e^{z_j}}{\sum_{k=1}^{N}e^{z_k}}=\mathbf{P}(y=j|x)$

$a_1 + ... + a_N=1$

If softmax is used for binary classification ($N=2$), it is the same as the sigmoid activation.
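A minimal NumPy sketch of softmax (subtracting the max is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def softmax(z):
    """a_j = e^{z_j} / sum_k e^{z_k}; the outputs sum to 1."""
    ez = np.exp(z - np.max(z))   # shifting z does not change the result
    return ez / ez.sum()
```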

P67

For the digit recognition problem, use softmax as the output layer, with 10 neurons (one per digit).

In TensorFlow, the loss function is named `SparseCategoricalCrossentropy`.

P68

In TensorFlow, `from_logits=True` reduces round-off error when used with softmax: the softmax is folded into the loss computation instead of being applied in the output layer.

```python
model.compile(loss=SparseCategoricalCrossentropy(from_logits=True))
```
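A hedged sketch of the full pattern (layer sizes are illustrative): the output layer uses a linear activation so the model emits logits, and softmax is applied afterwards to recover probabilities.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation='relu'),
    tf.keras.layers.Dense(10, activation='linear'),  # outputs logits z
])
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
)
# After training, convert logits to probabilities:
# probs = tf.nn.softmax(model(X))
```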

P69 Multi-label classification

Can use sigmoid activation functions for the output layer, one unit per label.

P79 Advanced Optimization

Adam (adaptive moment estimation) automatically adjusts the learning rate $\alpha$. It uses not just one $\alpha$: every parameter of the model gets its own learning rate.

If a parameter $w_j$ (or $b$) keeps moving in the same direction, increase its learning rate; if it keeps oscillating, reduce its learning rate.

In code, select the Adam optimizer in `model.compile` and give it an initial (default) learning rate.

It typically converges faster than plain gradient descent.
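In code (a sketch, assuming `model` is an already-built Keras model; 1e-3 is just a typical starting value):

```python
import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # initial alpha
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
```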

P71

Dense layer.

Convolutional layer: each neuron only looks at part of the previous layer's activations, i.e. a window.

  • Faster computation
  • Need less training data, less prone to overfitting.

LSTM, Transformer, attention models.

P75

Diagnostic: A test that you run to gain insight into what is/isn't working with a learning algorithm, to gain guidance into improving its performance.

P77 Model selection

Training set, cross validation data, test set.

Evaluate models on the cross validation set during development, and use the test set to estimate the generalization error.

P78 Bias/Variance

High bias (underfit): $J_{train}$ is high; $J_{cv}$ is high.

High variance (overfit): $J_{train}$ is low; $J_{cv}$ is high.

P80 Establishing a baseline

For a high-bias model, adding more training data helps little.

For a high-variance model, adding more training data helps.

P82

  • Get more training examples - fixes high variance
  • Try smaller sets of features - fixes high variance
    • Reduce flexibility of the model
  • Try getting additional features - fixes high bias
  • Try adding polynomial features - fixes high bias
  • Try decreasing lambda - fixes high bias
  • Try increasing lambda - fixes high variance

P83

Simple model -> high bias; complex model -> high variance.

Tradeoff between high bias and high variance.

Large neural networks are low-bias machines: they fit very complicated functions well. So when training neural networks, we usually face high-variance problems rather than high-bias problems, provided the network is large enough.

Does it do well on the training set? If not, use a bigger network.

Does it do well on the cross validation set? If not, get more training data.

A large neural network will usually do as well or better than a smaller one so long as regularization is chosen appropriately.

P84

Iterative loop of ML development:

Choose architecture (model, data, etc.) -> train the model -> diagnostics (bias, variance, error analysis) -> adjust the architecture, and repeat.

Text classification. Features: use a list of the top 10,000 words to compute the input variables.

Logistic model, or neural network model.

P85 Error analysis

Group the misclassified examples in the cross validation set based on common traits/features.

These groups are not mutually exclusive.

If the dataset is large, randomly sample from the misclassified examples for analysis.

How to try to reduce your spam classifier's error?

  1. Collect more data
  2. Develop sophisticated features based on email routing, from email header
  3. Define sophisticated features from the email body, e.g. treating related word forms as the same word.
  4. Design algorithms to detect misspellings.

P89

Data augmentation: modifying an existing training example to create a new training example.

Examples:

Image recognition: distort an image by mirroring, rotating, enlarging, shrinking, or changing the contrast.

Speech recognition: add different noisy backgrounds or simulate a bad cellphone connection.

It usually does not help to add purely random/meaningless noise to the data; the distortions should be representative of the types of noise/distortions in the test set.

Data synthesis

Photo OCR example: generate synthetic data for training.

Conventional model-centric approach: AI = Code + Data; work mostly on the code (algorithm/model).

Data-centric approach: AI = Code + Data; focus on data engineering.

P90 Transfer learning

Remove the output layer of the original model and replace it with the output layer needed for the new task.

Reuse the parameters of the earlier layers from the original model.

Option 1: only train the output layer's parameters.

Option 2: train all parameters, with the earlier layers initialized from the original model.

Two steps: supervised pretraining, then fine-tuning.

Convolutional NN: the first layer detects edges, then corners, then curves/basic shapes.

Summary: 1. Download a NN with pretrained parameters. 2. Further train (fine-tune) the network on your own data.

P88 ML Project Development process

Define project: Scope the project.

Collect data: define and collect data.

Train model: Training, error analysis, iterative improvement.

Deploy in production: Deploy, monitor and maintain system.

Deploy the ML model to an inference server; a mobile app makes API calls to the server, and the server returns inference results to the app.

Software engineering may be needed for:

  • Ensure reliable and efficient predictions
  • Scaling
  • Logging
  • System monitoring
  • Model updates

MLOps (ML operations): the practice of systematically building, deploying, and maintaining ML models.

P89 Ethics: fairness, bias, and other ethical considerations

P90 Skewed datasets

Accuracy alone cannot identify the best model, because the dataset might be skewed (one class is rare).

The confusion matrix gives better metrics: precision and recall.

Precision: TP / (# predicted positive) = TP / (TP + FP)

Recall: TP / (# actual positive) = TP / (TP + FN)

P91 Trading off precision and recall

Raising the logistic regression threshold leads to higher precision, lower recall.

Lowering the threshold results in lower precision, higher recall.

F1 score: $F_1 = \dfrac{1}{\frac{1}{2}\left(\frac{1}{P}+\frac{1}{R}\right)} = \dfrac{2PR}{P+R}$; the harmonic mean of $P$ and $R$.
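A small sketch computing these metrics from confusion-matrix counts (names are my own):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```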

P91 Decision Trees

Cat classification example

Inputs are categorical (discrete) values.

Nodes: the topmost node is the root node; decision nodes sit in the middle of the tree; the bottom nodes are leaf nodes, which make the predictions.

P92

Decision 1: how to choose what feature to split on at each node?

Maximize purity (or minimize impurity): split on the feature that produces the purest subsets.

Decision 2: When do you stop splitting?

  • When a node is 100% one class
  • When splitting a node will result in the tree exceeding a maximum depth.
  • When improvements in purity score are below a threshold.
  • When number of examples in a node is below a threshold.

P93 Measuring purity

Entropy as a measure of impurity

Entropy ranges from 0 to 1 (for two classes); the lower the entropy, the higher the purity.

$p_0 = 1 - p_1$

$H(p_1)=-p_1\log_2(p_1)-p_0\log_2(p_0)$

Note: $0\log(0) = 0$

Using base-2 logarithms makes the peak of the curve equal to 1 (at $p_1=0.5$).

P94: Decision Tree learning, choose a split: information gain

Choose the feature with the lowest weighted average entropy across the branches; equivalently, the one with the highest information gain:

$\text{Information gain} = H(p_1^{\text{root}}) - \left(w^{\text{left}}\, H(p_1^{\text{left}}) + w^{\text{right}}\, H(p_1^{\text{right}})\right)$
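A minimal sketch of entropy and information gain for a binary split (`p1_*` are the positive-class fractions, `w_*` the fractions of examples going to each branch; names are my own):

```python
import numpy as np

def entropy(p1):
    """H(p1), using the convention 0 * log2(0) = 0."""
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

def information_gain(p1_root, p1_left, w_left, p1_right, w_right):
    """Root entropy minus the weighted average entropy of the branches."""
    return entropy(p1_root) - (w_left * entropy(p1_left)
                               + w_right * entropy(p1_right))
```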

P96: Putting it together

  1. Start with all examples at the root node
  2. Calculate information gain for all possible features, and pick the one with the highest information gain
  3. Split dataset according to selected feature, and create left and right branches of the tree
  4. Keep repeating splitting process until stopping criteria is met
    1. When a node is 100% one class
    2. When splitting a node will result in the tree exceeding a maximum depth
    3. Information gain from additional splits is less than threshold
    4. When number of examples in a node is below a threshold.

P97: Using one-hot encoding of categorical features

If a categorical feature can take on k values, create k binary features.

P98: Continuous features

P99: Regression tree

Choose the feature whose split yields the lowest weighted average variance across the branches.

Calculate the weighted variance of each candidate split.

Equivalently, use variance reduction (root variance minus weighted variance after the split) as the measurement: split on the feature with the largest variance reduction.

P100: Tree ensembles

Trees are highly sensitive to small changes in the data. Using multiple trees and letting them vote on the final result makes predictions more robust.

P101: Sampling with replacement

Sampling with replacement constructs a new training set that is similar to, but also somewhat different from, the original training set.
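A one-liner sketch in NumPy (assuming `X_train` and `y_train` are NumPy arrays):

```python
import numpy as np

m = len(X_train)
idx = np.random.choice(m, size=m, replace=True)  # m draws, duplicates allowed
X_bag, y_bag = X_train[idx], y_train[idx]        # the new "bagged" training set
```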

P102: Random forest algorithm

A powerful tree-ensemble (bagged trees) algorithm.

Given training set of size $m$

For $b=1$ to $B$:

Use sampling with replacement to create a new training set of size $m$; Train a decision tree on the new dataset.

Randomizing the feature choice: at each node, when choosing a feature to split on, if $n$ features are available, pick a random subset of $k<n$ features and allow the algorithm to choose only from that subset (e.g. $k=\sqrt{n}$).

P103: XGBoost decision tree

Vocabulary: ensemble /ˌɑːnˈsɑːm.bəl/

It runs quickly, and open-source implementations are easy to use.

Given training set of size $m$

For $b=1$ to $B$:

Use sampling with replacement to create a new training set of size $m$. But instead of picking from all examples with equal ($1/m$) probability, make it more likely to pick examples that the previously trained trees misclassify.

Train a decision tree on the new dataset.

XGBoost (extreme gradient boosting)

  • Open source implementation of boosted trees
  • Fast efficient implementation
  • Good choice of default splitting criteria and criteria for when to stop splitting
  • Built in regularization to prevent overfitting
  • Highly competitive algorithm for machine learning competitions e.g. Kaggle

Sample code for classification:

```python
from xgboost import XGBClassifier

model = XGBClassifier()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
```

Regression:

```python
from xgboost import XGBRegressor

model = XGBRegressor()
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
```

P104 Decision Trees vs Neural Networks

Decision Trees and Tree ensembles

  • Work well on tabular (structured) data
  • Not recommended for unstructured data, images, audio, text
  • Fast
  • Small decision trees may be human interpretable

Neural network

  • Works well on all types of data, including tabular and unstructured data.
  • May be slower than a decision tree
  • Works with transfer learning
  • When building a system of multiple models working together, it might be easier to string together multiple neural networks.

C3-Unsupervised Learning

P107 Clustering

Applications: Grouping similar news, Market segmentation, DNA analysis, Astronomical data analysis.

P108 K-means clustering

Cluster centroid.

Repeat until converged:

Step 1: Assign each point to its closest centroid.

Step 2: Recompute the centroids.

P109 K-means algorithms

Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, \ldots, \mu_K$.

Repeat{

// Assign points to cluster centroids

for $i=1$ to $m$ training examples

$c^{(i)}:=$index (from 1 to K) of cluster centroid closest to $x^{(i)}$

// Move cluster centroids

for $k=1$ to K,

$\mu_k:=$ average of points assigned to cluster $k$

}
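A compact NumPy sketch of the loop above (fixed iteration count, no handling of empty clusters; names are my own):

```python
import numpy as np

def kmeans(X, K, iters=10):
    rng = np.random.default_rng()
    centroids = X[rng.choice(len(X), K, replace=False)]  # init from K examples
    for _ in range(iters):
        # assign each point to the closest centroid
        c = np.argmin(((X[:, None] - centroids) ** 2).sum(axis=2), axis=1)
        # move each centroid to the mean of its assigned points
        centroids = np.array([X[c == k].mean(axis=0) for k in range(K)])
    return c, centroids
```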

P110 Clustering: optimization objective

$c^{(i)}=$ index of the cluster (from 1 to $K$) to which example $x^{(i)}$ is currently assigned

$\mu_k=$ cluster centroid $k$

$\mu_{c^{(i)}}=$ cluster centroid of the cluster to which example $x^{(i)}$ has been assigned

Cost function (Distortion function)

$$J(c^{(1)},\ldots,c^{(m)},\mu_1,\ldots,\mu_K) = \frac{1}{m}\sum_{i=1}^{m}\|x^{(i)}-\mu_{c^{(i)}}\|^2$$

P111 Initializing K-means

Random initialization:

Choose $K<m$: the number of clusters should be smaller than the number of training examples.

Randomly pick $K$ training examples and set the centroids equal to those examples.

To avoid local optima, can run K-means multiple times, and pick the clustering result that gave the lowest cost.

P112 Choosing the number of clusters

Elbow method: run K-means for a range of $K$ values, plot the cost function, and pick the $K$ at the "elbow" of the curve.

More common method: evaluate K-means based on a metric for how well it performs for the later (downstream) purpose.

P113 Anomaly detection

Density estimation

Example:

Fraud detection: model the features of users' activities from data, then identify unusual users as those with low probability.

P114 Gaussian Distribution

$x$ is distributed as a Gaussian with mean $\mu$ and variance $\sigma^2$; $\sigma$ is the standard deviation.

$p(x)=\frac{1}{\sqrt{2\pi}\sigma} e^{\frac{-(x-\mu)^2}{2\sigma^2}}$

$\mu=\frac{1}{m}\sum x_i$

$\sigma^2=\frac{1}{m}\sum(x_i-\mu)^2$

P115 Algorithm

Training set: $\{x^{(1)},x^{(2)},\ldots,x^{(m)}\}$. Each example $x^{(i)}$ has $n$ features.

$p(x)=\prod_{j=1}^{n} p(x_j;\mu_j,\sigma_j^2)$

  1. Choose $n$ features $x_j$ that you think might be indicative of anomalous examples.

  2. Fit parameters $\mu_1,...,\mu_n,\sigma_1^2,...\sigma_n^2$

    $\mu_j=\frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}$

    $\sigma_j^2=\frac{1}{m}\sum_{i=1}^{m}(x_j^{(i)}-\mu_j)^2$

  3. Given a new example $x$, compute

    $p(x)=\prod_{j=1}^n p(x_j;\mu_j,\sigma_j^2)$

    Anomaly if $p(x)<\epsilon$ (see the sketch below).
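A minimal NumPy sketch of the algorithm above (fit on the training set, then score new examples; names are my own):

```python
import numpy as np

def fit_gaussian(X):
    """Step 2: per-feature mean and variance from the training set."""
    return X.mean(axis=0), X.var(axis=0)

def p(x, mu, var):
    """Step 3: product of univariate Gaussian densities over the features."""
    dens = np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return dens.prod()

# mu, var = fit_gaussian(X_train)
# is_anomaly = p(x_new, mu, var) < epsilon
```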

P116 Developing and evaluating an anomaly detection system

Have some labeled data.

The training set is an unlabeled dataset; assume the examples are normal (not anomalous).

Cross validation set.

Test set.

Cross validation and test sets include a few anomalous examples.

Aircraft engines monitoring example

10000 good engines, 20 flawed engines

Training set: 6000 good engines.

CV: 2000 good engines, 10 anomalous

Test: 2000 good engines, 10 anomalous

Train the algorithm on the training set; verify the anomaly detection performance and tune $\epsilon$ on the CV set; report the final result on the test set.

Alternative:

Training set: 6000 good engines; CV: 4000 good engines, 20 anomalous; No test set.

Anomaly Detection vs. Supervised Learning

Anomaly detection

  • Very small number of positive examples (0-20 is common); large number of negative examples.

  • Many different types of anomalies: it is hard for any algorithm to learn from the positive examples what anomalies look like; future anomalies may look nothing like any anomalous example seen so far.

  • Fraud detection

  • Manufacturing - finding new previously unseen defects.

  • Monitoring machines in a data center.

Supervised learning

  • Large number of positive and negative examples.

  • Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to those in the training set.

  • Email spam classification.

  • Manufacturing - Finding known, previously seen defects.

  • Weather prediction.

  • Disease classification.

P118 Choosing what features to use

Non-gaussian features

Transform the features to make them more Gaussian.

Log, Square, Sqrt, etc.

Error analysis for anomaly detection

Choose features that might take on unusually large or small values in the event of an anomaly.

P120 Recommender Systems
