The Fundamentals of Deep Learning, Part 1

Overview

A collection of deep learning knowledge and fundamentals.

Basic General Concepts

  • Types of machine learning:
    • Supervised Learning: Classification, Regression
    • Unsupervised Learning: Clustering, Dimensionality Reduction
    • Reinforcement Learning
  • Introduce the commonly used algorithms in Supervised Learning (a scikit-learn sketch of a few of these follows this list).
    • Linear Regression. Predicts a continuous target variable, assuming a linear relationship between the input features and the target variable.
    • Logistic Regression. For binary classification tasks. It uses the logistic function to constrain the output between 0 and 1.
    • Decision Trees. For both classification and regression tasks. Splits data into subsets based on the values of input features, using criteria like Gini impurity or entropy for classification, and MSE for regression.
    • Random Forest: An ensemble method that uses multiple decision trees to improve predictive performance. Each tree is trained on a random subset of the data, and the final prediction is made by averaging the results for regression or by majority voting for classification.
    • Support Vector Machines. Used for classification and regression tasks. Finds the hyperplane that best separates the classes in the feature space, uses different kernel functions (linear, polynomial, RBF) to handle non-linear classification.
    • k-Nearest Neighbours. For both classification and regression. Predicts from the labels or values of the k closest training examples.
    • Naive Bayes. A probabilistic classifier based on Bayes' theorem.
  • Introduce algorithms in Unsupervised Learning.
    • K-Means Clustering. Partitions the data into k clusters based on the distance to the cluster centroids.
    • Hierarchical Clustering. Builds a hierarchy of clusters by iteratively merging or splitting them.
    • Principal Component Analysis. Dimensionality reduction. Transforms the data into a new coordinate system with axes that maximize variance.
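
As a quick orientation, here is a minimal sketch of how a few of these algorithms look in practice, assuming scikit-learn is available; the synthetic datasets and parameters are illustrative only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Supervised: logistic regression on a synthetic binary classification dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X, y)
print("training accuracy:", clf.score(X, y))

# Unsupervised: k-means clustering and PCA on unlabeled synthetic blobs.
Xb, _ = make_blobs(n_samples=200, centers=3, n_features=4, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xb)
Xb_2d = PCA(n_components=2).fit_transform(Xb)
print("cluster sizes:", [int((labels == k).sum()) for k in range(3)])
```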

Gradient Descent Algorithm

  • Introduce Gradient Descent.

    Gradient descent is an optimization algorithm used to minimize the objective function (cost/loss function) of a model in machine learning. It is an iterative process that adjusts the parameters (weights and biases) to reduce the prediction error. It is used in machine learning algorithms such as linear regression, logistic regression, neural networks, and SVMs.

  • What is the objective function?

    In the ML domain, the terms objective function, cost function, and loss function are used interchangeably; they measure the error between predicted values and actual values.

    Examples:

    • Mean Squared Error for regression problems: $J=\frac{1}{n}\sum(y_i - \hat{y}_i)^2$
    • Cross-Entropy Loss for classification problems: $J=-\frac{1}{n}\sum[y_i \log{(\hat{y}_i)} + (1-y_i)\log(1-\hat{y}_i)]$
    • Hinge Loss for SVMs: $J = \frac{1}{n}\sum \max(0, 1-y_i*\hat{y}_i)$
  • What is the Gradient?

    The gradient is the vector of partial derivatives of the cost function w.r.t. each parameter. It points in the direction of the steepest ascent of the cost function.

  • Introduce the steps of Gradient Descent Algorithm.

    • Initialize weights and biases with random values
    • Calculate the gradient of the cost function w.r.t. each parameter
    • Adjust the parameters in the opposite direction of the gradient; repeat until convergence
  • Write the formulas used in the gradient descent algorithm for regression problems (a NumPy sketch follows this list).

    • Define cost function: $J(\theta)=\frac{1}{2m}\sum(h_\theta(x^{(i)})-y^{(i)})^2$
    • Compute the Gradient: $\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum (h_\theta(x^{(i)})-y^{(i)})x^{(i)}_j$
    • Update the parameters: $\theta_j := \theta_j-\alpha \frac{\partial J(\theta)}{\partial \theta_j}$
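
A minimal NumPy sketch of batch gradient descent for linear regression, implementing the three formulas above; the learning rate and iteration count are arbitrary choices.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Batch gradient descent using the cost, gradient, and update rules above."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        error = X @ theta - y              # h_theta(x^(i)) - y^(i)
        grad = (X.T @ error) / m           # dJ/dtheta_j = (1/m) * sum(error * x_j)
        theta -= alpha * grad              # theta_j := theta_j - alpha * dJ/dtheta_j
    return theta

# Tiny example: fit y = 1 + 2x (first column of X is the bias term).
X = np.c_[np.ones(5), np.arange(5.0)]
y = 1.0 + 2.0 * np.arange(5.0)
print(gradient_descent(X, y))   # approximately [1.0, 2.0]
```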

Data Engineering

  • What is data engineering.

    • Data engineering refers to preprocessing data so that a machine learning model can use it more effectively, for example by using intuition to design new features, or by transforming or combining original features.
  • Introduce some data engineering methods.

    • Data cleaning: Identifying and correcting errors or inconsistencies.
    • Data integration: Combining data from multiple sources.
    • Data normalization: Scale numerical data to a standard range so that the features contribute equally to the model's performance. Methods include mean normalization and z-score normalization.
  • List the normalization formulas (see the NumPy sketch after this list).

    • Mean normalization: $x_i:=\dfrac{x_i - \mu_i}{\max-\min}$
    • z-score normalization: $x_i:=\dfrac{x_i - \mu_i}{\sigma_i}$, $\sigma$ is the standard deviation.
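
A small NumPy sketch applying both formulas column-wise to a toy feature matrix (the values are made up).

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Mean normalization: (x - mean) / (max - min), computed per feature (column).
mean_norm = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Z-score normalization: (x - mean) / standard deviation, per feature.
z_norm = (X - X.mean(axis=0)) / X.std(axis=0)

print(mean_norm)
print(z_norm)
```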

Training Technique

  • What is cross validation.
    • A statistical method to evaluate the performance of a model. It splits the data into subsets, training the model on some subsets while validating it on the remaining ones. The main aim is to estimate how the model will perform on an independent dataset, helping to prevent overfitting.
  • What is K-Fold Cross-Validation.
    • Split the dataset into $k$ folds, train the model $k$ times, each time using a different fold as the validation set and the remaining $k-1$ folds as the training set. Average the results of the $k$ evaluations.
  • Introduce one common workflow of a model training process (see the sketch after this list).
    • Split dataset into training and test sets.
    • Apply k-fold cross-validation on the training set.
    • Train the final model on the entire training set.
    • Evaluate the final model on the test set.
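
A sketch of this workflow with scikit-learn; the dataset and the choice of model are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, random_state=0)

# 1. Split the dataset into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 2. Apply k-fold cross-validation on the training set (k = 5 here).
model = RandomForestClassifier(random_state=0)
scores = cross_val_score(model, X_train, y_train, cv=5)
print("cross-validation accuracy:", scores.mean())

# 3. Train the final model on the entire training set.
model.fit(X_train, y_train)

# 4. Evaluate the final model on the held-out test set.
print("test accuracy:", model.score(X_test, y_test))
```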

Evaluate a model using cross-validation data during training, and use the test set to estimate the generalization error.

Bias and Variance

  • What is Bias and Variance of a model.
    • Bias is the error due to an overly simplistic (underfit) model.
    • Variance is the error due to sensitivity to small fluctuations in the training set; the model is overfitting.
    • Underfitting leads to high bias, overfitting leads to high variance.
  • How to address overfitting (high variance).
    • Collect more training data.
    • Feature selection: select the most relevant features.
    • Using all features with insufficient data leads to overfitting.
    • Add a regularization term to the cost function and increase its $\lambda$ during training.
  • How to address underfitting (high bias). (A scikit-learn sketch after this list illustrates remedies for both cases.)
    • Get additional features.
    • Try adding polynomial features for regression tasks.
    • Try decreasing $\lambda$ for the regularization term.
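
A hedged scikit-learn sketch of both remedies: adding polynomial features helps against underfitting, while increasing the penalty strength (Ridge's `alpha` plays the role of $\lambda$) helps against overfitting. The degree and alpha values are illustrative only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

# Polynomial features address high bias; a larger alpha (lambda) reins in high variance.
model = make_pipeline(PolynomialFeatures(degree=5), Ridge(alpha=1.0))
model.fit(X, y)
print("training R^2:", model.score(X, y))
```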

Prevent Overfitting

Large neural networks are low-bias machines; they fit very complicated functions well. So when training a neural network, we more often face overfitting problems than underfitting.

  • What is regularization?
    • Regularization is used to prevent overfitting and improve the generalization of the model.
    • Regularization adds a penalty term on the weights of the network, encouraging the model to keep the weights small (a small NumPy sketch of such a cost follows this list).
    • One cost function with regularization term: $J(w,b)=\dfrac{1}{2m}\sum(f(x)-y)^2 + \dfrac{\lambda}{2m}\sum w^2, \lambda > 0$
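
A small NumPy sketch that evaluates this regularized cost for a linear model; the function and argument names (`regularized_cost`, `lam`) are just illustrative.

```python
import numpy as np

def regularized_cost(w, b, X, y, lam):
    """Squared-error cost plus an L2 penalty on the weights (formula above)."""
    m = X.shape[0]
    preds = X @ w + b                              # f(x) for a linear model
    mse_term = np.sum((preds - y) ** 2) / (2 * m)  # (1/2m) * sum((f(x) - y)^2)
    reg_term = lam * np.sum(w ** 2) / (2 * m)      # (lambda/2m) * sum(w^2)
    return mse_term + reg_term

X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
print(regularized_cost(np.array([0.1, 0.2]), 0.0, X, y, lam=1.0))
```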

A large neural network will usually do as well or better than a smaller one so long as regularization is chosen appropriately.

  • What is dropout?

    • Randomly selected neurons are ignored during training: no weight updates are applied to those neurons. This helps prevent the model from becoming too reliant on any particular neuron, thereby improving generalization (see the NumPy sketch after this list).
  • Introduce other techniques that prevent overfitting and improve generalization.

    • Data augmentation: Creating new training samples from existing data by random rotation, translation, flipping, and scaling.
    • Early stopping: Monitor the model's performance on a validation set and stop training when performance starts to deteriorate.
    • Batch normalization: Normalize the input of each layer so the mean output activation is close to 0 and the standard deviation is close to 1. It helps to stabilize the learning process and reduce sensitivity to the initial weights.
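
A minimal NumPy sketch of one common way to implement dropout (the "inverted dropout" variant, which rescales surviving activations so no adjustment is needed at test time); the rate is arbitrary.

```python
import numpy as np

def dropout(activations, rate=0.5, training=True):
    """Zero out a random subset of activations during training (inverted dropout)."""
    if not training or rate == 0.0:
        return activations
    keep_prob = 1.0 - rate
    mask = np.random.rand(*activations.shape) < keep_prob
    # Scale the kept activations so their expected value stays the same.
    return activations * mask / keep_prob

a = np.ones((2, 4))
print(dropout(a, rate=0.5))   # roughly half the entries become 0, the rest 2.0
```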

Evaluation Metrics

  • Introduce evaluation metrics for classification.

    • Accuracy: Proportion of correctly classified instances out of the total instances. $(TP + TN)/(P+N)$
    • Precision: Proportion of true positive predictions out of all positive predictions. $(TP/(TP+FP))$
    • Recall (Sensitivity or True positive rate): Proportion of true positive predictions out of all actual positives. $TP/(TP+FN)$
    • F1 score: The harmonic mean of precision and recall. $2(Precision \cdot Recall)/(Precision+Recall)$ (see the sketch after this list for these metrics computed by hand)
  • Metrics for regression tasks.

    • Mean absolute error, mean squared error, root mean squared error.
  • List examples to use different metrics.

    • For imbalanced datasets and rare-event detection such as fraud detection, use precision and recall: high precision minimizes false positives, high recall minimizes false negatives. The F1 score is often used to balance the two.
    • For balanced datasets, use accuracy. It is a straightforward and effective metric.
    • For medical diagnosis, missing a positive case is much worse than incorrectly flagging a negative case; in cancer detection, for example, recall is crucial. If the treatment has significant side effects, high precision helps avoid unnecessary treatments.
    • For recommendation systems, users care more about the relevance of the top results, so precision is key.
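
A minimal NumPy sketch computing these classification metrics by hand on a made-up set of predictions.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives
tn = np.sum((y_pred == 0) & (y_true == 0))   # true negatives

accuracy  = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```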

Decision Trees

  • Key terms of a decision tree.

    • Root node is the topmost node.
    • Decision nodes are the internal nodes in the middle of the tree.
    • Leaf nodes are the bottom nodes used for prediction.
    • Branches are the edges connecting nodes.
  • List the splitting criteria of decision trees.

    • Gini impurity, Entropy or Information gain, Variance reduction (the sketch after this list fits a tree with the Gini criterion)
  • Introduce the pros and cons of decision trees.

    • Pros: Simple to understand and interpret; little data preparation required; handles both numerical and categorical data.
    • Cons: Trees tend to overfit; predictions can be unstable under small changes in the input data; trees are biased if some classes dominate.
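
A short scikit-learn sketch that fits a shallow tree with the Gini criterion and prints its root node, decision nodes, and leaves; the dataset and depth are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(data.data, data.target)

# Text view of the tree: the first split is the root, inner splits are decision
# nodes, and the "class: ..." lines are the leaves used for prediction.
print(export_text(tree, feature_names=list(data.feature_names)))
```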

Trees are highly sensitive to small changes (noise) in the data. Using multiple trees and voting on the final result makes predictions more robust.

  • Introduce tree ensemble method Random forest

    • It builds multiple decision trees during training and combines results via averaging for regression or voting for classification.
    • Each tree is trained on a random subset of the training data (sampling with replacement) and a random subset of features. It reduces overfitting and increases the generalization ability.
  • Introduce tree ensemble method gradient boosting

    • Instead of picking from all examples with equal probability as in random forest, it makes it more likely to pick examples that the previously trained trees misclassified (trees are built sequentially).
    • Gradient Boosting is known for its high predictive accuracy and ability to capture complex relationships in the data.
    • XGBoost (extreme gradient boosting) has built-in regularization techniques to prevent overfitting, which help control the complexity of the model. It comes with good default choices for the splitting criteria and for when to stop splitting. It is a highly competitive algorithm for machine learning competitions like Kaggle (a scikit-learn sketch comparing random forest and gradient boosting follows).
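
A hedged scikit-learn sketch comparing the two ensemble styles on a synthetic dataset; scikit-learn's GradientBoostingClassifier stands in for the boosting family here (XGBoost itself is a separate library), and the hyperparameters are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Bagging-style ensemble: independent trees on random subsets of data and features.
rf = RandomForestClassifier(n_estimators=200, random_state=0)

# Boosting-style ensemble: trees are built sequentially to correct earlier errors.
gb = GradientBoostingClassifier(n_estimators=200, random_state=0)

for name, model in [("random forest", rf), ("gradient boosting", gb)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, "cv accuracy:", scores.mean())
```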