The Fundamentals of Deep Learning, Part 2

Overview

A collection of Deep Learning knowledge.

Activation Function

  • The purpose of an activation function is to introduce non-linearity into a model, allowing the network to learn and represent complex patterns in the data. List the common activation functions.
    • Linear function: $y = x$. It is used in just one place: the output layer.
    • Sigmoid function: an S-shaped curve, $y = 1 / (1 + e^{-x})$. It is non-linear; the curve is steepest when $x$ lies between -2 and 2. The output range is 0 to 1. Usually used in the output layer of a binary classifier, where the output is interpreted as class 0 or 1.
    • Tanh function: works better than the sigmoid function. Also known as the hyperbolic tangent, it is mathematically a scaled and shifted version of the sigmoid: $y = \tanh(x) = \frac{2}{1 + e^{-2x}} - 1$, or equivalently $y = 2 \cdot \mathrm{sigmoid}(2x) - 1$. The output range is -1 to 1. Usually used in the hidden layers of a network; it helps center the data by bringing the mean of the activations close to 0.
    • ReLU function: Rectified Linear Unit, the most widely used activation, chiefly used in hidden layers. $y = \max(0, x)$. ReLU is less computationally expensive than tanh and sigmoid since it involves simpler operations, and networks using it typically learn much faster.
    • Softmax: a generalization of the sigmoid function that is handy for multiclass classification problems. Usually used in the output layer of image classification networks, where it converts raw scores into probabilities that define the class of each input.

Basic rules: if you don't know which one to use, simply use ReLU. For the output layer, use sigmoid for binary classification and softmax for multiclass classification. A short PyTorch sketch of these functions follows.
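
As an illustration, here is a minimal sketch of the activation functions above using PyTorch's built-in implementations (the input tensor is arbitrary sample data, not from the original text):

```python
import torch

x = torch.linspace(-3.0, 3.0, steps=7)

print(torch.sigmoid(x))         # S-shaped, output in (0, 1)
print(torch.tanh(x))            # zero-centered, output in (-1, 1)
print(torch.relu(x))            # max(0, x), cheap to compute
print(torch.softmax(x, dim=0))  # normalizes scores into probabilities summing to 1
```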

Convolutional Neural Networks

A CNN is specifically designed to process structured grid data such as images. It learns spatial hierarchies of features, which makes it effective for tasks like image classification, object detection, and semantic segmentation.

  • How does the convolutional layer work in a CNN?

    • It applies a set of filters (kernels) to the input; each filter slides over the input data, computing dot products. This produces a feature map that highlights the presence of specific features such as edges or textures. The convolution operation is typically followed by a non-linear activation function. (See the sketch after this list.)
  • What are the main components of a CNN?

    • Input layer: Holds the raw pixel values of the image. Input dimensions typically correspond to the height, width, and color channels.
    • Convolutional layer: extracts features from the input. Important parameters include the number of filters, filter size, stride, and padding.
    • Pooling layer: Reduces the spatial dimensions of the feature maps while retaining the most important information. Common types include max pooling and average pooling.
    • Output layer: produces the output probabilities for each class, using an activation function such as Softmax for multiclass classification or Sigmoid for binary classification.
  • Introduce some famous CNN architectures:

    • AlexNet: 2012. Five convolutional layers, some followed by max-pooling layers, plus three fully connected layers. Uses ReLU activations.
    • VGGNet: 2014. A series of convolutional layers with small receptive fields (3x3), max-pooling layers, and three fully connected layers. Variants include VGG16 and VGG19. It showed that network depth is a critical component of high performance.
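
To make the components concrete, here is a minimal sketch of a small CNN in PyTorch. The layer sizes, the 32x32 RGB input, and the 10-class output are illustrative assumptions, not from the original text:

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Convolutional layer: 3 input channels (RGB), 16 filters,
        # 3x3 kernel, stride 1, padding 1 (preserves spatial size).
        self.conv = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)
        self.relu = nn.ReLU()
        # Pooling layer: max pooling halves the spatial dimensions.
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        # Output layer: maps flattened features to class scores.
        self.fc = nn.Linear(16 * 16 * 16, num_classes)

    def forward(self, x):
        x = self.pool(self.relu(self.conv(x)))  # feature map -> activation -> downsample
        x = x.flatten(start_dim=1)
        return self.fc(x)                       # raw scores; apply softmax for probabilities

model = SmallCNN()
scores = model(torch.randn(1, 3, 32, 32))  # one 32x32 RGB image
probs = torch.softmax(scores, dim=1)       # per-class probabilities
```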

Hyperparameters

  • A brief introduction to the key hyperparameters in a neural network training session (see the sketch after this list).
    • Learning Rate: controls how much to adjust the weights in response to the error each time the model is updated. A higher learning rate means bigger steps, which can speed up training but might overshoot the optimal solution; a lower rate means smaller steps and more precise convergence.
    • Batch Size: the number of training examples used in one forward/backward pass. A larger batch size provides a more accurate estimate of the gradient but requires more memory; a smaller one leads to noisier estimates but may help generalization.
    • Epochs: the number of times the entire training dataset passes through the neural network. More epochs allow the model to learn more, but too many can lead to overfitting.
    • Dropout Rate: a regularization technique that randomly ignores neurons during training. The rate determines the proportion of neurons to be dropped; dropping around 20% of nodes is typical.
    • Learning Rate Decay: reduces the learning rate as training progresses, helping the model converge more smoothly towards the end of training.
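
A minimal sketch of where these hyperparameters typically appear in a PyTorch training setup (all concrete values and the synthetic dataset are illustrative assumptions):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Dropout rate: 20% of hidden activations are zeroed during training.
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Dropout(p=0.2), nn.Linear(64, 2))

# Learning rate: step size for each weight update.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Learning rate decay: multiply the learning rate by 0.9 every 10 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.9)

# Batch size: 32 examples per forward/backward pass.
dataset = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

criterion = nn.CrossEntropyLoss()
for epoch in range(50):            # epochs: full passes over the dataset
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
    scheduler.step()               # apply learning rate decay once per epoch
```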

Others

  • What is data normalization?
    • It is a preprocessing step that standardizes and rescales the data. Values are rescaled to fit into a particular range (for example, zero mean and unit variance), which helps the network achieve better convergence.
  • What is the difference between a feedforward neural network and recurrent neural network?
    • In a feedforward network, signals travel in one direction, from input to output; there are no feedback loops between layers, so it cannot memorize previous inputs.
    • In a recurrent neural network, signals can travel in a loop: a layer's output is fed back as input. The network combines the current input with previously received inputs when generating a layer's output, and it can memorize past data through its internal memory.
  • What is batch normalization?
    • It is a technique to improve the performance and stability of neural networks by normalizing the inputs to each layer so that they have a mean output activation of zero and a standard deviation of one. (See the sketch after this list.)
  • What is long short term memory network?
    • It is a special kind of RNN capable of learning long-term dependencies, remembering information for long periods as its default behavior.
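
A minimal sketch of batch normalization and an LSTM in PyTorch (all shapes and sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Batch normalization: normalizes each feature to zero mean and unit
# standard deviation across the batch, then applies a learned scale and shift.
bn = nn.BatchNorm1d(num_features=8)
batch = torch.randn(32, 8)            # 32 examples, 8 features
normalized = bn(batch)

# LSTM: a recurrent layer whose cell state lets it remember
# information across long sequences.
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
sequence = torch.randn(32, 20, 8)     # 32 sequences of 20 time steps
outputs, (h_n, c_n) = lstm(sequence)  # outputs: (32, 20, 16)
```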

Program

Write a simple neural network with PyTorch, including the training steps and verification with new input data.

source code
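
A minimal sketch of such a program: a tiny feedforward classifier trained on synthetic data and then verified on a new input (the dataset, dimensions, and hyperparameters are illustrative assumptions, not necessarily the linked source):

```python
import torch
import torch.nn as nn

# Synthetic dataset: classify 2D points by whether the sum of
# their coordinates is positive (class 1) or not (class 0).
torch.manual_seed(0)
inputs = torch.randn(200, 2)
labels = (inputs.sum(dim=1) > 0).long()

# A simple two-layer network.
model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Training steps: forward pass, loss, backward pass, weight update.
for epoch in range(100):
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()

# Verification with new input data.
model.eval()
with torch.no_grad():
    new_point = torch.tensor([[1.5, 0.5]])   # sum > 0, so class 1 is expected
    predicted = model(new_point).argmax(dim=1)
    print(predicted.item())                  # should print 1
```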
