Deep Learning Review Session

2 minute read

NOTE: This blog is not finished.

Acknowledgements: The content of this blog is adapted from the Deep Learning Review Session (of IIIS, Tsinghua University) in 2025, hosted by me and Yiyang Lu.

1. Deep Learning Basics

1.1 Write in the Front

  • Deep Learning is a class of machine learning methods that use neural networks to learn representations from raw data.

  • A DL algorithm always consists of a neural network (model) architecture, a training procedure (loss function), and an inference procedure.

  • Always use backpropagation.

  • Always use a mini-batch of data to compute the gradient, then do gradient descent on each dimension; the optimizer can vary, e.g., SGD, RMSProp, or Adam.

  • Convention: if a function $f$ is a neural network and its parameters are $\theta$, we write $f_\theta(\cdot)$ or $f(\cdot;\theta)$ and say $f$ is parameterized by $\theta$. The loss function is $\mathcal L$, the dataset is $\mathcal D$.
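As a concrete toy illustration of the mini-batch training loop above, here is a NumPy sketch that fits a one-dimensional linear model $f_\theta(x)=wx+b$ with plain SGD. The dataset, learning rate, and batch size are all illustrative choices, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset D: y = 2x + 1 plus a little noise (illustrative).
X = rng.normal(size=(256, 1))
y = 2.0 * X[:, 0] + 1.0 + 0.1 * rng.normal(size=256)

theta = np.zeros(2)          # parameters [w, b]
eta, batch_size = 0.1, 32    # learning rate, mini-batch size

for step in range(500):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # sample a mini-batch
    xb, yb = X[idx, 0], y[idx]
    pred = theta[0] * xb + theta[1]                     # f_theta(x)
    err = pred - yb                                     # residual of the squared loss
    grad = np.array([np.mean(err * xb), np.mean(err)])  # dL/dtheta on the batch
    theta -= eta * grad                                 # SGD update

print(theta)  # close to [2, 1]
```

In a real DL framework the gradient would come from backpropagation rather than a hand-derived formula, but the loop structure is the same.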

1.2 What Kind of Loss Function to Use?

The loss function should always be differentiable.

We always prefer a tractable loss. If the loss is intractable, we have to estimate its gradient (e.g., via Monte Carlo), which introduces error and instability.

What is tractable?

  • $\mathbb E_{x\sim \mathcal D}$ and $\mathbb E_{z\sim \mathcal N(0,I)}$ are both tractable, since we can sample a mini-batch from dataset $\mathcal D$ and Gaussian $\mathcal N$ (or other known distribution) during training.

  • The direct output of the model.

CAVEAT: If the model takes $z$ as input and outputs $x$, then $p_\theta(x|z)$ is tractable, but $p_\theta(z|x)$ often is not! This is what “direct output” means.
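To make the "sample a mini-batch" point concrete: a tractable expectation can be estimated by averaging over samples (Monte Carlo). The example below, illustrative and not from the original notes, estimates $\mathbb E_{z\sim\mathcal N(0,1)}[z^2]=1$ with a batch of Gaussian samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# E_{z ~ N(0,1)}[z^2] = 1 exactly; estimate it with a "mini-batch" of samples.
batch = rng.normal(size=4096)
estimate = np.mean(batch**2)
print(estimate)  # close to 1
```

This is exactly what happens when a loss contains $\mathbb E_{z\sim\mathcal N(0,I)}$: during training, the expectation is replaced by the batch average.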

1.3 Optimization Methods

  • Gradient Descent (GD): $\theta\leftarrow \theta-\eta\cdot \nabla_\theta \mathcal L$.

  • Stochastic GD (SGD) (in the DL scenario): sample a small batch of data, then compute the average loss over the batch as an estimate of the true $\mathcal L$.

  • Momentum: $\theta_{t+1}\leftarrow \theta_t-\eta\cdot \nabla_{\theta_t} \mathcal L(\theta_t)+\beta\cdot (\theta_t-\theta_{t-1})$, where $\theta_t$ is the parameters at step $t$. No convergence guarantee.

  • Nesterov’s Method: $\theta_{t+1}\leftarrow \theta_t-\eta\cdot \nabla_{\theta_t} \mathcal L({\color{red}\theta_t+\beta\cdot (\theta_t-\theta_{t-1})})+\beta\cdot (\theta_t-\theta_{t-1})$. Provably faster convergence for smooth convex functions.

  • AdaGrad: $\theta_{t+1}[i]=\theta_t[i]-\dfrac{\eta}{\sqrt{G_t[i]+\epsilon}}(\nabla_{\theta_t}\mathcal L)[i]$, $G_t[i]=\sum_{s\le t} |(\nabla_{\theta_s}\mathcal L)[i]|^2$.

  • RMSProp: Change $G_t[i]=\gamma \cdot G_{t-1}[i]+(1-\gamma)|(\nabla_{\theta_t}\mathcal L)[i]|^2$ (a “moving average”).

  • Adam: Change the gradient term $\nabla_{\theta_t}\mathcal L$ in RMSProp into the momentum version

    \[M_t=\delta M_{t-1}+(1-\delta)\nabla_{\theta_t}\mathcal L;\]

    compute the bias-corrected second raw moment and first moment estimates

    \[\hat{G}_t = \frac{G_t}{1-\gamma^t},\quad \hat{M}_t = \frac{M_t}{1-\delta^t};\]

    and the update rule is

    \[\theta_{t+1}[i]=\theta_t[i]-\dfrac{\eta}{\sqrt{\hat{G}_t[i]+\epsilon}}\hat{M}_t[i].\]
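The Adam equations above translate directly into NumPy. The function below is an illustrative single-step implementation; the function name and test problem are made up for the example, with the conventional defaults $\delta=0.9$, $\gamma=0.999$.

```python
import numpy as np

def adam_step(theta, grad, M, G, t, eta=1e-3, delta=0.9, gamma=0.999, eps=1e-8):
    """One Adam update following the equations above (epsilon placed
    inside the square root, matching the formula in this note)."""
    M = delta * M + (1 - delta) * grad        # first moment (momentum term)
    G = gamma * G + (1 - gamma) * grad**2     # second raw moment (RMSProp term)
    M_hat = M / (1 - delta**t)                # bias corrections
    G_hat = G / (1 - gamma**t)
    theta = theta - eta * M_hat / np.sqrt(G_hat + eps)
    return theta, M, G

# Usage: minimize L(theta) = sum(theta^2), whose gradient is 2*theta.
theta = np.array([3.0, -2.0])
M = np.zeros_like(theta)
G = np.zeros_like(theta)
for t in range(1, 5001):
    theta, M, G = adam_step(theta, 2 * theta, M, G, t, eta=0.01)
print(theta)  # approaches [0, 0]
```

Note that $t$ starts at 1, otherwise the bias-correction denominators $1-\delta^t$ and $1-\gamma^t$ would be zero.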

RMSProp and Adam are common choices in modern DL.

Essentially, AdaGrad and RMSProp give each dimension its own learning rate (instead of a single constant $\eta$). For example, in AdaGrad, the effective learning rate of dimension $i$ is $\dfrac{\eta}{\sqrt{G_t[i]+\epsilon}}$.
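The momentum-style updates listed above can also be sketched in a few lines. The following illustrative NumPy snippet runs Nesterov's method on a toy quadratic loss $\mathcal L(\theta)=\frac12\theta^\top A\theta$; the matrix $A$ and the hyperparameters are invented for the example.

```python
import numpy as np

# Illustrative quadratic loss L(theta) = 0.5 * theta^T A theta, gradient A @ theta.
A = np.diag([1.0, 10.0])
grad = lambda th: A @ th

eta, beta = 0.05, 0.9
theta_prev = theta = np.array([1.0, 1.0])

for _ in range(200):
    v = theta - theta_prev  # the momentum term theta_t - theta_{t-1}
    # Nesterov: evaluate the gradient at the looked-ahead point (the red term above).
    theta_next = theta - eta * grad(theta + beta * v) + beta * v
    theta_prev, theta = theta, theta_next

print(theta)  # near the minimum [0, 0]
```

Plain momentum is the same loop with the gradient evaluated at `theta` instead of `theta + beta * v`.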

Practices

  • Why SGD but not GD?

    Computationally affordable; the gradient noise also helps escape saddle points.

  • What is the benefit of RMSProp compared to AdaGrad?

    It avoids the vanishing learning rate: in AdaGrad, $G_t[i]$ only grows, so the effective learning rate shrinks toward zero, while the moving average keeps it bounded.

  • T/F: Adam and AdaGrad make the learning rate different and disentangled across dimensions.

    True.

  • T/F: Although Nesterov Momentum has no convergence guarantee, it is commonly used in real training.

    False. Neither part is true: Nesterov’s method does have a convergence guarantee (for smooth convex functions), and in practice RMSProp and Adam are the more common choices.

  • T/F: SGD and GD have the same mean and variance in training.

    False. The mini-batch gradient is an unbiased estimate of the full gradient (same mean), but it has non-zero variance, whereas GD is deterministic.

  • T/F: Second-order optimization is more stable than gradient descent.

    False. Newton-type updates can diverge on non-convex functions because the Hessian may have negative eigenvalues.

1.4 Model Architecture: Multi-layer Perceptron (MLP)

Definition: alternating linear layers and activation layers.

  • Linear layer: $x_{\text{output}}=Wx_{\text{input}}+b$, where $W$ and $b$ are learnable.
  • Activation layer: a non-linear layer applied to each neuron.
    • ReLU: $f(x)=\max\{x,0\}$;
    • LeakyReLU: $f(x)=\max\{x,kx\}$ ($0<k<1$);
    • Sigmoid: $f(x)=\dfrac{1}{1+e^{-x}}$ and $f'(x)=f(x)(1-f(x))$;
    • Tanh: $f(x)=\dfrac{e^x-e^{-x}}{e^{x}+e^{-x}}$ and $f'(x)=1-f(x)^2$.
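The activations and the derivative identities above are easy to check numerically. The NumPy sketch below (with an illustrative $k=0.1$ for LeakyReLU) verifies the sigmoid and tanh derivative formulas against a central difference.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3, 3, 7)

relu = np.maximum(x, 0.0)       # ReLU: max{x, 0}
leaky = np.maximum(x, 0.1 * x)  # LeakyReLU with k = 0.1 (illustrative)

# Check the derivative identities with a central difference.
h = 1e-5
sig_num = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
assert np.allclose(sig_num, sigmoid(x) * (1 - sigmoid(x)), atol=1e-6)

tanh_num = (np.tanh(x + h) - np.tanh(x - h)) / (2 * h)
assert np.allclose(tanh_num, 1 - np.tanh(x)**2, atol=1e-6)
```

These closed-form derivatives are one reason sigmoid and tanh were historically convenient: backpropagation can reuse the forward-pass output $f(x)$ instead of recomputing anything.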