Demystifying Neural Networks (Part 3): Backpropagation – The Learning Tool of Neural Networks

In the previous blog post, we learned about the basic concepts of cost functions and gradient descent.

Today, we will delve into the core of neural network training - the backpropagation algorithm, and answer some common questions.

1. What is Backpropagation?


Backpropagation is an efficient algorithm used to calculate the gradient of the cost function with respect to each parameter (weight and bias) in a neural network.

A gradient is a vector that points in the direction in which a function increases fastest at a given point, and its magnitude gives the rate of that increase.

In neural networks, each component of the gradient tells us how sensitive the cost function is to a small change in the corresponding weight or bias.

Through these gradients, we can use optimization algorithms such as gradient descent to update parameters, thereby minimizing the cost function and improving the model's prediction accuracy.
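
As a minimal sketch of how a gradient drives a parameter update, assume a hypothetical one-parameter cost C(w) = (w - 3)^2, whose minimum is at w = 3. The function names and the learning rate below are illustrative choices, not part of any library:

```python
def cost(w):
    # Hypothetical cost with its minimum at w = 3.
    return (w - 3.0) ** 2

def cost_grad(w):
    # Derivative of the cost with respect to w.
    return 2.0 * (w - 3.0)

def gradient_descent(w, learning_rate=0.1, steps=50):
    # Repeatedly step against the gradient, i.e. downhill on the cost.
    for _ in range(steps):
        w = w - learning_rate * cost_grad(w)
    return w

w_final = gradient_descent(w=0.0)
print(w_final, cost(w_final))
```

Each step moves w a little in the direction opposite to the gradient, so the cost shrinks and w approaches 3.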

Backpropagation vs. Forward Propagation:


Forward Propagation: Input data is passed from the input layer to the output layer, layer by layer, to obtain the model's predicted output.

Backpropagation: Starting from the output layer, the gradient of each parameter is calculated layer by layer, and the gradient information is passed back to the previous layer until the first layer is reached.

2. How Backpropagation Works


The core of the backpropagation algorithm is the chain rule, which allows us to calculate the derivative of composite functions.

In neural networks, the output of each layer is a function of the output of the previous layer, so we can use the chain rule to calculate the effect of each parameter on the final output (cost function).

Mathematical Principles


Assume we have the simplest possible network, a single neuron, where x is the input, w is the weight, b is the bias, a is the activation function, y is the predicted output, y_true is the target value, and L is the loss function.

Forward propagation can be represented as: y = a(w · x + b)

The loss is the squared error: L = (y - y_true)^2

Using the chain rule, we can calculate the partial derivative of the loss function with respect to the weight:

∂L/∂w = ∂L/∂y · ∂y/∂w = 2(y - y_true) · a'(w · x + b) · x
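
One way to build confidence in this formula is to compare it with a numerical estimate. The sketch below assumes a sigmoid activation for a (the text does not fix a particular activation) and uses arbitrary illustrative values for w, b, x, and y_true:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    # Derivative of the sigmoid: a'(z) = a(z) * (1 - a(z)).
    s = sigmoid(z)
    return s * (1.0 - s)

def loss(w, b, x, y_true):
    y = sigmoid(w * x + b)
    return (y - y_true) ** 2

def grad_w(w, b, x, y_true):
    # Chain rule: dL/dw = 2(y - y_true) * a'(w*x + b) * x
    z = w * x + b
    y = sigmoid(z)
    return 2.0 * (y - y_true) * sigmoid_prime(z) * x

w, b, x, y_true = 0.5, -0.2, 1.5, 1.0
analytic = grad_w(w, b, x, y_true)

# Central finite difference as an independent check.
eps = 1e-6
numeric = (loss(w + eps, b, x, y_true) - loss(w - eps, b, x, y_true)) / (2 * eps)
print(analytic, numeric)
```

The two printed values should agree to several decimal places, which is the same kind of gradient check often used to debug hand-written backpropagation code.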

Specific Steps of Backpropagation:


  1. Forward Propagation: Calculate the output of each neuron.
  2. Calculate Output Layer Error: Compare the predicted output of the model with the true value and calculate the error.
  3. Backpropagate Error: Starting from the output layer, calculate the contribution (gradient) of each parameter to the error layer by layer.
  4. Update Parameters: Update the parameters using optimization algorithms such as gradient descent, based on the magnitude and direction of the gradient.
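
The four steps above can be sketched for a tiny network with one hidden neuron and one output neuron. Everything here is an illustrative assumption: the scalar architecture, the sigmoid hidden activation, the initial parameter values, and the single training pair (x = 1.0, y_true = 0.8):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(params, x, y_true, learning_rate=0.1):
    w1, b1, w2, b2 = params

    # Step 1: forward propagation.
    z1 = w1 * x + b1
    h = sigmoid(z1)          # hidden activation
    y = w2 * h + b2          # linear output

    # Step 2: output layer error.
    loss = (y - y_true) ** 2
    dL_dy = 2.0 * (y - y_true)

    # Step 3: backpropagate the error with the chain rule.
    dL_dw2 = dL_dy * h
    dL_db2 = dL_dy
    dL_dh = dL_dy * w2
    dL_dz1 = dL_dh * h * (1.0 - h)   # sigmoid'(z1) = h * (1 - h)
    dL_dw1 = dL_dz1 * x
    dL_db1 = dL_dz1

    # Step 4: gradient descent update.
    new_params = (
        w1 - learning_rate * dL_dw1,
        b1 - learning_rate * dL_db1,
        w2 - learning_rate * dL_dw2,
        b2 - learning_rate * dL_db2,
    )
    return new_params, loss

params = (0.5, 0.0, -0.3, 0.1)
losses = []
for _ in range(200):
    params, loss = train_step(params, x=1.0, y_true=0.8)
    losses.append(loss)
print(losses[0], losses[-1])
```

Note how the gradient for the hidden layer reuses dL_dy from the output layer. That reuse of already-computed quantities is what makes backpropagation efficient.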

3. Optimization Algorithms


Stochastic Gradient Descent (SGD)

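Stochastic Gradient Descent is a commonly used optimization algorithm that applies the gradients computed by backpropagation to update the parameters.

Unlike traditional Batch Gradient Descent, which computes the gradient over the entire training set before each update, Stochastic Gradient Descent uses only one training sample to update parameters at a time.

Each update is therefore much cheaper, which can greatly speed up training, especially on large datasets, although the individual updates are noisier.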

Mini-Batch Gradient Descent

Mini-Batch Gradient Descent is a variant of Stochastic Gradient Descent that uses a small batch of training samples to update parameters at a time.

This can achieve a better balance between training speed and stability.
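
The three variants differ only in how many samples feed each update. The following sketch uses a hypothetical helper, iterate_minibatches, to show this: a batch size of 1 corresponds to SGD, a batch size equal to the dataset size corresponds to batch gradient descent, and anything in between is a mini-batch:

```python
import random

def iterate_minibatches(samples, batch_size, seed=0):
    # Shuffle the sample order once per pass, then yield consecutive chunks.
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]

samples = list(range(10))  # stand-in for 10 training examples

sgd_batches = list(iterate_minibatches(samples, batch_size=1))
mini_batches = list(iterate_minibatches(samples, batch_size=4))
full_batches = list(iterate_minibatches(samples, batch_size=len(samples)))

# Number of parameter updates per pass over the data.
print(len(sgd_batches), len(mini_batches), len(full_batches))  # 10 3 1
```

In a real training loop, each yielded batch would go through one forward pass, one backward pass, and one parameter update.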

Other Optimization Algorithms


Apart from SGD, many other optimization algorithms are widely used:

  • Adam: Combines momentum and adaptive learning rates, and often converges faster than plain SGD in practice.
  • RMSprop: Adaptively adjusts learning rates, suitable for handling non-stationary objectives.
  • Adagrad: Automatically adjusts learning rates for different parameters, suitable for handling sparse data.
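
To make "momentum plus adaptive learning rate" concrete, here is a minimal sketch of the Adam update rule as it is commonly stated, applied to a single parameter. The function name adam_minimize and the toy cost (w - 3)^2 are assumptions for illustration. The values beta1 = 0.9, beta2 = 0.999, and eps = 1e-8 are the commonly used defaults, while lr = 0.1 is chosen only for this toy problem:

```python
import math

def adam_minimize(grad_fn, w, steps=500, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = 0.0  # running average of gradients (the momentum part)
    v = 0.0  # running average of squared gradients (the adaptive part)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Bias correction compensates for m and v starting at zero.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        # The step is scaled per parameter by the recent gradient magnitude.
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Toy cost (w - 3)^2 with gradient 2(w - 3).
w_adam = adam_minimize(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_adam)
```

In practice you would rely on a framework's optimizer rather than hand-writing this loop. The sketch is only meant to show where the two ingredients enter the update.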

4. Code Example


Here's an example of implementing simple backpropagation using PyTorch:

import torch
import torch.nn as nn
import torch.optim as optim

# Define a simple neural network: 10 inputs -> 5 hidden units -> 1 output
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 5)
        self.fc2 = nn.Linear(5, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Create model, loss function and optimizer
model = SimpleNet()
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Simulate some training data
x = torch.randn(100, 10)
y = torch.randn(100, 1)

# Training loop
for epoch in range(100):
    # Forward propagation
    outputs = model(x)
    loss = criterion(outputs, y)

    # Backpropagation and optimization
    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss.backward()        # backpropagation: compute the gradient of the loss for every parameter
    optimizer.step()       # update the parameters using those gradients

    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/100], Loss: {loss.item():.4f}')

5. Frequently Asked Questions


1. How to Choose the Learning Rate?


The learning rate is an important hyperparameter that controls the step size of each parameter update.

A learning rate that is too large may cause training to oscillate or diverge, while a learning rate that is too small may lead to slow convergence.

It is usually necessary to find a suitable learning rate through experimentation.

One common method is learning rate decay, which gradually reduces the learning rate as training progresses.
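
One simple form of learning rate decay is step decay, sketched below. The function name step_decay and its parameters (halving the rate every 10 epochs) are illustrative assumptions, not a recommendation for any particular model:

```python
def step_decay(initial_lr, epoch, drop_factor=0.5, epochs_per_drop=10):
    # Multiply the learning rate by drop_factor once every epochs_per_drop epochs.
    return initial_lr * drop_factor ** (epoch // epochs_per_drop)

schedule = [step_decay(0.1, epoch) for epoch in range(30)]
print(schedule[0], schedule[10], schedule[20])
```

PyTorch ships ready-made schedulers in torch.optim.lr_scheduler (StepLR implements this kind of schedule), so a hand-written function like this is mainly useful for understanding what they do.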

2. How to Avoid Vanishing and Exploding Gradients?


Vanishing gradient and exploding gradient are two common problems in deep neural network training.

Solutions include:

  • Using activation functions like ReLU and avoiding easily saturated activation functions like sigmoid and tanh, whose derivatives shrink toward zero for large inputs.
  • Using gradient clipping to limit the magnitude of gradients, which guards against exploding gradients.
  • Using batch normalization to normalize the input of each layer.
  • Using residual connections, as applied in ResNet.
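
Of these remedies, gradient clipping is the easiest to show in a few lines. The sketch below clips by global norm: if the combined length of all gradients exceeds a threshold, every gradient is scaled down by the same factor so the direction is preserved. The helper name clip_by_global_norm is hypothetical:

```python
import math

def clip_by_global_norm(grads, max_norm):
    # Euclidean norm of all gradients taken together.
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)  # already small enough, leave unchanged
    scale = max_norm / total_norm
    return [g * scale for g in grads]

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)   # norm 5 is rescaled to norm 1
unchanged = clip_by_global_norm([0.3, 0.4], max_norm=1.0)  # norm 0.5 is left alone
print(clipped, unchanged)
```

In PyTorch, the analogous utility is torch.nn.utils.clip_grad_norm_, typically called after loss.backward() and before optimizer.step().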

3. What are the Limitations of Backpropagation?


  • High Computational Cost: Although a backward pass costs only on the same order as a forward pass, it must be repeated for every update and the intermediate activations must be kept in memory, so training deep neural networks is expensive in both time and memory.
  • Prone to Local Optima: The gradient descent algorithm may settle in a local optimum, or stall near a saddle point, instead of reaching the global optimum.
  • Requires Large Amounts of Labeled Data: Networks trained with backpropagation in a supervised setting typically need a large amount of labeled data; otherwise they are prone to overfitting.
  • Difficult to Parallelize: Within a single backward pass, gradients must be computed layer by layer in sequence, which limits how efficiently one pass can be parallelized across layers in large-scale distributed systems.

6. Practical Applications


The backpropagation algorithm has wide applications in multiple fields:

  1. Computer Vision: In tasks such as image classification and object detection, the training of Convolutional Neural Networks (CNNs) relies on backpropagation.
  2. Natural Language Processing: Used for training Recurrent Neural Networks (RNNs) and Transformer models, such as BERT and GPT.
  3. Recommendation Systems: Used in collaborative filtering and deep recommendation models to learn feature representations of users and items.
  4. Financial Forecasting: Used for tasks such as stock price prediction and risk assessment.

7. Latest Developments


Although the backpropagation algorithm has existed for many years, training methods built on it are still actively developing:

  • Adversarial Training: Enhancing model robustness by introducing adversarial samples during training.
  • Meta-learning: Exploring how to learn learning itself, enabling models to adapt to new tasks more quickly.
  • Federated Learning: Allowing multiple clients to jointly train a global model while protecting data privacy.

8. Summary


Backpropagation is the core of neural network training, enabling neural networks to learn from data and continuously optimize.

Although backpropagation has some limitations, it is still one of the most effective methods for training neural networks.

As deep learning continues to develop, we believe that the backpropagation algorithm will see more improvements and applications.

We hope this blog has helped you gain a deeper understanding of the backpropagation algorithm.

If you have any other questions, please feel free to leave a comment below.

