CS 498 Machine Learning System¶
DL computation:

- Transformer = attention + MLPs
- Recurrent Neural Networks
    - LSTM
- Two Key Problems of RNNs
- Attention
- GNN Architecture
- MoE
- Stochastic Gradient Descent (SGD)
These slides function as an introductory roadmap for a course on Machine Learning Systems. Since you are new to this, think of this field as the intersection of "math" (the models) and "computer engineering" (how to run that math efficiently on hardware).
The content is divided into three main pillars: Models (the workloads), Optimization (how they learn), and Frameworks (the software tools like PyTorch).
Deep Learning Workload¶
The "Ingredients" of Deep Learning¶
- Data: The examples you teach the computer with (Images, Text, Audio, etc.).
- Model: The mathematical structure that learns patterns (CNNs, Transformers, etc.).
- Compute: The hardware that does the math (CPUs, GPUs).
The Learning Process (Propagation):
- Forward Propagation: The data goes into the model, passes through layers, and the model makes a guess (e.g., "Is this a cat or a dog?").
- Backward Propagation: The system checks if the guess was wrong (Loss). It then sends a signal backward through the model to adjust the internal dials (parameters/weights) so it makes a better guess next time.
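A minimal PyTorch sketch of one training step, to make the forward/backward loop concrete. The model, dimensions, and data here are invented for illustration:

```python
import torch
import torch.nn as nn

# A tiny made-up classifier: 32 input features -> 2 classes ("cat" vs. "dog")
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 32)           # a batch of 8 fake examples
y = torch.randint(0, 2, (8,))    # their fake labels

logits = model(x)                # forward propagation: the model makes a guess
loss = loss_fn(logits, y)        # how wrong was the guess?
loss.backward()                  # backward propagation: compute gradients
optimizer.step()                 # adjust the weights (the "internal dials")
optimizer.zero_grad()            # clear gradients before the next step
```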
Model Architectures¶
1. Convolutional Neural Networks (CNNs)¶
- Best for: Images (Computer Vision).
- How they work: Imagine looking at a picture through a small square window (a filter) and sliding it across the image. This enables the model to identify local features, such as edges, corners, or textures (see the sketch after this list).
- Key Models mentioned:
- ResNet: Uses "identity connections" (shortcuts) to help train very deep networks without getting stuck.
- U-Net: Used for "segmentation" (outlining exactly where an object is in an image).
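A minimal sketch of the sliding-window idea using a single convolutional layer in PyTorch; the filter size, channel count, and image size are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

# Eight 3x3 filters slide over a 1-channel (grayscale) image
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)

image = torch.randn(1, 1, 28, 28)   # one fake 28x28 grayscale image
features = conv(image)              # each output channel is one filter's response map
print(features.shape)               # torch.Size([1, 8, 28, 28])
```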
2. Recurrent Neural Networks (RNNs)¶
- Best for: Sequences (Text, Audio) where order matters.
- How they work: They process data one step at a time (like reading a sentence word-by-word), keeping an internal "memory" or state of what they have seen so far (see the sketch after this list).
- The Problem: They are slow because they can't do everything at once (lack of parallelizability), and they tend to forget things if the sequence is too long.
- Key Models: LSTM and GRU were invented to help fix the "forgetting" problem.
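A sketch of the step-by-step processing and the internal memory, using PyTorch's built-in LSTM; the dimensions and the fake "sentence" are invented for illustration:

```python
import torch
import torch.nn as nn

# Word vectors of size 16, hidden "memory" of size 32
lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)

sentence = torch.randn(1, 10, 16)     # one fake sentence of 10 word vectors
state = None                          # the memory starts empty
for t in range(sentence.size(1)):     # must read word by word -- not parallel
    word = sentence[:, t:t + 1, :]
    out, state = lstm(word, state)    # output depends on everything seen so far
```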
3. Transformers¶
- Best for: Modern text processing (e.g., ChatGPT, BERT).
- The Big Idea ("Attention"): Instead of reading word-by-word like an RNN, Transformers use Attention. This allows the model to look at all words in a sentence simultaneously and figure out which words relate to each other (e.g., connecting "it" to "the cat"). See the sketch after this list.
- System Benefit: Because they look at everything at once, they are "massively parallelizable," meaning they run very fast on modern hardware (GPUs) compared to RNNs.
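A minimal sketch of scaled dot-product attention, the operation at the heart of a Transformer; the shapes are illustrative, and multi-head projections and masking are omitted:

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    # Every word scores every other word at once -- this is what parallelizes so well
    scores = Q @ K.transpose(-2, -1) / (Q.size(-1) ** 0.5)
    weights = F.softmax(scores, dim=-1)   # how much each word attends to each other word
    return weights @ V                    # mix the value vectors by those weights

Q = K = V = torch.randn(1, 6, 64)         # a fake 6-word sentence of 64-dim vectors
out = attention(Q, K, V)                  # shape: (1, 6, 64)
```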
4. Graph Neural Networks (GNNs)¶
- Best for: Network data (Social networks, molecular structures).
- How they work: They predict properties of a node (e.g., a person) by aggregating information from their neighbors (e.g., their friends).
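A toy sketch of neighbor aggregation in plain PyTorch (one round of mean aggregation over a made-up 4-node graph), not a real GNN library:

```python
import torch

# A made-up graph of 4 nodes, stored as an adjacency matrix
adj = torch.tensor([[0., 1., 1., 0.],
                    [1., 0., 0., 1.],
                    [1., 0., 0., 1.],
                    [0., 1., 1., 0.]])
features = torch.randn(4, 8)       # each node (e.g., a person) has an 8-dim feature

# Each node averages its neighbors' features, then mixes them with its own
neighbor_mean = (adj @ features) / adj.sum(dim=1, keepdim=True)
updated = 0.5 * features + 0.5 * neighbor_mean
```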
5. Mixture-of-Experts (MoE)¶
- The Idea: Instead of one giant model doing everything, you have many smaller "expert" models. For every input, a "Router" decides which experts are best suited to handle the problem.
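A toy sketch of the router idea: score all experts, then run only the top two. The expert count, top-k value, and dimensions are invented for illustration:

```python
import torch
import torch.nn as nn

num_experts, d = 4, 16
experts = nn.ModuleList([nn.Linear(d, d) for _ in range(num_experts)])
router = nn.Linear(d, num_experts)          # scores how suited each expert is

x = torch.randn(1, d)                       # one fake input token
scores = router(x).softmax(dim=-1)
top = scores.topk(k=2, dim=-1)              # only the 2 best experts actually run

out = torch.zeros_like(x)
for i in range(2):
    e = top.indices[0, i].item()            # which expert to use
    out = out + top.values[0, i] * experts[e](x)
```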
Optimization (How the Model Learns)¶
Optimization is the math used to minimize the model's errors (Loss).
1. Gradient Descent (The Basic Way): Imagine you are standing on a misty mountain and want to get to the bottom. You look at the slope under your feet and take a step downhill. When the slope is estimated from a small random batch of data instead of the whole dataset, this is Stochastic Gradient Descent (SGD).
- Problem: If you use small batches of data, your path is "noisy" (jittery), and you might get stuck in small valleys (local minima).
2. Momentum: To fix the jitter, you add "Momentum." Just like a heavy ball rolling down a hill builds up speed and isn't easily deflected by small bumps, this method remembers the direction it was going and keeps moving that way.
3. Adaptive Methods (RMSProp & Adam)
Sometimes you need to take big steps, and sometimes small steps.
- RMSProp: Adapts the step size (learning rate) for each parameter based on how steep the terrain has been recently (a running average of squared gradients).
- Adam: The "gold standard" today. It combines Momentum (velocity) and RMSProp (adaptive steps) to learn very efficiently. A worked sketch of all three update rules follows below.
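A worked sketch of the three update rules written out by hand for a single weight vector; the gradient is fake and the hyperparameters are just typical defaults, not prescriptions:

```python
import torch

w = torch.zeros(3)                         # one small weight vector
grad = torch.tensor([0.4, -1.2, 0.1])      # a fake, noisy gradient from a small batch
lr = 0.1

# 1. Plain SGD: just step downhill
w_sgd = w - lr * grad

# 2. Momentum: keep a running "velocity" so small bumps don't deflect you
velocity = torch.zeros(3)
velocity = 0.9 * velocity + grad
w_momentum = w - lr * velocity

# 3. Adam: momentum (m) plus a per-weight adaptive step size (v)
m, v, t = torch.zeros(3), torch.zeros(3), 1
m = 0.9 * m + 0.1 * grad                   # running average of the gradient
v = 0.999 * v + 0.001 * grad ** 2          # running average of its squared size
m_hat, v_hat = m / (1 - 0.9 ** t), v / (1 - 0.999 ** t)
w_adam = w - 0.001 * m_hat / (v_hat.sqrt() + 1e-8)
```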
Frameworks (The Software)¶
Finally, the slides discuss the tools we use to program these models, like TensorFlow and PyTorch.
Computational Graphs: Deep Learning frameworks represent your code as a "graph" where the nodes are math operations (add, multiply) and the edges are the data (tensors) flowing between them.
Symbolic vs. Imperative
- Symbolic (TensorFlow v1): You define the entire graph structure first, then run it. It's harder to debug but easier for the computer to optimize (make faster).
- Imperative (PyTorch): You run the math line-by-line, just like Python. It is flexible and easy to debug, but historically harder to optimize automatically.
The Modern Solution: Just-in-Time (JIT) Compilation. New tools like torch.compile (PyTorch 2.0) try to give you the best of both worlds: the ease of Python with the speed of optimized graphs.
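A minimal example of that idea: the model is written imperatively, and torch.compile (available in PyTorch 2.0+) captures and optimizes the underlying graph:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))

# Write the code line-by-line as usual; torch.compile turns it into an optimized graph
compiled_model = torch.compile(model)

x = torch.randn(8, 32)
y = compiled_model(x)    # the first call compiles; later calls reuse the fast graph
```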