ECE408/CS483 Applied Parallel Programming¶
https://canvas.illinois.edu/courses/60979/assignments/syllabus
Introduction¶
CPU (central processing unit)
GPU (graphics processing unit)
Post-Dennard technology pivot – parallelism and heterogeneity¶
Moore's Law (an imperative) drove feature sizes down, doubling the number of transistors per unit area every 18-24 months
Dennard Scaling (based on physics) drove clock speeds up
- Exponential increase in clock speed
- Ended around 2005-2006
Two design trajectories emerged after the pivot:
- multicore: optimizes the execution speed of sequential programs
- many-thread: optimizes the execution throughput of parallel applications
CPU vs GPU¶
CPU | GPU |
---|---|
A few powerful ALUs (Arithmetic Logic Units): reduced operation latency | Many small ALUs: long latency, high throughput; heavily pipelined for further throughput |
Large caches: convert long-latency memory accesses into short-latency cache accesses | Small caches: more area dedicated to computation |
Sophisticated control: branch prediction to reduce control hazards, data forwarding to reduce data hazards | Simple control: more area dedicated to computation |
Modest multithreading to hide short latency | A massive number of threads to hide the very high latency! |
High clock frequency | Moderate clock frequency |
CPUs for sequential parts where latency hurts
- CPUs can be 10+X faster than GPUs for sequential code
GPUs for parallel parts where throughput wins
- GPUs can be 10+X faster than CPUs for parallel code
Parallel Programming Frameworks¶
[!NOTE]
Why GPUs?
Why repurpose a graphics processing architecture instead of designing a throughput-oriented architecture from scratch?
- Chips are expensive to build and require a large volume of sales to amortize the cost
- This makes the chip market very difficult to penetrate
- When parallel computing became mainstream, GPUs already had (and still have) a large installed base from the gaming sector
Parallel Computing Challenges¶
Massive Parallelism demands Regularity -> Load Balance
Global Memory Bandwidth -> Ideal vs. Reality
Conflicting Data Accesses Cause Serialization and Delays
- Massively parallel execution cannot afford serialization
- Contention in accessing critical data causes serialization
Parallel Computing Pitfall¶
Consider an application where:
- The sequential execution time is 100s
- The fraction of execution that is parallelizable is 90%
- The speedup achieved on the parallelizable part is 1000×
What is the overall speedup of the application?

$$ t_{parallel}=(1-0.9)\times 100s+\frac{0.9 \times 100s}{1000}=10.09s $$

$$ speedup=\frac{t_{sequential}}{t_{parallel}}=\frac{100s}{10.09s}\approx 9.91\times $$
Amdahl's Law¶
The maximum speedup of a parallel program is limited by the fraction of execution that is parallelizable, namely, \(speedup<\frac{1}{1-p}\)
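In general, with parallelizable fraction \(p\) and a speedup of \(s\) on the parallelizable part, the overall speedup is

$$ speedup=\frac{1}{(1-p)+\frac{p}{s}}\le\frac{1}{1-p} $$

so even an arbitrarily large \(s\) cannot overcome the serial fraction \(1-p\).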
Introduction to CUDA C and Data Parallel Programming¶
Types of Parallelism¶
Task Parallelism | Data Parallelism |
---|---|
Different operations performed on same or different data | Same operations performed on different data |
Usually, a modest number of tasks, unleashing a modest amount of parallelism | Potentially massive amounts of data, unleashing massive amounts of parallelism (most suitable for GPUs) |
CUDA/OpenCL Execution Model¶
Integrated Host + Device Application (C Program)
- The execution starts with host code (CPU serial code).
- When a kernel function is called, a large number of threads are launched on a device to execute the kernel. All the threads that are launched by a kernel call are collectively called a grid.
- These threads are the primary vehicle of parallel execution in a CUDA platform
- When all threads of a grid have completed their execution, the grid terminates, and the execution continues on the host until another grid is launched
- Host Code (C): handles serial or modestly parallel tasks
- Device Kernel (C, SPMD model): executes highly parallel sections of the program
Threads¶
A CUDA kernel is executed as a grid(array) of threads
- All threads in the same grid run the same kernel
- Single Program Multiple Data (SPMD model)
- Each thread has a unique index that it uses to compute memory addresses and make control decisions
Thread as a basic unit of computing
- Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization.
- Threads in different blocks cooperate less.
- Thread block and thread organization simplify memory addressing when processing multidimensional data
Vector Addition¶
We use vector addition to demonstrate the CUDA C program structure.
A simple traditional vector addition C code example.
Host variable names are suffixed with `_h`, and device variable names are suffixed with `_d`, as in the sketch below.
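A minimal sketch of the traditional sequential version, assuming float vectors of length `n` (the exact course code may differ):

```c
#include <stdlib.h>

// Sequential vector addition on the host: C_h = A_h + B_h
void vecAdd(float* A_h, float* B_h, float* C_h, int n) {
    for (int i = 0; i < n; ++i) {
        C_h[i] = A_h[i] + B_h[i];
    }
}

int main() {
    int n = 1000;
    float* A_h = (float*) malloc(n * sizeof(float));
    float* B_h = (float*) malloc(n * sizeof(float));
    float* C_h = (float*) malloc(n * sizeof(float));
    // ... initialize A_h and B_h ...
    vecAdd(A_h, B_h, C_h, n);
    // ... use C_h ...
    free(A_h); free(B_h); free(C_h);
    return 0;
}
```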
System Organization¶
The CPU and GPU have separate memories and cannot access each other's memory
- Need to transfer data between them (a five-step process, illustrated in the figure below)
A vector addition kernel¶
Outline of a revised vecAdd function that moves the work to a device.
vector A + B = vector C
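A minimal sketch of that outline, assuming a kernel named `vecAddKernel` (shown later):

```c
void vecAdd(float* A_h, float* B_h, float* C_h, int n) {
    // Part 1: allocate device global memory for A, B, and C; copy A_h and B_h to the device
    // Part 2: launch the kernel -- the device performs the actual vector addition
    // Part 3: copy the result from the device back to C_h; free device memory
}
```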
Device code can:
- R/W per-thread registers
- R/W per-grid global memory
Host code can transfer data to/from per-grid global memory
CUDA Device Memory Management API¶
API for managing device global memory¶
Allocating memory
Deallocating memory
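A minimal usage sketch of the two functions (`A_d`, `size`, and `n` are illustrative names):

```c
int n = 1000;
int size = n * sizeof(float);
float* A_d;

// cudaMalloc(void** devPtr, size_t size): allocate `size` bytes of device
// global memory and store the device address in *devPtr
cudaMalloc((void**)&A_d, size);

// cudaFree(void* devPtr): free previously allocated device global memory
cudaFree(A_d);
```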
- Pointer variables that point to objects in device global memory are suffixed with `_d`. The addresses in `A_d`, `B_d`, and `C_d` point to locations in device global memory; they should not be dereferenced in host code. They should only be used when calling API functions and kernel functions.
Copying memory
- `dst`: destination memory address
- `src`: source memory address
- `count`: size in bytes to copy
- `kind`: type of transfer
    - `cudaMemcpyHostToHost`
    - `cudaMemcpyHostToDevice`
    - `cudaMemcpyDeviceToHost`
    - `cudaMemcpyDeviceToDevice`
Return type: cudaError_t
- Helps with error checking (discussed later)
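A minimal usage sketch (variable names follow the convention above):

```c
// cudaMemcpy(void* dst, const void* src, size_t count, cudaMemcpyKind kind)
cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);   // host   -> device
cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost);   // device -> host
```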
vecAdd Host Code
Full version:
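A sketch of the full host function, assuming the `vecAddKernel` kernel shown later and 256 threads per block (the exact numbers in the course code may differ):

```c
void vecAdd(float* A_h, float* B_h, float* C_h, int n) {
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;

    // Part 1: allocate device memory and copy the inputs to the device
    cudaMalloc((void**)&A_d, size);
    cudaMalloc((void**)&B_d, size);
    cudaMalloc((void**)&C_d, size);
    cudaMemcpy(A_d, A_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_d, B_h, size, cudaMemcpyHostToDevice);

    // Part 2: launch the grid (ceiling division so all n elements are covered)
    int numThreadsPerBlock = 256;
    int numBlocks = (n + numThreadsPerBlock - 1) / numThreadsPerBlock;
    vecAddKernel<<<numBlocks, numThreadsPerBlock>>>(A_d, B_d, C_d, n);

    // Part 3: copy the result back to the host and free device memory
    cudaMemcpy(C_h, C_d, size, cudaMemcpyDeviceToHost);
    cudaFree(A_d);
    cudaFree(B_d);
    cudaFree(C_d);
}
```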
Simple strategy of Parallel Vector Addition: assign one GPU thread per vector element
Launching a Grid¶
Threads in the same grid execute the same function known as a kernel
A grid can be launched by calling a kernel and configuring it with appropriate grid and block sizes:
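A minimal launch sketch (names such as `numThreadsPerBlock` follow the text below):

```c
int numThreadsPerBlock = 512;
int numBlocks = n / numThreadsPerBlock;   // integer division -- see the caveat below
vecAddKernel<<<numBlocks, numThreadsPerBlock>>>(A_d, B_d, C_d, n);
```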
If n is not a multiple of `numThreadsPerBlock`, fewer threads will be launched than desired
- Solution: use the ceiling to launch extra threads then omit the threads after the boundary:
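A sketch of the ceiling-based launch together with the boundary check inside the kernel:

```c
// Launch enough blocks to cover all n elements (ceiling division)
int numBlocks = (n + numThreadsPerBlock - 1) / numThreadsPerBlock;
vecAddKernel<<<numBlocks, numThreadsPerBlock>>>(A_d, B_d, C_d, n);

// Inside the kernel, threads past the end of the vector do nothing:
//   if (i < n) { C[i] = A[i] + B[i]; }
```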
More Ways to Compute Grid Dimensions
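Equivalent ways to compute the ceiling (a sketch):

```c
int numBlocks  = (n + numThreadsPerBlock - 1) / numThreadsPerBlock;   // pure integer arithmetic
int numBlocks2 = (int) ceil((float) n / numThreadsPerBlock);          // using ceil() from <math.h>
```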
Vector Addition Kernel¶
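A sketch of the kernel, one thread per output element with a boundary guard (`vecAddKernel` is the name assumed in the host code above):

```c
// Each thread computes one element of C; threads with i >= n do nothing
__global__ void vecAddKernel(float* A, float* B, float* C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {
        C[i] = A[i] + B[i];
    }
}
```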
Compiling A CUDA Program¶
Function Declarations in CUDA¶
`__global__` defines a kernel function

`__device__` and `__host__` can be used together
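A sketch of how the qualifiers are used (the function names are made up for illustration):

```c
__global__ void kernelFunc(float* data);      // executed on the device, callable from the host (a kernel)
__device__ float deviceFunc(float x);         // executed on the device, callable only from device code
__host__   float hostFunc(float x);           // executed on the host, callable from the host (the default)
__host__ __device__ float bothFunc(float x);  // compiled for both host and device
```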
More on Function Declarations¶
The keyword `__host__` is useful when needing to mark a function as executable on both the host and the device
Asynchronous Kernel Calls¶
By default, kernel calls are asynchronous
- Useful for overlapping GPU computations with CPU computations
Use the following API function to wait for the kernel to finish
- Blocks until the device has completed all preceding requested tasks
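A minimal sketch:

```c
vecAddKernel<<<numBlocks, numThreadsPerBlock>>>(A_d, B_d, C_d, n);  // returns immediately
// ... the host can do other work here ...
cudaDeviceSynchronize();  // blocks until all preceding device tasks have completed
```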
Error Checking¶
All CUDA API calls return an error code `cudaError_t` that can be used to check if any errors occurred

For kernel calls, one can check the error returned by `cudaDeviceSynchronize()` or call the following API function: `cudaError_t cudaGetLastError()`
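A common checking pattern (a sketch, not necessarily the course's exact macro; requires `<stdio.h>` and `<stdlib.h>`):

```c
cudaError_t err = cudaMalloc((void**)&A_d, size);
if (err != cudaSuccess) {
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}

// After a kernel launch:
vecAddKernel<<<numBlocks, numThreadsPerBlock>>>(A_d, B_d, C_d, n);
err = cudaGetLastError();        // reports launch-configuration errors
if (err != cudaSuccess) { /* handle error */ }
err = cudaDeviceSynchronize();   // reports errors that occur during kernel execution
if (err != cudaSuccess) { /* handle error */ }
```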
Problems¶
CUDA Parallel Execution Model: Multidimensional Grids & Data¶
CUDA Thread Grids are Multi-Dimensional¶
CUDA supports multidimensional grids (up to 3D)
Each CUDA kernel is executed by a grid,
- a 3D array of thread blocks, which are 3D arrays of threads.
- Each thread executes the same program on distinct data inputs, a single-program, multiple-data (SPMD) model
Grid - block - thread hierarchy, with the corresponding built-in variables `gridDim` - `blockIdx` - `threadIdx`
One Dimensional Indexing¶
Defining a working set for a thread
i = blockIdx.x * blockDim.x + threadIdx.x;
Multidimensional Indexing¶
Defining a working set for a thread
row = blockIdx.y * blockDim.y + threadIdx.y;
col = blockIdx.x * blockDim.x + threadIdx.x;
Configuring Multidimensional Grids¶
Use built-in `dim3` type¶
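A sketch of a 2D configuration (the 16×16 block size, image dimensions, and kernel name are illustrative; the kernel matches the grayscale sketch below):

```c
dim3 numThreadsPerBlock(16, 16, 1);
dim3 numBlocks((width  + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x,
               (height + numThreadsPerBlock.y - 1) / numThreadsPerBlock.y,
               1);
rgb2grayKernel<<<numBlocks, numThreadsPerBlock>>>(gray_d, rgb_d, width, height);
```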
Layout of Multidimensional Data¶
- The convention in C is to store data in row-major order
- Elements in the same row are contiguous in memory
index = row * width + col
RGB to Gray-Scale Kernel Implementation¶
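A sketch of the kernel, assuming an interleaved 3-channel RGB input in row-major order and the common luminance weights (0.21, 0.71, 0.07); the course code may differ in details:

```c
__global__ void rgb2grayKernel(unsigned char* gray, unsigned char* rgb,
                               int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width) {
        int i = row * width + col;           // row-major index of the output pixel
        unsigned char r = rgb[3 * i];
        unsigned char g = rgb[3 * i + 1];
        unsigned char b = rgb[3 * i + 2];
        gray[i] = (unsigned char)(0.21f * r + 0.71f * g + 0.07f * b);
    }
}
```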
Blur Kernel Implementation¶
Parallelization approach: assign one thread to each output pixel, and have it read multiple input pixels
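A sketch of a simple box-blur kernel under that approach (`BLUR_SIZE` is an assumed compile-time constant for the window half-width):

```c
#define BLUR_SIZE 1   // (2*BLUR_SIZE+1) x (2*BLUR_SIZE+1) averaging window

__global__ void blurKernel(unsigned char* in, unsigned char* out, int w, int h) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < w && row < h) {
        int pixVal = 0;
        int pixels = 0;
        // Read the neighborhood around (row, col), skipping out-of-bounds pixels
        for (int blurRow = -BLUR_SIZE; blurRow <= BLUR_SIZE; ++blurRow) {
            for (int blurCol = -BLUR_SIZE; blurCol <= BLUR_SIZE; ++blurCol) {
                int curRow = row + blurRow;
                int curCol = col + blurCol;
                if (curRow >= 0 && curRow < h && curCol >= 0 && curCol < w) {
                    pixVal += in[curRow * w + curCol];
                    ++pixels;
                }
            }
        }
        out[row * w + col] = (unsigned char)(pixVal / pixels);
    }
}
```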
[!NOTE]
Rule of thumb: every memory access must have a corresponding guard that compares its indexes to the array dimensions
Matrix-Matrix Multiplication¶
Given two N × N matrices, A and B, we can multiply A by B to compute a third N × N matrix, P: \(P = AB\)
Matrix multiplication: each element of P is the dot product of one row of A and one column of B.
- Parallelization approach: assign one thread to each element of the output matrix P (see the kernel sketch below)
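A sketch of the kernel under that approach (one thread per element of P, row-major layout; the kernel name is made up):

```c
__global__ void matMulKernel(float* A, float* B, float* P, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        // Dot product of row `row` of A with column `col` of B
        for (int k = 0; k < N; ++k) {
            sum += A[row * N + k] * B[k * N + col];
        }
        P[row * N + col] = sum;
    }
}
```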