High Performance Computing 25 SP NVIDIA


Fxxk you, NVIDIA!

CPU/GPU Parallelism:

Moore's Law gives you more and more transistors:

  • CPU strategy: make a single thread of work run as fast as possible (minimize latency).
  • GPU strategy: run as many threads as possible concurrently (maximize throughput).

GPU Architecture:

  • Massively Parallel
  • Power Efficient
  • Memory Bandwidth
  • Commercially Viable Parallelism
  • Not dependent on large caches for performance

image-20250424192311202

Nvidia GPU Generations

  • 2006: G80-based GeForce 8800
  • 2008: GT200-based GeForce GTX 280
  • 2010: Fermi
  • 2012: Kepler
  • 2014: Maxwell
  • 2016: Pascal
  • 2017: Volta
  • 2021: Ampere
  • 2022: Hopper
  • 2024: Blackwell

2006: G80 Terminology

SP: Streaming Processor, scalar ALU for a single CUDA thread

SPA: Stream Processor Array

SM: Streaming Multiprocessor, consisting of 8 SPs

TPC: Texture Processor Cluster: 2 SM + TEX

image-20250424192825010

Design goal: performance per square millimeter of die area.

For GPUs, performance means throughput, so latency is hidden with computation rather than with large caches.

This leads to the single-instruction, multiple-thread (SIMT) execution model.

Thread Life Cycle:

A grid is launched on the SPA, and its thread blocks are serially distributed across all the SMs.

image-20250424193125125

SIMT Thread Execution:

Groups of 32 threads are formed into warps. Threads in the same warp always execute the same instruction; some threads become inactive when code paths diverge, and the hardware handles this divergence automatically.

Warps are the primitive unit of scheduling.

SIMT execution is an implementation choice: sharing control logic across a warp leaves more die area for ALUs.
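A minimal CUDA sketch of warp divergence (the kernel name and data layout are illustrative, not from the lecture): when a data-dependent branch splits a 32-thread warp, the hardware runs each path in turn with the non-participating threads masked off.

```cpp
// Threads 0-15 of each warp take the "if" path, threads 16-31 take the "else"
// path, so the two halves are executed serially with the other half inactive.
__global__ void divergent(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if ((tid % 32) < 16)
        out[tid] = 1.0f;   // first half of the warp active
    else
        out[tid] = 2.0f;   // second half of the warp active
}
```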

SM Warp Scheduling:

SM hardware implements zero-overhead warp scheduling:

  • Warps whose next instruction has its operands ready for consumption are eligible for execution.
  • Eligible warps are selected for execution on a prioritized scheduling policy.

Suppose 4 clock cycles are needed to dispatch the same instruction to all threads in a warp, one global memory access is needed every 4 instructions, and the memory latency is 200 cycles. Then 200 / (4 × 4) = 12.5, so 13 warps are needed to fully tolerate the memory latency.

The SM warp scheduler tracks operand readiness with a scoreboard and similar mechanisms.

Granularity Consideration:

Consider that in the G80 GPU, one SM can run up to 768 threads and 8 thread blocks. The best tile size for matrix multiplication is therefore 16 × 16 = 256 threads per block, since three such blocks fit on one SM and fully use its 768 threads.
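As a hedged illustration of this tiling choice (the kernel name and memory layout are mine, not from the course), a 16 × 16 tiled matrix multiply looks roughly like this; each block has 256 threads, so three blocks fill the 768 threads of a G80 SM:

```cpp
#define TILE 16   // 16 x 16 = 256 threads per block; 3 blocks use all 768 threads of an SM

// C = A * B for N x N row-major matrices, assuming N is a multiple of TILE.
__global__ void matmulTiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Each thread loads one element of the A tile and one of the B tile.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```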

2008: GT200 Architecture

image-20250424195111341

2010: Fermi GF100 GPU

Fermi SM:

image-20250424195221886

There are 32 cores per SM and 512 cores in total, and Fermi introduces 64 KB of configurable L1 cache / shared memory.

Internal execution resources are decoupled, and dual-issue pipelines select two warps per cycle.

Fermi also debuts the Parallel Thread eXecution (PTX) 2.0 ISA.

2012: Kepler GK110

image-20250424200022880

2014: Maxwell

4 GPCs and 16 SMMs.

image-20250424200330783

2016: Pascal

Nothing in particular to note here.

2017: Volta

Volta first introduces the Tensor Core, a dedicated hardware unit for matrix multiplication.

2021: Ampere

The GA100 SM:

image-20250508183446257

2022: Hopper

Hopper introduces the GH200 Grace Hopper Superchip:

image-20250508183528381

The superchip combines a CPU and a GPU linked by NVLink.

This system can scale out for machine learning workloads.

image-20250508183724162

Memory access across the NVLink:

  • GPU to local CPU
  • GPU to peer GPU
  • GPU to peer CPU

image-20250508183931464

These accesses are handled by hardware-accelerated memory coherency. Previously, the CPU and GPU kept separate page tables; now they can share a single page table, so the GPU can directly access memory on both the CPU and the GPU side.

image-20250508184155087

2024: Blackwell

image-20250508184455215

Compute Capability

Compute capability is a software-visible version number that identifies the features and specifications of the hardware.
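For reference, the compute capability can be queried at runtime through the standard CUDA runtime API; a small self-contained sketch:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);               // properties of device 0
    std::printf("%s: compute capability %d.%d\n",
                prop.name, prop.major, prop.minor);  // e.g. 7.0 for Volta
    return 0;
}
```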

G80 Memory Hierarchy

Memory Space

Each thread can

  • Read and write per-thread registers.
  • Read and write per-thread local memory.
  • Read and write per-block shared memory.
  • Read and write per-grid global memory.
  • Read only per-grid constant memory.
  • Read only per-grid texture memory.

image-20250508185236920
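A minimal sketch of how these memory spaces appear in CUDA C++ (the kernel and variable names are illustrative only, and a 256-thread block is assumed):

```cpp
__constant__ float coeff[16];            // per-grid constant memory, read-only in kernels

__global__ void memorySpaces(const float *in, float *out)   // in/out point to global memory
{
    __shared__ float tile[256];          // per-block shared memory
    float r = in[threadIdx.x];           // r lives in a per-thread register
    float spill[64];                     // large per-thread arrays typically spill to local memory

    tile[threadIdx.x] = r * coeff[threadIdx.x % 16];
    __syncthreads();
    spill[threadIdx.x % 64] = tile[(threadIdx.x + 1) % 256];
    out[threadIdx.x] = spill[threadIdx.x % 64];
}
```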

Parallel Memory Sharing:

  • Local memory is per-thread and mainly holds automatic variables and register spills.
  • Shared memory is per-block and can be used for inter-thread communication.
  • Global memory is per-application and can be used for inter-grid communication.

SM Memory Architecture

image-20250508185812302

Threads in a block share data and results in memory and shared memory.

Shared memory is dynamically allocated to blocks and is one of the limiting resources.

SM Register File

Register File (RF): each SM in the G80 GPU has 32 KB of registers, i.e. 8192 entries.

The TEX pipeline and the load/store pipeline can read and write the register file.

Registers are dynamically partitioned across all blocks assigned to the SM. Once assigned to a block, a register is not accessible by threads in other blocks, and each thread can only access the registers assigned to itself.

For a matrix multiplication example:

  • If one thread uses 10 registers and one block has 16 × 16 threads, each SM can hold three thread blocks, since one block needs 16 × 16 × 10 = 2,560 registers and 3 × 2,560 = 7,680 < 8,192.
  • But if each thread needs 11 registers, one SM can hold only two blocks at a time, since 3 × 2,816 = 8,448 > 8,192.

More on dynamic partitioning: it gives compilers and programmers the flexibility to choose between

  1. A smaller number of threads that each require many registers.
  2. A larger number of threads that each require few registers.

So there is a tradeoff between instruction level parallelism and thread level parallelism.

Parallel Memory Architecture

In a parallel machine, many threads access memory simultaneously, so memory is divided into banks to achieve high bandwidth.

Each bank can service one address per cycle; multiple simultaneous accesses to the same bank result in a bank conflict.

Shared memory bank conflicts:

  • The fast cases:
    • All threads of a half-warp access different banks: there is no bank conflict.
    • All threads of a half-warp access the identical address: there is no bank conflict (the value is broadcast).
  • The slow cases:
    • Multiple threads in the same half-warp access the same bank, so the accesses are serialized.
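A short sketch of these cases in shared memory (assuming the classic 32-bank layout; the kernel and array names are illustrative):

```cpp
__global__ void bankAccess(float *out)
{
    __shared__ float data[32 * 32];
    int tid = threadIdx.x;

    data[tid] = (float)tid;          // fill a little data so the reads below are defined
    __syncthreads();

    float a = data[tid];             // fast: stride-1, every thread hits a different bank
    float b = data[0];               // fast: identical address, served by a broadcast
    float c = data[tid * 32];        // slow: stride-32 maps all threads onto the same bank

    out[tid] = a + b + c;
}
```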

Memory in Later Generations

Fermi Architecture

The unified addressing model allows local, shared, and global memory to be accessed through a single address space.

image-20250508193756274

Configurable caches allow programmers to choose how the on-chip storage is split between L1 cache and shared memory.

The L1 cache works as a counterpart to shared memory:

    • Shared memory improves memory access for algorithms with well-defined access patterns.
    • The L1 cache improves memory access for irregular algorithms whose data addresses are not known beforehand.
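The preferred split can be hinted per kernel with the CUDA runtime API; a rough sketch (the two kernels are placeholders of mine):

```cpp
#include <cuda_runtime.h>

__global__ void regularKernel(float *out)   { /* well-defined, tiled accesses */ }
__global__ void irregularKernel(float *out) { /* unpredictable, pointer-chasing accesses */ }

void configureCaches()
{
    // Prefer a large shared memory partition for the kernel with regular accesses...
    cudaFuncSetCacheConfig(regularKernel, cudaFuncCachePreferShared);
    // ...and a large L1 cache for the kernel whose addresses are not known beforehand.
    cudaFuncSetCacheConfig(irregularKernel, cudaFuncCachePreferL1);
}
```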

Pascal Architecture

High Bandwidth Memory (HBM): a technology that stacks multiple layers of DRAM components vertically on the same package as the GPU.

image-20250508194350572

Unified Memory provides a single, unified virtual address space for accessing all CPU and GPU memory in the system.

With memory page faulting, the CUDA system software no longer needs to synchronize all managed memory allocations to the GPU before each kernel launch; pages are migrated on demand.
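A small end-to-end sketch of Unified Memory with on-demand page migration (the allocation size and kernel are illustrative):

```cpp
#include <cuda_runtime.h>

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));   // one pointer, valid on CPU and GPU

    for (int i = 0; i < n; ++i) data[i] = 1.0f;    // pages first touched on the CPU
    scale<<<(n + 255) / 256, 256>>>(data, n);      // pages fault and migrate to the GPU
    cudaDeviceSynchronize();
    bool ok = (data[0] == 2.0f);                   // pages fault back to the CPU here

    cudaFree(data);
    return ok ? 0 : 1;
}
```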

Advanced GPU Features

GigaThread

GigaThread enables concurrent kernel execution:

image-20250508195840957

It also provides dual streaming data-transfer engines to enable streaming data transfers, i.e. direct memory access (DMA).

image-20250508195938546
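A rough two-stream sketch of this overlap (buffer names and sizes are mine; the host buffer is assumed to be pinned, e.g. with cudaHostAlloc, so the copy engines can stream it):

```cpp
#include <cuda_runtime.h>

__global__ void process(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

// While stream s0 computes on its chunk, a copy engine can already stream the next
// chunk for s1, overlapping DMA transfers with kernel execution.
void pipelined(float *pinnedHostIn, float *devBuf0, float *devBuf1, int chunk)
{
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    cudaMemcpyAsync(devBuf0, pinnedHostIn, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, s0);
    process<<<(chunk + 255) / 256, 256, 0, s0>>>(devBuf0, chunk);

    cudaMemcpyAsync(devBuf1, pinnedHostIn + chunk, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, s1);
    process<<<(chunk + 255) / 256, 256, 0, s1>>>(devBuf1, chunk);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}
```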

GPUDirect

image-20250508200041910

GPU Boost

GPU Boost works through real-time hardware monitoring rather than application-based profiles, attempting to find the appropriate GPU frequency and voltage at each moment in time.

SMX Architectural Details

Each SMX contains four warp schedulers.

Scheduling functions:

  • Register scoreboard for long latency operations.
  • Inter-warp scheduling decisions.
  • Thread block level scheduling.

Improving Programmability

image-20250515183524043

Dynamic Parallelism: The ability to launch new grids from the GPU.

This enables data-dependent parallelism, dynamic work generation, and even batched and nested parallelism.

CPU-controlled work batching:

  • CPU program limited by single point of control.
  • Can run at most 10s of threads.
  • CPU is fully consumed with controlling launches.

Batching via dynamic parallelism:

  • Move top-level loops to GPUs.
  • Run thousands of independent tasks.
  • Release CPU for other work.

image-20250515184621914
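A hedged sketch of batching via dynamic parallelism (the task layout and names are mine; this requires compute capability 3.5 or newer and compilation with -rdc=true):

```cpp
__global__ void childTask(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// The top-level batching loop runs on the GPU: each thread launches one child grid,
// so thousands of independent tasks can be generated without the CPU in the loop.
__global__ void parentBatcher(float **taskData, const int *taskSizes, int numTasks)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < numTasks) {
        int n = taskSizes[t];
        childTask<<<(n + 255) / 256, 256>>>(taskData[t], n);   // device-side launch
    }
}
```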

Grid Management Unit

image-20250515184714663

Fermi Concurrency:

  • Up to 16 grids can run at once.
  • But CUDA streams are multiplexed into a single hardware queue.
  • Kernels can overlap only at stream edges.

Kepler Improved Concurrency:

  • Up to 32 grids can run at once.
  • One work queue per stream.
  • Concurrency at full-stream level.
  • No false inter-stream dependencies.

This feature is called Hyper-Q.

Without Hyper-Q:

image-20250515185019590

With Hyper-Q:

image-20250515185034758

In Pascal, asynchronous concurrent compute is introduced.

image-20250515185801775

image-20250515185212184

The "consumer" chips are designed for gamers.

The "big" chips are designed for HPC.

Preemption

Pascal can actually preempt at the lowest level, the instruction level.

image-20250515190244112

Tensor Core

Each Tensor Core operates on 4 × 4 matrices and performs D = A × B + C.

image-20250515190507199
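Each individual Tensor Core works on 4 × 4 tiles, but the CUDA-facing WMMA interface exposes them as a warp-level 16 × 16 × 16 multiply-accumulate; a minimal sketch (requires sm_70 or newer, matrices assumed row-major):

```cpp
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes D = A x B + C on a 16x16x16 half-precision tile via Tensor Cores.
__global__ void tensorCoreMMA(const half *A, const half *B, const float *C, float *D)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

    wmma::load_matrix_sync(aFrag, A, 16);                  // 16 = leading dimension
    wmma::load_matrix_sync(bFrag, B, 16);
    wmma::load_matrix_sync(cFrag, C, 16, wmma::mem_row_major);
    wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);            // the D = A x B + C step
    wmma::store_matrix_sync(D, cFrag, 16, wmma::mem_row_major);
}
```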

GPU Multi-Process Scheduling

  • Timeslice scheduling: single process throughput optimization.
  • Multi process service: multi-process throughput optimization.

What about multi-process time slicing:

image-20250515190703918

Volta introduces the Multi-Process Service (MPS):

image-20250515191142384
