Understanding FlashAttention-3 on NVIDIA Hopper

The attention mechanism in transformer models is notoriously memory-bound. Self-attention compares every query against every key, so the matrix of attention scores grows quadratically with sequence length. On modern GPUs, the math itself runs very fast on specialized Tensor Cores. The slowdown comes when a standard implementation writes these intermediate score matrices out to high-bandwidth memory (HBM) and reads them back in for the next step.

The original FlashAttention algorithm removed much of this overhead through tiling: it loads blocks of data into the GPU's fast on-chip SRAM, computes attention locally, and writes only the final output back to HBM. FlashAttention-2 improved on this by partitioning work more effectively across thread blocks and warps. However, the NVIDIA Hopper architecture (H100) introduced new hardware capabilities that call for rethinking how the kernel interacts with the hardware. FlashAttention-3 adapts directly to these microarchitectural changes and reaches a much larger fraction of Hopper's peak throughput.
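The tiling idea can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not the CUDA kernel: the function names are hypothetical, it handles one query row at a time (real kernels also tile the queries), and it uses lists instead of GPU memory. The point is that the tiled version never holds a full row of scores. It keeps a running maximum, a running softmax denominator, and a running output, and rescales them as each key/value block arrives.

```python
import math

def naive_attention(q, k, v):
    """Reference version: builds the full row of scores for each query."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        total = sum(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) / total
                    for c in range(len(v[0]))])
    return out

def tiled_attention(q, k, v, block=2):
    """Streams K/V in blocks, keeping only running statistics per query."""
    d = len(q[0])
    dv = len(v[0])
    out = []
    for qi in q:
        m = float("-inf")   # running max of the scores seen so far
        l = 0.0             # running softmax denominator
        acc = [0.0] * dv    # running unnormalized output row
        for start in range(0, len(k), block):
            kb = k[start:start + block]
            vb = v[start:start + block]
            s = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in kb]
            m_new = max(m, max(s))
            correction = math.exp(m - m_new)  # rescales earlier partial results
            p = [math.exp(x - m_new) for x in s]
            l = l * correction + sum(p)
            acc = [a * correction + sum(pj * vj[c] for pj, vj in zip(p, vb))
                   for c, a in enumerate(acc)]
            m = m_new
        out.append([a / l for a in acc])
    return out

# Small made-up inputs: 2 queries, 5 keys/values.
q = [[1.0, 0.0], [0.5, -1.0]]
k = [[1.0, 2.0], [0.0, 1.0], [-1.0, 0.5], [2.0, -1.0], [0.3, 0.3]]
v = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [3.0, 1.0, 0.0],
     [1.0, 1.0, 1.0], [0.5, 2.0, -1.0]]
print(naive_attention(q, k, v))
print(tiled_attention(q, k, v, block=2))
```

Both calls should print the same values up to floating-point rounding, regardless of the block size chosen.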

The Bottleneck: Waiting on High-Bandwidth Memory

On the latest accelerator chips, compute capacity has scaled much faster than memory bandwidth. The GPU can perform arithmetic operations at blistering speeds, but feeding those processing units with data is a constant challenge. When a GPU executes standard attention math, the Tensor Cores spend a significant amount of time sitting idle, waiting for the next block of keys or values to arrive from HBM.

On Hopper, this compute-to-memory ratio is even more extreme. FP8 precision roughly doubles the Tensor Cores' theoretical throughput relative to FP16. Each block of work now finishes in about half the time, so any memory latency that is not overlapped with computation wastes a larger share of that capacity. To solve this, software must hide memory transfers behind the work of the processing cores.
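To put rough numbers on this ratio, the sketch below divides peak Tensor Core throughput by HBM bandwidth. The figures are assumptions taken from NVIDIA's public H100 SXM specifications (dense throughput, no sparsity), and the function name is illustrative. Other H100 variants have different numbers.

```python
# Assumed H100 SXM figures (dense, no sparsity). Check the datasheet for your SKU.
PEAK_FLOPS = {"fp16": 989e12, "fp8": 1979e12}   # FLOPs per second
HBM_BYTES_PER_S = 3.35e12                        # bytes per second

def break_even_intensity(dtype):
    """FLOPs a kernel must perform per byte of HBM traffic to keep
    the Tensor Cores busy at the assumed peak rate."""
    return PEAK_FLOPS[dtype] / HBM_BYTES_PER_S

for dtype in ("fp16", "fp8"):
    print(f"{dtype}: about {break_even_intensity(dtype):.0f} FLOPs per HBM byte")
```

Under these assumed figures, the break-even point is roughly 295 FLOPs per byte at FP16 and roughly 591 at FP8. A kernel that does less arithmetic per byte fetched from HBM leaves the Tensor Cores waiting.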

Asynchronous Memory Transfer via TMA

Historically, if a program wanted to copy data from global GPU memory to shared local memory, the Streaming Multiprocessor (SM) had to actively manage the transfer. This meant the GPU spent precious register space and instruction-issue bandwidth simply moving bytes around, leaving fewer resources available for actual matrix multiplication.

NVIDIA Hopper introduced a dedicated hardware unit called the Tensor Memory Accelerator (TMA). A single thread can issue a TMA copy, after which the hardware handles address generation and moves a multidimensional tensor tile between global memory (HBM) and shared memory in the background.

FlashAttention-3 utilizes TMA to implement a pipelining strategy:

  1. The system issues a TMA command to load the next block of keys and values from global memory into shared memory.

  2. While that data is in transit, the Tensor Cores are actively multiplying the current block of queries and keys.

  3. The SM threads do not stall on the copy itself. They continue with the current math step and wait only at the point where the next block is needed, when the hardware signals that the background transfer is complete.

This asynchronous handoff lets data movement and computation overlap instead of alternating. As long as computing one block takes at least as long as loading the next, the latency of reading from HBM is hidden behind the execution time of the block currently being processed.
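The benefit of overlapping loads with math can be estimated with a toy timing model. This is a back-of-the-envelope sketch, not a model of TMA itself: it assumes two shared-memory buffers, a fixed load time and compute time per block, and arbitrary time units. The function names are hypothetical.

```python
def serial_time(n_blocks, t_load, t_compute):
    """Each block is loaded, then computed, with no overlap."""
    return n_blocks * (t_load + t_compute)

def pipelined_time(n_blocks, t_load, t_compute):
    """Two buffers: block i+1 loads while block i is being computed."""
    if n_blocks == 0:
        return 0.0
    # The first load is fully exposed, each later step costs the slower
    # of the two stages, and the final compute has no load left to overlap.
    return t_load + (n_blocks - 1) * max(t_load, t_compute) + t_compute

# Made-up example: 8 blocks, load takes 1.0 unit, compute takes 1.5 units.
print(serial_time(8, 1.0, 1.5))
print(pipelined_time(8, 1.0, 1.5))
```

When compute per block is at least as long as the load, only the very first load shows up in the total. When loads are slower than compute, the pipeline becomes bound by memory again, which is the case the kernel design tries to avoid.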

Leveraging FP8 and WGMMA Instructions

Operating at lower precision is one of the most effective ways to speed up model execution, but it introduces major numerical accuracy challenges. FlashAttention-3 supports FP8 (8-bit floating-point) precision. FP8 uses one byte per element instead of the two that FP16 needs, which halves the storage required for keys and values, such as the KV cache.

To maintain accuracy at this low precision, the algorithm applies block-wise scaling factors. Before the matrix multiplication, each block is scaled by its own factor so that its values fit within FP8's narrow representable range. This prevents overflow and underflow and keeps the error in the final attention output small.
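The effect of per-block scaling can be illustrated with a simplified FP8 emulation in plain Python. Everything here is an assumption made for illustration: the rounding function is a rough stand-in for the E4M3 format (largest value 448, three mantissa bits, no NaN handling), the helper names are hypothetical, and the exact scaling scheme used inside FlashAttention-3 is not reproduced. The sketch compares one scale shared by the whole tensor against one scale per block.

```python
import math

E4M3_MAX = 448.0                # largest finite E4M3 value
E4M3_MIN_NORMAL = 2.0 ** -6     # smallest normal magnitude
E4M3_SUBNORMAL_STEP = 2.0 ** -9 # spacing of subnormal values

def to_e4m3_like(x):
    """Round x onto an E4M3-style grid (simplified emulation)."""
    x = max(-E4M3_MAX, min(E4M3_MAX, x))       # saturate instead of overflowing
    if abs(x) < E4M3_MIN_NORMAL:
        # Subnormal range: evenly spaced values, so tiny inputs round to zero.
        return round(x / E4M3_SUBNORMAL_STEP) * E4M3_SUBNORMAL_STEP
    m, e = math.frexp(x)                       # x == m * 2**e, 0.5 <= |m| < 1
    return math.ldexp(round(m * 16) / 16, e)   # 1 implicit + 3 mantissa bits

def amax_scale(values):
    """Scale that maps the largest magnitude in `values` onto E4M3_MAX."""
    amax = max(abs(v) for v in values)
    return amax / E4M3_MAX if amax > 0 else 1.0

def roundtrip(values, scale):
    """Quantize with the given scale, then dequantize."""
    return [to_e4m3_like(v / scale) * scale for v in values]

def roundtrip_per_tensor(blocks):
    scale = amax_scale([v for b in blocks for v in b])
    return [roundtrip(b, scale) for b in blocks]

def roundtrip_per_block(blocks):
    return [roundtrip(b, amax_scale(b)) for b in blocks]

# Made-up data: one block with a large outlier, one block of small values.
blocks = [[1000.0, -250.0, 3.0], [0.001, -0.0005, 0.00025]]
print(roundtrip_per_tensor(blocks)[1])
print(roundtrip_per_block(blocks)[1])
```

With a single scale chosen to fit the outlier, the small block underflows to zero in this emulation. With its own scale, the small block survives the round trip almost unchanged.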

Additionally, FlashAttention-3 uses Hopper's Warp Group Matrix Multiply-Accumulate (WGMMA) instructions. Older architectures required threads to hold matrix operands in their private registers during computation, which limited the registers available for other work. WGMMA lets a warpgroup of 128 threads (four warps) issue matrix multiplications whose input operands are read directly from shared memory. This relieves pressure on the register file, which helps the kernel work on larger tiles without spilling data to slower memory.

Real-World Performance Implications

The architectural optimizations in FlashAttention-3 produce a large jump in measured performance. On NVIDIA H100 GPUs, the FlashAttention-3 authors report up to about 740 TFLOPS in FP16, roughly 75% of the hardware's theoretical maximum, and close to 1.2 PFLOPS in FP8.

For developers deploying models with long context windows (such as document analysis or multi-turn conversational agents), this hardware-aligned optimization translates directly to lower latency and higher throughput per GPU. Rather than waiting for faster chips, matching the software to the asynchronous, low-precision design of modern silicon is a practical way to recover performance the hardware already offers.
