Technology

Optimizing TiledCopy for Memory Coalescing on NVIDIA GPUs

Unlock the full potential of your CUDA kernels by mastering memory coalescing with TiledCopy. This article dives deep into optimizing data transfers from Global to Shared Memory on NVIDIA GPUs, covering cp.async, row-major vs. column-major layouts, and cache line alignment to maximize memory bandwidth and accelerate your deep learning workloads.
Noll
5 min read
#TiledCopy#memory coalescing#cp.async#CUDA

In a previous article, we explored the ldmatrix instruction for efficient matrix transfers from Shared Memory to Registers on NVIDIA GPUs. Here, we focus on optimizing data movement from Global Memory to Shared Memory—a critical step for high-performance CUDA kernels, especially in tensor operations and deep learning workloads.

Why Memory Coalescing Matters in CUDA

Pipelining is essential in GPU programming to hide memory latency and maximize throughput. NVIDIA's CuTe C++ template library abstracts the SM80 cp.async instruction into a high-level interface called TiledCopy. This enables asynchronous, coalesced data transfers from Global Memory to Shared Memory, allowing Tensor Cores to compute while data loads in parallel.

What is TiledCopy?

TiledCopy is a CuTe abstraction for efficiently copying tiles of data between memory spaces. It is highly configurable via thread and value layouts, making it adaptable for various tensor shapes and memory layouts.

auto tiled_copy = make_tiled_copy(
    Copy_Atom<UniversalCopy, half_t>{},
    Layout<Shape<_16, _8>, Stride<_8, _1>>{}, // ThrLayout
    Layout<Shape<_1, _4>>{});                 // ValLayout

TiledCopy operation showing thread layout and value layout configuration

Understanding Tiler_MN and TiledLayout_TV

What is Tiler_MN?

  • Tiler_MN defines the shape of the tile that a single TiledCopy operation moves.
  • It is determined by the thread layout (ThrLayout) and value layout (ValLayout).
  • TiledCopy is agnostic to the stride of the source tensor, making it flexible for row-major, column-major, or custom layouts.

For example, in a 4096x4096 row-major matrix, a (16, 64) tile has a stride of (4096, 1). In column-major, the stride is (1, 4096). TiledCopy works efficiently regardless, as long as the tile shape matches.

How is Tiler_MN Computed?

  • Copy_Atom<UniversalCopy, half_t>: Each thread copies one half_t element.
  • ValLayout<Shape<_1, _4>>: Each thread copies a (1, 4) tile.
  • ThrLayout<Shape<_16, _8>, Stride<_8, _1>>: Threads are arranged in a 16x8 grid.
  • Total tile shape: (16 threads, 32 elements) = (16, 32) for Tiler_MN.

What is TiledLayout_TV?

TiledLayout_TV maps a thread's ID and its local element coordinate to a global coordinate in the tile. For example, thread 9 in a (16, 8) grid accesses its (0, 2) element, which maps to a unique position in the (16, 32) tile.

TiledLayout_TV mapping from thread ID to global tile coordinates

Memory Coalescing Patterns

Row-Major Layout: Optimal Coalescing

When copying a (16, 32) tile from a row-major matrix (stride (4096, 1)), threads access contiguous memory. Each group of 8 threads copies 32 consecutive elements, maximizing memory bandwidth.

Thread ID01234567
Offset0-34-78-1112-1516-1920-2324-2728-31

Column-Major Layout: Suboptimal Coalescing

For column-major matrices (stride (1, 4096)), threads access non-contiguous memory, resulting in inefficient, strided accesses and reduced bandwidth.

Thread ID01234567
Offset0, 4096,...4, 4100,...8, 4104,..................

Solution: Transposed TiledCopy for Column-Major

auto tiled_copy = make_tiled_copy(
    Copy_Atom<UniversalCopy, half_t>{},
    Layout<Shape<_8, _16>, Stride<_16, _1>>{}, // ThrLayout
    Layout<Shape<_4, _1>>{});                  // ValLayout

This layout ensures threads access contiguous memory even in column-major matrices, restoring coalesced access.

Leveraging cp.async for Asynchronous Copies

The cp.async instruction (SM80+) enables hardware-accelerated, asynchronous memory copies. To use it efficiently:

auto tiled_copy = make_tiled_copy(
    Copy_Atom<SM80_CP_ASYNC_CACHEGLOBAL<cute::uint128_t>, half_t>{},
    Layout<Shape<_16, _8>, Stride<_8, _1>>{}, // ThrLayout
    Layout<Shape<_1, _8>>{});                 // ValLayout
  • Copy_Atom<SM80_CP_ASYNC_CACHEGLOBAL<cute::uint128_t>, half_t>: Each thread copies 8 half_t elements (16 bytes) at once.
  • ValLayout<Shape<_1, _8>>: Each thread's tile is contiguous, matching the 128-bit vector requirement.

Note: For column-major, use the transposed version to ensure contiguous access:

auto tiled_copy = make_tiled_copy(
    Copy_Atom<SM80_CP_ASYNC_CACHEGLOBAL<cute::uint128_t>, half_t>{},
    Layout<Shape<_8, _16>, Stride<_16, _1>>{}, // ThrLayout
    Layout<Shape<_8, _1>>{});                  // ValLayout

Cache Line Alignment for Maximum Bandwidth

A well-chosen TiledCopy aligns thread groups to cache lines. For example, 8 threads each copying 16 bytes access 128 bytes (one cache line). A full warp (32 threads) loads 4 cache lines efficiently.

auto tiled_copy = make_tiled_copy(
    Copy_Atom<SM80_CP_ASYNC_CACHEGLOBAL<cute::uint128_t>, half_t>{},
    Layout<Shape<_32, _4>, Stride<_4, _1>>{}, // ThrLayout
    Layout<Shape<_1, _8>>{});                 // ValLayout

Here, only 64 bytes are accessed per group, wasting half the cache line bandwidth. Always align thread and value layouts to cache line sizes (128 bytes) for optimal performance.

Conclusion

TiledCopy and cp.async enable high-throughput data transfers from global to shared memory in CUDA. By carefully choosing thread and value layouts, you can ensure memory coalescing, maximize cache line utilization, and fully exploit NVIDIA GPU hardware. For further optimization, explore how shared memory layouts (e.g., Swizzle) can avoid bank conflicts and further boost performance.

Related Articles

Technology
6 min

SFT Flaw: A Learning Rate Tweak Unlocks LLM Potential

Discover a critical flaw in Supervised Fine-Tuning (SFT) that limits LLM performance. Learn how a simple learning rate tweak unifies SFT and DPO for a 25% gain.

Noll
Supervised Fine-Tuning (SFT)Direct Preference Optimization (DPO)+2 more
Technology
7 min

Two Major Challenges in Reinforcement Learning Finally Solved by ICLR Papers

Traditional reinforcement learning models struggle with real-time applications due to "AI lag." Two ICLR 2025 papers from Mila introduce groundbreaking solutions to tackle inaction and delay regret, enabling large AI models to operate in high-frequency, dynamic environments without compromising speed or intelligence.

Noll
TechnologyAI+1 more
Technology
13 min

Discuss the infrastructure requirements of Agentic AI.

The rise of Agentic AI places unprecedented demands on our infrastructure. This article explores the emerging software and hardware requirements, from specialized runtimes and memory services to zero-trust security models, dissecting AWS's new Bedrock AgentCore platform and discussing the future of AI infrastructure.

Noll
TechnologyAI+1 more

About This Article

Topic: Technology
Difficulty: Intermediate
Reading Time: 5 minutes
Last Updated: July 20, 2025

This article is part of our comprehensive guide to Large Language Models and AI technologies. Stay updated with the latest developments in the AI field.

All Articles
Share this article to spread LLM knowledge