The Future of LLM Training: Separated Architectures for RL Post-Training

With the rise of advanced models like OpenAI's o1 and DeepSeek's R1, reinforcement learning (RL) post-training has become a critical stage in developing large language models (LLMs). The AI community is discovering that RL not only aligns models with human values but also significantly boosts their reasoning capabilities. Furthermore, its self-iterating training paradigm offers a potential solution to the data bottleneck limiting pre-training, making it a key area of research.

When implementing reinforcement learning for LLMs, frameworks typically use one of two designs: co-located or separated architectures.

  • Co-located Architecture: Different computational tasks (e.g., training and inference) share the same hardware resources, running serially in a time-division multiplexing fashion.
  • Separated Architecture: Different tasks are assigned to dedicated hardware, operating in a space-division multiplexing model.
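
To make the distinction concrete, here is a minimal, illustrative sketch of the two execution patterns. Threads, sleeps, and the generate/update stubs stand in for real clusters and engines; none of this is the API of any particular framework.

import queue
import threading
import time

def generate(prompts):
    # Stand-in for the inference engine (rollout); real decoding is memory-bound.
    time.sleep(0.1)
    return [p + " ... generated response" for p in prompts]

def update(experience):
    # Stand-in for the training engine; real training is compute-bound.
    time.sleep(0.1)

def run_colocated(prompt_batches):
    # Time-division multiplexing: the same GPUs run the two phases serially,
    # with weight load/offload and resharding in between (not modeled here).
    for prompts in prompt_batches:
        experience = generate(prompts)
        update(experience)

def run_separated(prompt_batches):
    # Space-division multiplexing: dedicated pools run concurrently and
    # exchange data through a hub (threads stand in for clusters).
    hub = queue.Queue()
    producer = threading.Thread(
        target=lambda: [hub.put(generate(p)) for p in prompt_batches])
    consumer = threading.Thread(
        target=lambda: [update(hub.get()) for _ in prompt_batches])
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()

run_colocated([["prompt 1"], ["prompt 2"]])
run_separated([["prompt 1"], ["prompt 2"]])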

Since early this year, a clear trend has emerged: separated architectures are rapidly becoming the new standard for LLM post-training. This architectural shift is driven by the need to boost efficiency and cut costs.

Why Separated Architectures are the Future of RL Post-Training

A separated architecture for RL post-training assigns different computational tasks to dedicated hardware. This approach offers two key advantages over co-located systems:

  1. Increased Efficiency: It separates compute-intensive training from memory-bound inference, preventing performance bottlenecks.
  2. Reduced Costs: It allows for the use of specialized clusters and heterogeneous hardware, lowering operational expenses.

The Efficiency Problem with Co-located Architectures

The computational profiles of training and inference are fundamentally different. Training is a classic compute-intensive workload, while the decoding phase of inference is memory-bound. Their scaling dynamics also diverge, creating a performance gap that widens as cluster size or sequence length increases.
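
A back-of-the-envelope arithmetic-intensity comparison makes the divergence concrete. The matrix shapes below are illustrative, not measurements of any specific model.

def arithmetic_intensity(m, k, n, bytes_per_elem=2):
    # FLOPs per byte moved for an (m x k) @ (k x n) matmul in fp16/bf16.
    flops = 2 * m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

# Training: large token batches turn every layer into a big GEMM.
print(arithmetic_intensity(m=4096, k=8192, n=8192))   # ~2048 FLOPs/byte -> compute-bound

# Decoding: one new token per sequence degenerates into a matrix-vector product,
# so the kernel is limited by how fast weights stream from memory.
print(arithmetic_intensity(m=1, k=8192, n=8192))      # ~1 FLOP/byte -> memory-bound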

In a typical post-training run, the total token counts for inference and training are nearly identical. A co-located architecture forces these distinct workloads to share the same resources, creating a scaling wall that severely bottlenecks performance, especially in large-scale scenarios. Additionally, co-located systems suffer from significant overhead as models are loaded and offloaded between steps and resharded for different tensor parallelism strategies.

The Cost Advantage of Separated Systems

From a cost perspective, a separated architecture is highly advantageous. The RL framework can act as a high-level scheduler, enabling the reuse of existing, specialized training and inference clusters. This design also supports heterogeneous hardware for post-training, which can dramatically lower operational expenses. As post-training evolves to include more complex components like multi-agent systems and tool use, a serial computation flow becomes a major bottleneck. This is why the evidence points to separated architectures as the future of RL post-training.

The Core Challenges of Separated RL Architectures

If separated architectures are superior, why weren't they always the standard? The answer lies in their inherent complexity, particularly around data orchestration and pipeline efficiency.

Complex Data Orchestration

RL algorithms involve multiple computational tasks with complex, interwoven data dependencies. In a co-located framework, data is managed within a single process. In a separated world, multiple tasks run in parallel, turning data orchestration into a significant challenge. Data must be managed at a finer granularity—down to the micro-batch or even the sample level.

Furthermore, data dependencies are fluid. In RL, responses are generated on the fly, so their lengths and token counts cannot be known in advance. This leads to load-balancing issues that hurt training efficiency. A dynamic, flexible, "pull-based" data routing system is needed, where tasks can grab new work as soon as they have capacity.
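
A tiny simulation, with made-up generation times, shows why pull-based routing matters when response lengths are unpredictable:

import random

random.seed(0)
# Hypothetical per-sample generation times; in practice they are unknowable in
# advance because response lengths are decided token by token.
times = [random.uniform(1.0, 20.0) for _ in range(64)]
num_workers = 8

# Static partitioning: samples are pre-assigned round-robin, so the unluckiest
# worker dictates how long every other worker waits.
static = max(sum(times[w::num_workers]) for w in range(num_workers))

# Pull-based routing: whichever worker frees up first takes the next sample.
loads = [0.0] * num_workers
for t in times:
    loads[loads.index(min(loads))] += t
pulled = max(loads)

print(f"static partition: {static:.0f}s per step, pull-based: {pulled:.0f}s per step")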

Pipeline Bubbles and GPU Inefficiency

Another challenge stems from the RL algorithm itself. On-policy algorithms require inference to run on the same, freshly updated weights that training produces. In a separated setup, this creates a stop-and-wait problem: inference workers generate data and then sit idle, waiting for training workers to update and broadcast new model weights. This idle time creates massive "pipeline bubbles," wasting valuable GPU cycles.
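
A quick calculation with hypothetical step times shows how much a strictly synchronous schedule wastes:

# Hypothetical per-step durations in seconds; purely illustrative.
t_rollout, t_train, t_weight_sync = 60.0, 40.0, 5.0

# In a strictly synchronous on-policy schedule the phases run back to back,
# so each cluster idles while the other phase (plus the weight broadcast) runs.
step_time = t_rollout + t_train + t_weight_sync
inference_idle = (t_train + t_weight_sync) / step_time
training_idle = (t_rollout + t_weight_sync) / step_time
print(f"inference GPUs idle {inference_idle:.0%} of each step")   # ~43%
print(f"training GPUs idle {training_idle:.0%} of each step")     # ~62%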

As models and clusters grew, the drawbacks of co-located architectures became too significant to ignore, forcing the community to solve the hard problems of separated systems.

TransferQueue: A Streaming Dataflow Solution for LLMs

To overcome the challenges of data orchestration and pipeline inefficiency, a robust data system is essential. We propose TransferQueue: a data system with a global view and streaming data capabilities designed to orchestrate efficient data flow in complex LLM post-training pipelines.

TransferQueue acts as a central "data hub." Upstream producers (like Actor Rollout) send data to the hub, where it's stored at a granular level. Downstream consumers (like Actor Update) request data when ready, and TransferQueue repackages and sends what's needed. This enables dynamic data routing between any two computational tasks.
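
As a mental model, the hub pattern boils down to something like the toy class below. ToyHub and its fields are hypothetical stand-ins for illustration, not TransferQueue's real interface.

class ToyHub:
    # Purely in-memory stand-in for the "data hub" pattern: producers push
    # individual samples, consumers pull repackaged micro-batches on demand.
    def __init__(self):
        self.samples = []

    def put_experience(self, sample):
        self.samples.append(sample)

    def get_batch(self, batch_size):
        if len(self.samples) < batch_size:
            return None   # a real consumer would block here until data arrives
        batch, self.samples = self.samples[:batch_size], self.samples[batch_size:]
        return batch

hub = ToyHub()
# Producer side (e.g. Actor Rollout) streams samples in as they finish.
for i in range(8):
    hub.put_experience({"prompt": f"q{i}", "response": f"a{i}"})
# Consumer side (e.g. Actor Update) pulls a micro-batch when it has capacity.
print(hub.get_batch(batch_size=4))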

How TransferQueue Decouples Computational Tasks

This design unlocks three powerful benefits for pipeline efficiency:

  1. Task Decoupling: As a central intermediary, TransferQueue decouples computational tasks. This allows for automatic pipeline orchestration and makes it easy to add new stages (like a critique model or safety filter).
  2. Fine-Grained Scheduling: Data is managed at the sample level. The moment enough samples are ready to form a micro-batch, they can be scheduled. This "first-ready, first-served" approach reduces idle time and mitigates the impact of stragglers.
  3. Global Load Balancing: TransferQueue has a global view of all data, allowing it to implement sophisticated load balancing algorithms to maximize system throughput.

The Architecture of TransferQueue: Control and Data Planes

The diagram below illustrates the overall architecture of TransferQueue, which is divided into a control plane and a data plane. The control plane manages global data scheduling, while the distributed data plane stores and transmits the actual training data.

[Figure: Overall architecture of TransferQueue, divided into a control plane and a data plane]

To enable streaming, fine-grained data management, TransferQueue organizes data in a two-dimensional structure: each row is a data sample, and each column is a data field produced by one of the computational tasks. This design allows us to pinpoint any piece of data with a simple (index, column_name) key, enabling highly concurrent reads and writes and a true dataflow paradigm.
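
In code, the addressing scheme amounts to a two-dimensional table keyed by (index, column_name). The sketch below, with illustrative field names, only shows the keying, not the real storage backend.

# Rows are sample indices, columns are data fields; any cell is addressed by
# the (index, column_name) pair, so different tasks can read and write
# different columns of the same row concurrently.
data_plane = {}   # (index, column_name) -> payload

def write(index, column_name, payload):
    data_plane[(index, column_name)] = payload

def read(index, column_name):
    return data_plane[(index, column_name)]

write(0, "prompt", "What is RL post-training?")
write(0, "response", "... generated text ...")
write(0, "reward", 0.72)
print(read(0, "response"))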

In the control plane, metadata tables track the production and consumption status of each data sample. When an upstream task writes data, the controller updates its metadata. Once all required data columns for a sample are complete, it becomes available for the next stage.
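
The control-plane bookkeeping can be pictured as a table of production flags plus a scan for fully produced, not-yet-consumed samples. The code below is an illustrative reconstruction, not the actual controller.

# metadata[index][column_name] is flipped to True when that field is written.
metadata = {i: {"response": False, "reward": False} for i in range(4)}
consumed = set()   # samples already handed to this downstream stage

def mark_produced(index, column_name):
    metadata[index][column_name] = True

def ready_samples(required_columns):
    # A sample becomes available to the next stage once every required column
    # has been produced and it has not been consumed yet.
    return [i for i, cols in metadata.items()
            if i not in consumed and all(cols[c] for c in required_columns)]

mark_produced(0, "response")
mark_produced(0, "reward")
mark_produced(1, "response")
print(ready_samples(["response", "reward"]))   # -> [0]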

With a centralized hub like TransferQueue, the entire RL algorithm can be reframed using a simple dataflow model.
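
Concretely, each sub-task can be declared by the columns it consumes and the columns it produces, and the hub wires the stages together. The stage and column names below follow a generic PPO-style pipeline and are illustrative.

# Each stage = (columns it reads, columns it writes). A stage can run on a
# sample as soon as that sample's input columns are complete in the hub.
PIPELINE = {
    "actor_rollout": (["prompt"], ["response", "old_log_prob"]),
    "reward":        (["prompt", "response"], ["reward"]),
    "reference":     (["prompt", "response"], ["ref_log_prob"]),
    "actor_update":  (["prompt", "response", "reward",
                       "old_log_prob", "ref_log_prob"], []),
}

for stage, (inputs, outputs) in PIPELINE.items():
    print(f"{stage}: reads {inputs} -> writes {outputs}")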

While systems like StreamRL and AReal use similar streaming concepts, they typically only stream data from the inference to the training cluster. TransferQueue provides a true end-to-end dataflow solution by extending the streaming concept within the training cluster to orchestrate all sub-tasks. We've wrapped TransferQueue into a familiar PyTorch DataLoader interface, abstracting away the complexity.
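
As a rough sketch of what such a wrapper could look like (assuming a client object with a blocking get_batch call; this is not the actual AsyncFlow code):

from torch.utils.data import DataLoader, IterableDataset

class TransferQueueDataset(IterableDataset):
    # Streams micro-batches from a TransferQueue-style client whose get_batch
    # is assumed to block until enough samples are ready.
    def __init__(self, client, micro_batch_size, num_batches):
        self.client = client
        self.micro_batch_size = micro_batch_size
        self.num_batches = num_batches

    def __iter__(self):
        for _ in range(self.num_batches):
            yield self.client.get_batch(self.micro_batch_size)

# The training loop keeps its familiar shape; batch_size=None passes our
# ready-made micro-batches through unchanged.
# loader = DataLoader(TransferQueueDataset(client, 8, 128), batch_size=None)
# for micro_batch in loader:
#     loss = train_step(micro_batch)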

By leveraging TransferQueue, we can shrink the pipeline bubbles between global batches down to just a few micro-batches.

Implementation and Future of AsyncFlow

Our proof-of-concept for TransferQueue is built on Ray. While Ray simplifies development, we are refactoring for optimal performance at scale. The following details are illustrative.

A Look at the Code: put_experience and get_batch

The put_experience function encapsulates writing data to the store, updating metadata, and notifying downstream tasks:

def put_experience(self, data: Dict[str, Any]) -> None:
    # self.data_store and self.controller are handles to Ray actors (illustrative).
    # 1. Write the data to the DataStore; Ray returns an ObjectRef to the stored entry
    data_ref = self.data_store.put.remote(data)
    # 2. Register the new entry's metadata with the Controller
    self.controller.update_metadata.remote(data_ref)
    # 3. Wake up downstream tasks that are waiting for fresh data
    self.controller.broadcast.remote()

The Controller's get_batch interface applies a load balancing strategy to ensure workers remain active.

def get_batch(self, batch_size: int, worker_rank: int) -> List[DataRef]:
    while True:
        # 1. Collect the indices of samples whose required columns are all ready
        ready_indices = self.get_ready_indices()
        if len(ready_indices) >= batch_size:
            # 2. Pick batch_size samples for this worker according to the
            #    load-balancing strategy
            selected_indices = self.balance_and_select(ready_indices, batch_size, worker_rank)
            # 3. Mark the selected samples as consumed
            self.update_read_status(selected_indices)
            return self.get_data_by_indices(selected_indices)
        # 4. Not enough ready data yet: block until a producer's broadcast wakes us up
        self.wait()

The Path Forward: Contributing to Open Source

Looking ahead, TransferQueue will be composed of three core abstractions: DataStore, Controller, and Schema. The shift towards separated architectures marks a pivotal moment in large-scale AI. Solutions like TransferQueue show that a streaming, dataflow-centric approach can overcome the hurdles, boosting efficiency, cutting costs, and paving the way for more sophisticated AI systems.

The ideas discussed here are detailed in our recent arXiv paper, "AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training." We plan to contribute this work to the verl open-source community (see RFC #26623) and invite you to join the discussion.
