
When RL Meets Adaptive Speculative Training: A Unified Training-Serving System

The project leads are Xiaoxia Wu, Chenfeng Xu, and Junxiong Wang.
Together AI
Aurora in Action: Real-time demonstration of the unified training-serving system.

Why Aurora?

Traditional speculative decoding suffers from training-serving mismatch. Aurora solves this with a unified, continuously adaptive system.

  • Day-0 Support: deploy and adapt from scratch, without offline pretraining.
  • 1.57× speedup on MiniMax M2.1 in mixed-data scenarios.
  • 1.25× speedup over static speculators on Qwen3/Llama3.
  • Unified Loop: online learning from live traffic reduces distribution shift.
  • Cost Efficient: optimized serving footprint through continuous adaptation.
  • Open Source: built on SGLang, with the full implementation available.

Key Finding: Online training from scratch can exceed the performance of a pretrained speculator, challenging the conventional wisdom that speculative decoding requires offline pretraining.

Abstract

Speculative decoding can significantly accelerate LLM serving, yet most deployments today decouple speculator training from serving, treating it as a standalone offline modeling problem. This decoupled formulation introduces substantial challenges: high time-to-serve, since speculators require extensive offline training before deployment; delayed utility feedback, as the true speedup is only known after deployment; and domain-drift degradation, as target models adapt to new domains while the speculators remain static.

We present Aurora, a unified training–serving system that closes this loop by continuously learning a speculator directly from live inference traces. Our design integrates an SGLang-based inference server with an asynchronous training server, enabling hot-swapped speculator updates without service interruption. Crucially, Aurora supports day-0 deployment—a speculator can be served immediately and rapidly adapted to live traffic. Across experiments, we achieve 1.45× speedup on recently released frontier models (MiniMax M2.1 229B and Qwen3-Coder-Next 80B) starting from scratch, and an additional 1.25× speedup over well-trained static speculators on widely used models (Qwen3 and Llama3), demonstrating effective adaptation to distribution shifts.

Motivation

Most speculative decoding systems separate training from serving, introducing three key challenges:

1. Training-Serving Mismatch

Offline training optimizes acceptance in controlled settings, but production speedups depend on deployment details like kernel implementations, precision (FP8/FP4), and batching. Strong offline performance may not translate to production, creating an optimization gap that undermines real-world efficiency gains.

2. Verifier Drift

Target models update frequently, but drafters refresh slowly due to retraining costs. This creates staleness and degrades performance over time, as the drafter falls increasingly out of sync with evolving model outputs.

3. Infrastructure Cost

Off-policy distillation requires collecting large volumes of model activations, leading to high storage and operational costs at scale. The infrastructure burden becomes prohibitive for continuous model updates.

System Architecture

Aurora features a decoupled architecture with two key components:

Inference Server: Runs an SGLang-based speculative decoding engine with a target model and draft model. For each request, both accepted and rejected tokens are streamed to a distributed data buffer for continuous training.

Training Server: Asynchronously learns from live serving data in the buffer, performing gradient updates on a copy of the draft model and hot-swapping improved weights back to the inference server without service interruption.
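
To make the data flow concrete, here is a minimal sketch of the two loops, assuming a shared trace buffer and placeholder `engine` / `draft_model` objects; the names and methods are illustrative, not Aurora's actual SGLang API.

```python
import queue

trace_buffer = queue.Queue(maxsize=10_000)   # stand-in for the distributed data buffer

def inference_loop(engine):
    """Serve requests and stream accepted/rejected draft tokens into the buffer."""
    while True:
        request = engine.next_request()              # hypothetical API
        trace = engine.speculative_decode(request)   # accepted + rejected tokens
        trace_buffer.put(trace)

def training_loop(draft_model, optimizer, engine, sync_every=100):
    """Asynchronously update a copy of the draft model and hot-swap the weights."""
    for step, trace in enumerate(iter(trace_buffer.get, None), start=1):
        loss = draft_model.loss_from_trace(trace)    # hypothetical helper
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        if step % sync_every == 0:
            # Push updated weights to the serving engine without interrupting requests.
            engine.load_draft_weights(draft_model.state_dict())
```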

Aurora Training-Inference Framework: Online speculative training with asynchronous RL-style updates. A production inference server runs speculative decoding with a fixed target (verifier) and a lightweight draft (speculator), where proposed tokens are accepted or rejected during verification.

Aurora system architecture

Method

Online Speculator Training as Asynchronous RL

We view online speculative decoding as an asynchronous reinforcement learning system. The draft model acts as the policy π, and the target model plus verifier implement the environment. Each speculative step forms a short episode: the policy proposes a tree of candidate continuations, the verifier accepts a prefix, and the outcome provides structured feedback. Accepted tokens correspond to positive reward, while rejected tokens provide zero/negative reward.
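
As a toy illustration of this RL view (not Aurora's implementation), the sketch below assigns per-token rewards from the verifier's accept/reject decisions for one speculative step; `DraftToken` and `reject_penalty` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class DraftToken:
    token_id: int
    accepted: bool  # set by the verifier during speculative verification

def episode_rewards(draft_tokens, reject_penalty=0.0):
    """Assign +1 to verifier-accepted tokens and zero/negative reward to rejections."""
    return [1.0 if t.accepted else -reject_penalty for t in draft_tokens]

# Example: the verifier accepted the first two proposals and rejected the third.
step = [DraftToken(11, True), DraftToken(42, True), DraftToken(7, False)]
print(episode_rewards(step, reject_penalty=0.5))  # [1.0, 1.0, -0.5]
```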

Learning from Acceptance and Rejection

Verification yields richer supervision than acceptance-only imitation. We train the draft model with two complementary signals:

  • Acceptance loss (imitation): Cross-entropy on accepted tokens, encouraging the draft to reproduce verifier-approved continuations.
  • Rejection loss (counterfactual feedback): Rejected branches specify what the policy should not propose. With Discard Sampling, we apply a KL-based objective that pushes probability mass away from incorrect predictions.

The total loss is a weighted combination:

$$\mathcal{L} \;=\; \mathbb{E}_{x \sim p_{\text{accept}}}\!\left[\mathrm{KL}\!\left(p_{\text{target}} \,\|\, p_{\text{draft}}\right)\right] \;+\; \lambda_{\text{discard}}\, \mathbb{E}_{x \sim p_{\text{discard}}}\!\left[\mathrm{KL}\!\left(p_{\text{target}} \,\|\, p_{\text{draft}}\right)\right]$$
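
A minimal PyTorch sketch of this objective is given below, assuming per-token draft and target logits plus a boolean acceptance mask; the function and argument names are ours, not Aurora's.

```python
import torch
import torch.nn.functional as F

def combined_loss(draft_logits, target_logits, accept_mask, lambda_discard=0.1):
    """KL(p_target || p_draft) averaged over accepted tokens, plus a weighted
    KL term over rejected (discard-sampled) tokens.

    draft_logits, target_logits: float tensors of shape [num_tokens, vocab_size]
    accept_mask: bool tensor of shape [num_tokens], True where the verifier accepted.
    """
    log_p_draft = F.log_softmax(draft_logits, dim=-1)
    log_p_target = F.log_softmax(target_logits, dim=-1)
    p_target = log_p_target.exp()
    # Per-token KL(p_target || p_draft), summed over the vocabulary.
    kl = (p_target * (log_p_target - log_p_draft)).sum(dim=-1)
    zero = kl.sum() * 0.0  # keeps dtype/device and the autograd graph intact
    accept_loss = kl[accept_mask].mean() if accept_mask.any() else zero
    discard_loss = kl[~accept_mask].mean() if (~accept_mask).any() else zero
    return accept_loss + lambda_discard * discard_loss

# Example with random logits for 6 draft tokens over a 32-token vocabulary.
logits_d = torch.randn(6, 32, requires_grad=True)
logits_t = torch.randn(6, 32)
mask = torch.tensor([True, True, True, False, False, True])
print(combined_loss(logits_d, logits_t, mask))
```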

Efficient Tree Attention

We employ a specialized Tree Attention mechanism to efficiently process the complex branching structure of speculative decoding results. By constructing a custom attention mask that respects the causal structure of the speculative tree, we can process all accepted and rejected branches in a single batched forward and backward pass.
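
The sketch below illustrates the idea with a simple parent-pointer construction of the boolean mask; it conveys only the causal structure of the speculative tree and is not the optimized kernel used in Aurora.

```python
import torch

def tree_attention_mask(parents):
    """parents[i] is the index of token i's parent in the speculative tree, or -1 for the root.

    Returns an [n, n] bool mask where mask[i, j] == True means token i may attend to token j,
    i.e. each token attends to itself and its ancestors along its branch.
    """
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:           # walk up the branch to the root
            mask[i, j] = True
            j = parents[j]
    return mask

# Example: a root with two branches (0 -> 1 -> 2 and 0 -> 3).
print(tree_attention_mask([-1, 0, 1, 0]).int())
```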

Experimental Results

Day-0 Deployment: Training from Scratch

Aurora enables day-0 serving: an untrained speculator can be deployed immediately and become production-ready through online adaptation. In mixed-traffic scenarios, acceptance length reaches levels competitive with pretrained speculators.

Qwen3-Coder-Next acceptance length
Qwen3-Coder-Next throughput

Qwen3-Coder-Next-FP8: Aurora (Scratch) raises acceptance length to 3, delivering 1.21× throughput improvement (batch size 8, averaged over final 10k steps after 1k warm-up).

MiniMax M2.1 acceptance length
MiniMax M2.1 throughput

MiniMax M2.1: Aurora (Scratch) increases acceptance length to 2.8, achieving 1.45× throughput gains over baseline.

Adaptation to Distribution Shift

When requests are grouped by domain to induce abrupt distribution changes, Aurora adapts continuously. The system recovers acceptance length within approximately 10,000 requests after each shift.

Domain shift acceptance

Aurora adapts to domain shifts, recovering performance

Domain shift throughput

Throughput remains competitive despite domain shifts

Training with Existing Speculator

Starting from a trained speculator, Aurora achieves 1.25× speedup over the static baseline through continuous adaptation.

Trained spec acceptance
Trained spec throughput

Starting from a trained speculator, Aurora delivers 1.25× throughput improvement over static baseline under domain shifts.

Batch size ablation acceptance
Batch size ablation throughput

Batch size ablation study showing performance across different batch configurations.

Coding Domain Performance

Aurora also demonstrates strong performance on coding-specific workloads across different model families.

Coding benchmark Qwen

Coding benchmark results (Qwen3-8B)

Coding benchmark LLaMA

Coding benchmark results (LLaMA 3.1-8B)

Conclusion

We presented Aurora, a unified training-serving system that reframes speculative decoding as a joint learning-and-serving problem. By connecting an SGLang-based inference server with an asynchronous training server via GPU-aware RPC, Aurora enables continuous on-policy adaptation of the draft model under live traffic, closing the training-serving mismatch that limits conventional two-stage pipelines.

Our experiments show that simple online fine-tuning captures most attainable gains, that lazy synchronization best balances adaptation speed with serving stability, and that day-0 deployment from scratch is practical—an untrained speculator reaches competitive acceptance rates within thousands of requests, eliminating the offline pretraining bottleneck for onboarding new models.

BibTeX

@article{aurora2026,
  title={When RL Meets Adaptive Speculative Training: A Unified Training-Serving System},
  author={Wang, Junxiong and Bie, Fengxiang and Li, Jisen and Zhou, Zhongzhu and Shao, Zelei and Wang, Yubo and Liu, Yinghui and Wu, Qingyang and May, Avner and Yanamandra, Sri and Zhang, Yineng and Zhang, Ce and Dao, Tri and Liang, Percy and Athiwaratkun, Ben and Song, Shuaiwen Leon and Xu, Chenfeng and Wu, Xiaoxia},
  journal={arXiv preprint arXiv:2602.06932},
  year={2026},
  url={https://arxiv.org/abs/2602.06932}
}