Team: Zengzhi Wang, Fan Zhou, Xuefeng Li*, Pengfei Liu.
This is only a progress report. We are still actively working on it.
[🧑‍💻Github][🤗HF][🔖Poster]
<aside> 📌
Why do different model families (e.g., Llama, Qwen) exhibit significantly different training behavior during post-training (i.e., RL from a base model)? This blog explores how different early pre-training (and mid-training) strategies affect the post-training stage, especially reinforcement learning (RL). Qwen2.5-3B-Base and Llama-3.2-3B-Base exhibit significantly different training dynamics and performance during RL.
Note that we are still exploring more possibilities and expanding to other model families, but we are eager to share some empirical findings with the community. Our hope is to reshape the pre-training stage of LLMs in the era of RL scaling.
</aside>
Left: OctoThinker’s RL dynamics, which greatly outperform Llama-3.2 and match the Qwen model; Right: pass@K coverage curves of the Base and Zero series of OctoThinker and Llama models.
DeepSeek-R1-Zero demonstrates numerous powerful and interesting reasoning behaviors by directly performing large-scale reinforcement learning from a base language model, i.e., DeepSeek-V3-Base. Follow-up works, such as SimpleRL and Open-Reasoner-Zero, perform R1-Zero-like training on smaller base language models (such as Qwen2.5-Math-7B, Qwen2.5-7B-Base, and Qwen2.5-32B-Base) and obtain improved reasoning performance. Interestingly, existing successful replications of R1-Zero-like training are mostly based on the Qwen family, whereas it seems challenging to replicate on other popular general-purpose base language models, such as the Llama family, as also recently evidenced by Gandhi et al., 2025 and Liu et al., 2025. This motivates us to perform a preliminary study on representative models, which are also convenient for controllable experiments, to dig into the reasons behind this gap.
RL Setup. We conduct our RL experiments with the veRL framework and the GRPO algorithm. We adopt the MATH8K dataset as RL training prompts due to its moderate difficulty and concise composition. We set the global training batch size to 128, the number of rollout responses per query to 16, the PPO mini-batch size to 64, the sampling temperature to 1.0, the max output length to 4096, the learning rate to 1e-6, and the KL loss coefficient to 0 in veRL. We found that setting the ratio of sampling to gradient updates to 2 clearly makes RL training more stable. By default, we use a straightforward prompt template, “Question: {}\nAnswer: {}”, to format the training examples.
RL experimental configurations.
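To make the setup concrete, below is a minimal, illustrative Python sketch of the prompt formatting and the hyperparameters listed above. The dictionary keys and helper names are our own placeholders, not veRL’s actual configuration fields or CLI flags.

```python
# Illustrative sketch only: restates the hyperparameters above as a plain
# Python dict and shows how the default template formats a training example.
# Key names are placeholders, not veRL's real config fields.

PROMPT_TEMPLATE = "Question: {}\nAnswer: {}"

def format_example(question: str, answer: str = "") -> str:
    """Fill the default template; the answer slot stays empty for rollouts."""
    return PROMPT_TEMPLATE.format(question, answer)

grpo_setup = {
    "algorithm": "GRPO",
    "train_prompts": "MATH8K",
    "global_train_batch_size": 128,
    "rollouts_per_query": 16,
    "ppo_mini_batch_size": 64,
    "sampling_temperature": 1.0,
    "max_output_length": 4096,
    "learning_rate": 1e-6,
    "kl_loss_coef": 0.0,
    # one round of sampling feeds two rounds of gradient updates,
    # which we found makes training noticeably more stable
    "sampling_to_update_ratio": 2,
}

if __name__ == "__main__":
    print(format_example("What is 1 + 2?"))
```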
Base Model. We directly employ Llama-3.2-3B-Base and Qwen2.5-3B-Base to perform R1-Zero-style RL, given their moderate model sizes.
Evaluation. We currently adopt few-shot prompting for base language models and zero-shot evaluation for RL-tuned models.
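For illustration, the sketch below contrasts the two prompting modes; the exemplar and helper names are hypothetical and not our exact evaluation harness.

```python
# Hypothetical sketch of the two evaluation modes described above:
# few-shot prompting for base models vs. zero-shot for RL-tuned models.

FEW_SHOT_EXEMPLARS = [
    ("What is 1 + 1?", "1 + 1 = 2. The answer is 2."),
    # a handful of worked examples would be used in practice
]

def few_shot_prompt(question: str) -> str:
    """Base models: prepend worked exemplars before the test question."""
    shots = "\n\n".join(
        f"Question: {q}\nAnswer: {a}" for q, a in FEW_SHOT_EXEMPLARS
    )
    return f"{shots}\n\nQuestion: {question}\nAnswer:"

def zero_shot_prompt(question: str) -> str:
    """RL-tuned models: use the bare training template with no exemplars."""
    return f"Question: {question}\nAnswer:"
```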
Training dynamics comparison (downstream performance and the average length of correct responses) between Llama-3.2-3B and Qwen2.5-3B. The dashed lines indicate the few-shot evaluation performance and the average length of correct responses of the corresponding base models.
During RL training on Llama-3.2-3B-Base and Qwen2.5-3B-Base, we observed notably different and intriguing training dynamics, quite apart from their performance gap. Specifically, the length of correct responses from Qwen increases steadily and reasonably throughout training, whereas Llama exhibits abnormal behavior: its average response length escalates dramatically, reaching up to 4,096 tokens.
Upon closer inspection of Llama’s outputs, we found that they typically begin with $\boxed{xxx}$, followed by blatant repetition until hitting the maximum response length, in stark contrast to Qwen’s coherent and natural reasoning output. Post-RL evaluation further highlights the divergence: Qwen achieved substantial improvements over its base model across a wide spectrum of benchmarks, from simple to complex math reasoning tasks. Meanwhile, Llama experienced only marginal gains, or even regressions (as seen on GSM8K), likely due to the distributional gap between the RL training set (e.g., MATH8K) and GSM8K. The above observations lead us to attribute this divergence to differences in their pre-training, despite the opaque details of those pipelines.
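As a rough illustration of how such degenerate responses could be flagged automatically, here is a minimal heuristic sketch; the thresholds and the characters-per-token assumption are ours and not part of our actual analysis pipeline.

```python
# Rough heuristic (assumptions: ~4 chars per token, hand-picked thresholds)
# for flagging the degenerate pattern described above: a response that opens
# with a \boxed{...} answer and then pads toward the length limit by repeating.
import re

def looks_degenerate(response: str, max_tokens: int = 4096) -> bool:
    starts_with_box = bool(re.match(r"\s*\$?\\boxed\{", response))
    lines = [ln for ln in response.splitlines() if ln.strip()]
    # many lines but few unique ones -> heavy repetition
    repetitive = len(lines) > 20 and len(set(lines)) / len(lines) < 0.2
    # crude length check under the ~4 chars/token assumption
    near_limit = len(response) >= 0.9 * max_tokens * 4
    return starts_with_box and (repetitive or near_limit)
```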
These observations also further prompt a more fundamental question:
Can we intervene during the pre-training phase of Llama (e.g., via mid-training) to make it more amenable to RL scaling?
In this project, we explore a range of mid-training intervention strategies—methods that adjust the pre-training trajectory of LLMs—to examine their downstream impact on reinforcement learning. Our broader ambition is to rethink and reshape the pre-training paradigm in the context of the RL scaling era.