Team: Zengzhi Wang, Fan Zhou, Xuefeng Li*, Pengfei Liu.
This is only a progress report. We are still actively working on it.
[🧑‍💻Github][🤗HF][🔖Poster]
<aside> 📌
Why do different model families (e.g., Llama, Qwen) exhibit significantly different training behavior during post-training (i.e., RL from a base model)? This blog explores how different early pre-training (and mid-training) strategies affect the post-training stage, especially reinforcement learning (RL). Qwen2.5-3B-Base and Llama-3.2-3B-Base exhibit significantly different training dynamics and performance during RL.
Note that we are still exploring more possibilities and expanding to other model families, but we are eager to share some empirical findings with the community. Our hope is to reshape the pre-training stage of LLMs in the era of RL scaling.
</aside>
Left: OctoThinker’s RL dynamics, which greatly outperform Llama-3.2 and match the Qwen model; Right: pass@K coverage curves of the Base and Zero series of OctoThinker and Llama models.
DeepSeek-R1-Zero demonstrates numerous powerful and interesting reasoning behaviors by directly performing large-scale reinforcement learning from a base language model, i.e., DeepSeek-V3-Base. Follow-up works, such as SimpleRL and Open-Reasoner-Zero, perform R1-Zero-like training on smaller base language models (such as Qwen2.5-Math-7B, Qwen2.5-7B-Base, and Qwen2.5-32B-Base) and obtain improved reasoning performance. Interestingly, existing successful replications of R1-Zero-like training are mostly based on the Qwen family, whereas it seems challenging to replicate on other popular general-purpose base language models, such as the Llama family, as also recently evidenced by Gandhi et al., 2025 and Liu et al., 2025. This motivates us to perform a preliminary study on representative models, which are also convenient for controllable experiments, to dig into the reasons behind this gap.
RL Setup. We conduct our RL experiments with the veRL framework and the GRPO algorithm. We adopt the MATH8K dataset as RL training prompts due to its moderate difficulty and concise composition. We set the global training batch size to 128, the number of rollout responses per query to 16, the PPO mini-batch size to 64, the sampling temperature to 1.0, the max output length to 4096, the learning rate to 1e-6, and the KL loss coefficient to 0 in veRL. We found that setting the ratio of sampling to gradient updates to 2 clearly makes RL training more stable. By default, we use a straightforward prompt template, “Question: {}\nAnswer: {}”, to format the training examples.
RL experimental configurations.
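To make the setup concrete, below is a minimal, illustrative Python sketch of the prompt formatting and the hyperparameters listed above. The dictionary keys and helper names are our own placeholders, not veRL’s actual configuration fields or CLI flags.

```python
# Illustrative sketch only: restates the hyperparameters above as a plain
# Python dict and shows how the default template formats a training example.
# Key names are placeholders, not veRL's real config fields.

PROMPT_TEMPLATE = "Question: {}\nAnswer: {}"

def format_example(question: str, answer: str = "") -> str:
    """Fill the default template; the answer slot stays empty for rollouts."""
    return PROMPT_TEMPLATE.format(question, answer)

grpo_setup = {
    "algorithm": "GRPO",
    "train_prompts": "MATH8K",
    "global_train_batch_size": 128,
    "rollouts_per_query": 16,
    "ppo_mini_batch_size": 64,
    "sampling_temperature": 1.0,
    "max_output_length": 4096,
    "learning_rate": 1e-6,
    "kl_loss_coef": 0.0,
    # one round of sampling feeds two rounds of gradient updates,
    # which we found makes training noticeably more stable
    "sampling_to_update_ratio": 2,
}

if __name__ == "__main__":
    print(format_example("What is 1 + 2?"))
```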
Base Model. We directly employ Llama-3.2-3B-Base and Qwen2.5-3B-Base to perform R1-Zero-style RL, given their moderate model sizes.
Evaluation. We currently adopt few-shot prompting for base language models and zero-shot evaluation for RL-tuned models.
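For illustration, the sketch below contrasts the two prompting modes; the exemplar and helper names are hypothetical and not our exact evaluation harness.

```python
# Hypothetical sketch of the two evaluation modes described above:
# few-shot prompting for base models vs. zero-shot for RL-tuned models.

FEW_SHOT_EXEMPLARS = [
    ("What is 1 + 1?", "1 + 1 = 2. The answer is 2."),
    # a handful of worked examples would be used in practice
]

def few_shot_prompt(question: str) -> str:
    """Base models: prepend worked exemplars before the test question."""
    shots = "\n\n".join(
        f"Question: {q}\nAnswer: {a}" for q, a in FEW_SHOT_EXEMPLARS
    )
    return f"{shots}\n\nQuestion: {question}\nAnswer:"

def zero_shot_prompt(question: str) -> str:
    """RL-tuned models: use the bare training template with no exemplars."""
    return f"Question: {question}\nAnswer:"
```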
Training dynamics comparison (downstream performance and the average length of correct responses) between Llama-3.2-3B and Qwen2.5-3B. The dashed lines indicate the few-shot evaluation performance and the average length of correct responses of the corresponding base models.
During RL training on Llama-3.2-3B-Base and Qwen2.5-3B-Base, we observed notably different and intriguing training dynamics, quite apart from their performance gap. Specifically, the length of correct responses from Qwen increases steadily and reasonably throughout training, whereas Llama exhibits abnormal behavior: its average response length escalates dramatically, reaching up to 4,096 tokens.
Upon closer inspection of Llama’s outputs, we found that they typically begin with $\boxed{xxx}$, followed by blatant repetition until hitting the maximum response length, in stark contrast to Qwen’s coherent and natural reasoning output. Post-RL evaluation further highlights the divergence: Qwen achieved substantial improvements over its base model across a wide spectrum of benchmarks, from simple to complex math reasoning tasks. Meanwhile, Llama experienced only marginal gains, or even regressions (as seen on GSM8K), likely due to the distributional gap between the RL training set (e.g., MATH8K) and GSM8K. The above observations lead us to attribute this divergence to differences in their pre-training, despite the opaque details of those pipelines.
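As a rough illustration of how such degenerate responses could be flagged automatically, here is a minimal heuristic sketch; the thresholds and the characters-per-token assumption are ours and not part of our actual analysis pipeline.

```python
# Rough heuristic (assumptions: ~4 chars per token, hand-picked thresholds)
# for flagging the degenerate pattern described above: a response that opens
# with a \boxed{...} answer and then pads toward the length limit by repeating.
import re

def looks_degenerate(response: str, max_tokens: int = 4096) -> bool:
    starts_with_box = bool(re.match(r"\s*\$?\\boxed\{", response))
    lines = [ln for ln in response.splitlines() if ln.strip()]
    # many lines but few unique ones -> heavy repetition
    repetitive = len(lines) > 20 and len(set(lines)) / len(lines) < 0.2
    # crude length check under the ~4 chars/token assumption
    near_limit = len(response) >= 0.9 * max_tokens * 4
    return starts_with_box and (repetitive or near_limit)
```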
These observations also further prompt a more fundamental question:
Can we intervene during the pre-training phase of Llama (e.g., via mid-training) to make it more amenable to RL scaling?
In this project, we explore a range of mid-training intervention strategies—methods that adjust the pre-training trajectory of LLMs—to examine their downstream impact on reinforcement learning. Our broader ambition is to rethink and reshape the pre-training paradigm in the context of the RL scaling era.