
Training-Efficient RL

Current Reinforcement Fine-tuning (RFT) methods for Large Language Models (LLMs) are notorious for their low sample efficiency and high computational costs, often consuming hundreds of GPU hours for relatively few training steps. Existing approaches attempt to mitigate this through curriculum learning based on heuristic difficulty metrics, but these methods fail to utilize the intrinsic learning signals generated by the model itself.


A new article[1] details a novel approach, GAIN-RL (Gradient-driven Angle-Informed Navigated RL), which leverages an intrinsic model signal called angle concentration. Angle concentration, defined by the angular distribution of token hidden state vectors, is theoretically and empirically demonstrated to correlate directly with the resulting gradient strength and the model's capacity to learn from specific data. By dynamically selecting data based on this signal, GAIN-RL ensures consistently impactful gradient updates, achieving over a 2.5x acceleration in training efficiency across diverse mathematical and coding tasks and varying model scales. Furthermore, GAIN-RL demonstrates superior data efficiency, achieving better performance using only half of the original training data compared to vanilla methods using the full dataset.

Overview of GAIN-RL

The Persistent Challenge of Reinforcement Fine-Tuning


The emergence of RFT techniques [3], exemplified by DeepSeek-R1 and OpenAI’s o1, has significantly enhanced the performance of LLMs in complex domains like mathematical reasoning and code generation. This success has spurred the development of various algorithmic optimizations, including GRPO[4], ReMax[5], and Reinforce++[6].


Despite this progress, RFT remains hindered by two fundamental limitations: low sample efficiency and prohibitively high computational costs. For example, the GRPO fine-tuning phase on models like Qwen 2.5-7B can consume roughly 240 GPU hours just to complete 100 steps over 8,000 samples. This inefficiency raises the question of whether continuous rote repetition of every data point is necessary.


Prior efforts to accelerate LLM training have focused on data manipulation techniques, categorized into sample selection (e.g., LIMO, S1) and data ordering strategies (e.g., ADARFT [2]). However, these existing methods suffer from critical drawbacks:


  1. Neglect of Intrinsic Model Characteristics: They rely on fixed, model-agnostic criteria like difficulty or diversity. Different models interpret the same problem differently, meaning a "one-size-fits-all" difficulty measure leads to suboptimal training outcomes.


  2. High Data Preprocessing Costs: Methods like S1[8] and LIMO[7] require running large-scale models across entire datasets to compute scores, while curriculum-learning approaches rely on manual or expert-defined labels; both limit scalability and responsiveness. ADARFT, for instance, requires difficulty coefficients that may not be available for certain datasets.


To overcome these challenges, the authors sought a model-informed signal that reflects the learning capacity of a specific model on particular data, incurs minimal computational cost, and generalizes across diverse models and datasets.


Angle Concentration: The Intrinsic Model Signal


The core innovation of this research lies in identifying the angle concentration signal as the key to unlocking training efficiency. This signal is derived from the angular relationship between token hidden states during the inexpensive pre-filling stage, a single forward pass, rather than the computationally costly decoding stage.


Theoretical Justification via Gradient Norm


Model training is fundamentally driven by incremental gradient updates. By reformulating the Frobenius norm of the gradient with respect to a weight matrix W, the explicit relationship between hidden states and gradients is revealed:

[Equation: squared Frobenius norm of the gradient, expressed through token hidden-state magnitudes and pairwise angles]

where x_i and x_j are the hidden state vectors of tokens i and j, and \cos \theta_{i,j} is the cosine of the angle between them. This equation shows explicitly that both the magnitudes of token hidden states and the angles between them directly influence the resulting gradient values during backpropagation.
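As a sketch of where such an identity comes from, assume a single linear layer $Y = XW$ whose input rows are the token hidden states $x_i$, and write $g_i$ for the upstream gradient $\partial \mathcal{L} / \partial y_i$. The symbols $Y$, $G$, and $g_i$ are introduced here for illustration and may differ from the paper's notation:

```latex
\begin{aligned}
\nabla_W \mathcal{L} &= X^\top G = \sum_i x_i g_i^\top, \\
\|\nabla_W \mathcal{L}\|_F^2
  &= \operatorname{tr}\!\left(G^\top X X^\top G\right)
   = \sum_i \sum_j \left(x_i^\top x_j\right)\left(g_i^\top g_j\right) \\
  &= \sum_i \sum_j \|x_i\|\,\|x_j\|\,\cos\theta_{i,j}\,\left(g_i^\top g_j\right).
\end{aligned}
```

Under this assumption, each pair of tokens contributes in proportion to the cosine of the angle between their hidden states, which is why angular structure shows up directly in the gradient norm.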

Since the magnitudes of token hidden states are typically normalized within each layer during inference, they fail to convey useful differentiating information. Consequently, the research focuses specifically on the relative angles between token hidden states. The theoretical insight derived is that the more concentrated the angles between tokens, the larger the gradient norm.
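A small numerical illustration may make this concrete. The sketch below is not the authors' code: it assumes a single linear layer, unit-norm hidden states, and (as a simplifying assumption) the same upstream gradient vector for every token. All names, such as `grad_norm_sq` and `angle_form`, are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def grad_norm_sq(xs, gs):
    """Squared Frobenius norm of X^T G, computed entry by entry."""
    d, m = len(xs[0]), len(gs[0])
    total = 0.0
    for a in range(d):
        for b in range(m):
            entry = sum(x[a] * g[b] for x, g in zip(xs, gs))
            total += entry * entry
    return total

def angle_form(xs, gs):
    """Same quantity via sum_ij |x_i||x_j| cos(theta_ij) (g_i . g_j)."""
    total = 0.0
    for xi, gi in zip(xs, gs):
        for xj, gj in zip(xs, gs):
            ni, nj = math.sqrt(dot(xi, xi)), math.sqrt(dot(xj, xj))
            cos_ij = dot(xi, xj) / (ni * nj)
            total += ni * nj * cos_ij * dot(gi, gj)
    return total

rng = random.Random(0)
n_tokens, d_hidden, d_out = 8, 16, 4

# Dispersed tokens: independent random unit directions.
dispersed = [normalize([rng.gauss(0, 1) for _ in range(d_hidden)])
             for _ in range(n_tokens)]

# Concentrated tokens: one shared direction plus small noise, renormalized.
base = normalize([rng.gauss(0, 1) for _ in range(d_hidden)])
concentrated = [normalize([b + 0.05 * rng.gauss(0, 1) for b in base])
                for _ in range(n_tokens)]

# Assumption: every token receives the same upstream gradient vector.
g = [rng.gauss(0, 1) for _ in range(d_out)]
shared_grads = [g] * n_tokens

norm_dispersed = grad_norm_sq(dispersed, shared_grads)
norm_concentrated = grad_norm_sq(concentrated, shared_grads)
print(norm_dispersed, norm_concentrated)
```

With normalized magnitudes, the only difference between the two token sets is their angular spread, and the concentrated set yields the larger gradient norm in this toy setting.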


Furthermore, the nonlinear transformations inherent in LLM inference, including attention mechanisms and activation functions, are fundamentally angle-dependent and continuously modify the angles among token hidden states.


Uncovering Model Learning Dynamics: The Three Patterns


To effectively leverage angle concentration, the study investigated how this signal evolves and what characteristics define it. This led to the discovery of three distinct patterns:


1. Layer-wise Angle Concentration Pattern


Experiments conducted on models like Qwen2.5-0.5B-Instruct revealed that the angular distribution evolves systematically as the hidden states pass through the model layers.


  • In initial layers, the angles are primarily determined by the input embeddings, showing no distinct pattern.

  • As depth increases, the angles develop a segmented structure, where tokens within the same input segment (e.g., system prompt, few-shot examples, question) cluster more closely. This is intra-segment angle concentration.

  • In the final layers, the angles between tokens from different segments begin to converge, reaching the highest degree of concentration. This is inter-segment angle concentration.

This process demonstrates that the model first induces intra-segment concentration and subsequently promotes inter-segment concentration, facilitating collaborative information propagation. The final layer is crucial because inter-segment clustering is maximal there, resulting in the highest overall concentration.

These concentrations are formally measured by:

[Equation: definitions of the intra-segment concentration C_{intra} and the inter-segment concentration C_{inter}]

 where C_{intra} measures concentration within the question tokens, and C_{inter} measures concentration between the question tokens and the prompt/few-shot tokens.
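To give a feel for what such measures compute, here is a minimal sketch that assumes both quantities are mean pairwise cosine similarities over final-layer hidden states. That definition, the simple sum used for the combined score, and the function names are assumptions made for illustration; the paper's exact formulas may differ.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def intra_concentration(question):
    """Mean cosine similarity over all distinct pairs of question tokens."""
    pairs = [(i, j) for i in range(len(question))
             for j in range(i + 1, len(question))]
    return sum(cosine(question[i], question[j]) for i, j in pairs) / len(pairs)

def inter_concentration(question, prompt):
    """Mean cosine similarity between question tokens and prompt tokens."""
    sims = [cosine(q, p) for q in question for p in prompt]
    return sum(sims) / len(sims)

# Toy final-layer hidden states (2-D for readability).
prompt_states = [[1.0, 0.0], [0.9, 0.1]]
question_states = [[0.8, 0.2], [0.7, 0.3], [0.9, 0.2]]

c_intra = intra_concentration(question_states)
c_inter = inter_concentration(question_states, prompt_states)
combined = c_intra + c_inter  # assumption: the two terms are simply summed
print(c_intra, c_inter, combined)
```

Because only a forward pass over the input is needed to obtain hidden states, a score of this kind can be computed without generating any tokens.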


An attention-based explanation suggests that tokens with higher angle concentration correspond to higher attention scores. C_{intra} reflects the strength of attention within the question itself, while C_{inter} indicates the model's ability to follow instructions. The phenomenon of sink attention, where attention scores peak at the first token of a segment, further encourages this concentration.


2. Epoch-wise Angle Concentration Pattern


By monitoring angular concentrations throughout training, the study observed that both inter-segment (C_{inter}) and intra-segment (C_{intra}) angle concentrations progressively increase across training epochs. This convergence reinforces that angular concentration effectively mirrors training dynamics.

Figure: Qwen2.5-0.5B-Instruct and LLaMA3.2-1B-Instruct trained on GSM8K using GRPO for 250 epochs.

Interestingly, the intra-question concentration (C_{intra}) initially decreases before subsequently increasing. This suggests that the model initially prioritizes mastering instruction-following capabilities (reflected by C_{inter}) before refining its internal focus on individual questions.


3. Data-wise Angle Concentration Pattern


Crucially for efficient data scheduling, tracking the model’s performance on samples with varying angle concentrations revealed a curriculum-like trend: the model tends to learn from data with higher angle concentration before addressing data with lower angle concentration. For example, after 100 epochs, questions with the highest angle concentration were often answered correctly, while those with lower initial concentration were still answered incorrectly.


This pattern is explained through two lenses:


  • Gradient-based: When losses are relatively uniform early in training, samples with higher angle concentration receive stronger gradients and are learned faster. As high-angle samples are mastered (reducing their loss), low-angle samples inherit larger relative gradients and are learned next. This sequence creates a natural, angle-driven learning progression that is intuitive and model-centric, unlike traditional difficulty metrics.


  • Neuron-based: Tokens exhibiting higher angle concentration tend to activate similar neurons due to shared value patterns. Neurons that are commonly activated receive stronger cumulative gradients, leading to more effective training. The activation patterns converge during training, forming a distinct cluster correlated with high accuracy, suggesting this cluster encodes domain-specific knowledge.


Based on these findings, the third major insight is: Training should follow the model’s inherent dynamics, prioritizing higher-angle data early and gradually transitioning to lower-angle data to ensure effective gradient updates and improved efficiency.


The GAIN-RL Framework


Built upon the realization that the model’s intrinsic angle concentration signals reflect its learning priorities, the authors propose GAIN-RL (Gradient-driven Angle-Informed Navigated RL). This framework is designed as a plug-and-play acceleration solution compatible with any model and dataset, incurring negligible costs. GAIN-RL consists of three primary components (Algorithm 1):


1. Data Reordering Based on Angular Concentration (Step 1)


Before training commences, the data is reordered to reflect the model's preferred learning sequence.


  1. The model performs a single pre-filling pass on all data samples to collect angular information.

  2. The data is sorted in descending order based on the combined angle signal at the final layer.

  3. The sorted dataset D_s is then used for subsequent training.

This sorting process is computationally cheap. For instance, sorting approximately 7,000 samples of GSM8K using Qwen2.5-0.5B-Instruct took less than 10 minutes on a single NVIDIA A100 GPU. This contrasts sharply with previous methods that require manual annotation or large-model generation and often consume several days.
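The reordering step can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: `reorder_by_concentration` is a hypothetical helper, and the toy scores stand in for the combined final-layer angle signal that a real pre-filling pass would produce.

```python
def reorder_by_concentration(samples, score_fn):
    """Return samples sorted by score_fn, highest concentration first.

    Ties keep their original relative order.
    """
    scored = [(score_fn(s), idx, s) for idx, s in enumerate(samples)]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [s for _, _, s in scored]

# Hypothetical precomputed scores standing in for one pre-filling pass per sample.
toy_scores = {"q1": 1.2, "q2": 1.9, "q3": 0.7, "q4": 1.5}
sorted_dataset = reorder_by_concentration(list(toy_scores), toy_scores.get)
print(sorted_dataset)  # ['q2', 'q4', 'q1', 'q3']
```

The sort happens once before training, so its cost is dominated by the single forward pass per sample rather than by the sorting itself.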


2. Data Sampling Guided by Gaussian Probability (Step 2)


During training, data sampling is governed by a dynamic Gaussian distribution P_t parameterized by mean μ_t and variance σ_t. This distribution is applied over the sorted dataset D_s, ensuring that data points with higher angular concentration (closer to the start of D_s, or index i=0) are consistently prioritized in early epochs.

[Equation: Gaussian sampling probability P_t over the indices of the sorted dataset D_s]

Probabilistic sampling, rather than strictly sequential sampling, is employed to enhance the stability and robustness of the training process.
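A minimal sketch of Gaussian-weighted sampling over the sorted indices follows. It assumes an unnormalized Gaussian weight per index, treats `sigma` as a standard deviation, and samples with replacement; these choices and the function names are illustrative assumptions rather than details confirmed by the paper.

```python
import math
import random

def gaussian_weights(n, mu, sigma):
    """Unnormalized Gaussian weight for each index 0..n-1 of the sorted dataset."""
    return [math.exp(-((i - mu) ** 2) / (2 * sigma ** 2)) for i in range(n)]

def sample_batch(n, mu, sigma, batch_size, rng):
    """Draw batch indices (with replacement) in proportion to the Gaussian weights."""
    weights = gaussian_weights(n, mu, sigma)
    return rng.choices(range(n), weights=weights, k=batch_size)

rng = random.Random(0)
# Early in training the mean sits at index 0 (highest angle concentration).
early = sample_batch(n=1000, mu=0.0, sigma=50.0, batch_size=256, rng=rng)
# Later the mean has moved toward lower-concentration data.
later = sample_batch(n=1000, mu=400.0, sigma=50.0, batch_size=256, rng=rng)
print(sum(early) / len(early), sum(later) / len(later))
```

Because neighbouring indices still receive non-zero probability, each batch mixes samples around the current focus instead of walking through the sorted list in strict order.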


3. Dynamic Probability Update


The Gaussian mean μ_t dictates the region of peak sampling. It is initialized at μ_0 = 0 to target high-angle concentration data. As the model masters these samples, μ_t gradually increases, shifting the focus towards harder, lower-angle concentration data.

The update rule for μ_{t+1} incorporates real-time batch metrics: mean accuracy Acc^(t) and mean angle concentration C^(t):


[Equation: update rule for μ_{t+1} in terms of Acc^(t), C^(t), α, β, and γ]

Here, β is the target accuracy (set to 0.5 to maintain strong gradients), and α and γ are sensitivity parameters (tuned as α=2 and γ=0.5). This dynamic strategy maintains high-gradient training by ensuring that the model targets samples near the desired accuracy while efficiently incorporating progressively harder (lower-angle) data.


Crucially, computing these signals (Acc^(t) and C^(t)) requires no additional cost, as model inference is performed inherently during training.
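To illustrate how an update of this kind could behave, the sketch below uses a hypothetical functional form: two tanh-bounded terms, one driven by the gap between batch accuracy and the target β, the other by the batch's mean angle concentration. The tanh form and the `step_scale` parameter are assumptions made for illustration; only α, β, γ and their values come from the text above, and the paper's exact expression may differ.

```python
import math

def update_mean(mu, acc, concentration, step_scale,
                alpha=2.0, beta=0.5, gamma=0.5):
    """Hypothetical update of the Gaussian mean from batch statistics.

    acc:            mean batch accuracy, Acc^(t)
    concentration:  mean batch angle concentration, C^(t)
    step_scale:     assumed maximum shift contributed by each term
    """
    accuracy_term = math.tanh(alpha * (acc - beta))   # > 0 once accuracy exceeds the target
    angle_term = math.tanh(gamma * concentration)     # grows with angle concentration
    return mu + step_scale * (accuracy_term + angle_term)

mu = 0.0
for acc, conc in [(0.30, 0.9), (0.55, 1.1), (0.80, 1.3)]:
    mu = update_mean(mu, acc, conc, step_scale=64)
    print(round(mu, 2))
```

In this form the mean advances faster when the model already answers the current batch well, and more slowly (or not at all) when accuracy is below the target.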


Empirical Validation: Efficiency and Generalization


The effectiveness of GAIN-RL was evaluated along five dimensions: training efficiency, data efficiency, RL algorithm generalization, single-task performance, and ablation studies. Experiments used the DeepScaleR and DeepCoder datasets for training on math and code tasks, respectively, and were tested on standard benchmarks (e.g., GSM8K, MATH, LiveCodeBench).


Training Acceleration


GAIN-RL demonstrated significant hardware efficiency gains, consistently outperforming vanilla GRPO and the state-of-the-art curriculum learning baseline, ADARFT. Hardware efficiency was measured by the number of epochs required to match vanilla GRPO’s 200-epoch performance (Epo@Same Acc).

| Model (Math Tasks) | Method | Avg Epo@Same Acc | Speed Up |
| --- | --- | --- | --- |
| Qwen 2.5 Math 1.5B Instruct | GRPO | 200 | 1x |
| Qwen 2.5 Math 1.5B Instruct | ADARFT(GRPO) | 150 | 1.33x |
| Qwen 2.5 Math 1.5B Instruct | GAIN-RL(GRPO) | 80 | 2.50x |
| Qwen 2.5 Math 7B Instruct | GAIN-RL(GRPO) | 70 | 2.86x |

Overall, GAIN-RL(GRPO) achieved over a 2.5x acceleration across various models and tasks. For the Qwen2.5-Math-7B-Instruct model, GAIN-RL reached the performance target in just 70 epochs, achieving a speedup of 2.86x. In code generation tasks (Qwen 2.5 Coder 3B Instruct), GAIN-RL(GRPO) achieved a 1.81x speedup. This acceleration stems from the method’s ability to maintain strong gradient signals throughout training, leading to faster convergence and superior performance at every epoch.
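The speed-up figures follow directly from the epoch counts: each is the 200-epoch GRPO baseline divided by the epochs a method needs to reach the same accuracy. A short check of that arithmetic, using only the numbers reported above (the dictionary keys are shorthand labels):

```python
BASELINE_EPOCHS = 200  # vanilla GRPO reference point

epochs_to_match = {
    "ADARFT(GRPO), Qwen 2.5 Math 1.5B": 150,
    "GAIN-RL(GRPO), Qwen 2.5 Math 1.5B": 80,
    "GAIN-RL(GRPO), Qwen 2.5 Math 7B": 70,
}

speedups = {name: BASELINE_EPOCHS / epochs
            for name, epochs in epochs_to_match.items()}
for name, value in speedups.items():
    print(f"{name}: {value:.2f}x")
```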

Figure: Learning dynamics of different methods on (left) Qwen2.5-Math-1.5B-Instruct and (right) LLaMA-3.2-3B-Instruct.

When evaluated on single tasks, GAIN-RL also showed consistent superiority. On GSM8K, GAIN-RL(GRPO) achieved a 3.33x training speedup and a 4.72-percentage-point improvement in final accuracy (53.15% vs. 48.43% for GRPO).


Data Efficiency


To test data efficiency, the Qwen2.5-0.5B-Instruct model was fine-tuned using only half of the Math dataset under three sampling strategies: Uniform, High Angular Concentration-biased, and Low Angular Concentration-biased.


The results were striking: the model trained with High Angular Concentration-biased Sampling outperformed the model trained with the full dataset. This phenomenon is hypothesized to occur because low-angle concentration data tends to produce smaller gradient updates and activates dispersed neural regions; excluding this data significantly enhances training efficiency and data effectiveness. In contrast, uniform sampling on half the data performed on par with vanilla GRPO on the full dataset, while low-angle biased sampling led to unstable and poor results.

Figure: Data efficiency analysis of GAIN-RL. Half of the math dataset was sampled using three distinct sampling methods to train the Qwen2.5-0.5B-Instruct model with GAIN-RL(GRPO).

This finding provides new guidance for data selection in RFT: prioritizing high angular concentration data and discarding low angular concentration data can dramatically improve data efficiency.
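The three half-dataset strategies can be sketched as weighted sampling without replacement over the concentration-sorted indices. The linear rank weights and the `select_half` helper are assumptions chosen for illustration; the article does not specify how the biased subsets were actually drawn.

```python
import random

def select_half(n, strategy, rng):
    """Pick n // 2 distinct indices from a dataset sorted by descending concentration."""
    k = n // 2
    if strategy == "uniform":
        return rng.sample(range(n), k)
    if strategy == "high":
        weights = [n - i for i in range(n)]  # favor early (high-concentration) indices
    elif strategy == "low":
        weights = [i + 1 for i in range(n)]  # favor late (low-concentration) indices
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Weighted sampling without replacement: draw key u ** (1 / w) per item
    # and keep the k largest keys (Efraimidis-Spirakis scheme).
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    return [i for _, i in sorted(keys, reverse=True)[:k]]

rng = random.Random(0)
for strategy in ("uniform", "high", "low"):
    subset = select_half(1000, strategy, rng)
    print(strategy, sum(subset) / len(subset))  # mean rank of the chosen half
```

A lower mean rank indicates that the subset leans toward high-concentration samples, which is the regime the experiment above found most effective.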


Generalization Across RL Algorithms


To ensure broad applicability, GAIN-RL was combined with PPO (Proximal Policy Optimization). Results demonstrated substantial gains in both performance and hardware efficiency. GAIN-RL(PPO) achieved an average of 2.2x training speedup across three mathematical benchmarks (GSM8K, MATH, AMC 23). This consistency validates GAIN-RL’s universal acceleration capability because gradient updates are fundamental across various RL algorithms.


Ablation Studies and Scalability


A thorough ablation study confirmed that the primary performance advantages of GAIN-RL stem from the synergy between data ordering and dynamic sampling/probability updates.

  • Variants using only accuracy-based or angle-based probability updates, while better than vanilla GRPO, did not match the full GAIN-RL framework, confirming that the combined signal captures multiple crucial aspects of the gradient.

  • An "Accuracy-only" group, which discarded fully correct data (a technique sometimes used to reduce costs), showed a marked performance drop caused by "forgetting" after prematurely discarding data, underscoring the necessity of angle-informed data scheduling.


Furthermore, GAIN-RL demonstrated robustness and stability in the Small-Batch Scalability Test. Even when the training batch size was reduced by half (from 1024 to 512), GAIN-RL maintained performance comparable to vanilla GRPO trained with the full batch size. This scalability is beneficial for environments with limited computational or memory resources, offering flexibility to accelerate training by lowering batch size with only minor performance degradation.


Conclusion and Future Directions


The GAIN-RL framework introduces a novel, model-centric paradigm for Reinforcement Fine-tuning. By leveraging the model’s intrinsic angle concentration signal, GAIN-RL dynamically selects training data to ensure consistently impactful gradient updates, significantly enhancing overall training efficiency. Empirical evidence validates its capacity to achieve over 2.5x acceleration and superior performance with reduced datasets.


The discovery that angle concentration fundamentally mirrors information propagation and learning dynamics suggests that this angle-based signal can be generalized beyond RFT. Future work is planned to investigate leveraging this signal in other domains, such as pre-training (evaluating learning capacity across domains in real-time) and inference (tracking angle changes to assess model comprehension and suggest test-time adjustments). GAIN-RL represents a critical step towards remedying the sub-optimality of current RFT methods through highly effective, model-informed data processing.


References


[1] Wang, Q., Ke, J., Ye, H., Lin, Y., Fu, Y., Zhang, J., ... & Chen, Y. (2025). Angles Don't Lie: Unlocking Training-Efficient RL Through the Model's Own Signals. arXiv preprint arXiv:2506.02281.


[2] Taiwei Shi, Yiyang Wu, Linxin Song, Tianyi Zhou, and Jieyu Zhao. Efficient reinforcement finetuning via adaptive curriculum learning. arXiv preprint arXiv:2504.05520, 2025.



[4] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.


[5] Ziniu Li, Tian Xu, Yushun Zhang, Zhihang Lin, Yang Yu, Ruoyu Sun, and Zhi-Quan Luo.


[6] Jian Hu. Reinforce++: A simple and efficient approach for aligning large language models. arXiv preprint arXiv:2501.03262, 2025.


[7] Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. Limo: Less is more for reasoning. arXiv preprint arXiv:2502.03387, 2025.


[8] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025.
