- [2025-10-21] 📢 Our work was covered by Synced (机器之心)!
- [2025-10-10] ✨ Code is now available.
- [2025-09-30] 📄 Our paper is released on arXiv.
Clone the repository:

```bash
git clone https://github.com/RyanLiu112/AttnRL.git
cd AttnRL
```

Create a new conda environment and install the dependencies:

```bash
conda create -n attnrl python=3.10
conda activate attnrl
bash scripts/install_vllm_sglang_mcore.sh
```
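After the install script completes, a quick sanity check can confirm the environment is usable. This is a minimal sketch; it assumes the script installed vLLM and a CUDA-enabled PyTorch into the `attnrl` environment:

```bash
# Verify core dependencies are importable and a GPU is visible.
python -c "import torch, vllm; print(torch.__version__, vllm.__version__, torch.cuda.is_available())"
```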
The training dataset (DeepScaleR-Preview-Dataset) is located at `data/train/deepscaler_train.parquet` and contains 40.3k mathematical reasoning problems. The evaluation datasets are in `data/eval/`, where the suffix `_${K}` in a filename indicates the number of duplicated samples per question.
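To see how the data is laid out, the parquet files can be inspected directly. This is an illustrative sketch, assuming `pandas` and `pyarrow` are available in the environment; the evaluation filename mentioned in the comment is hypothetical:

```bash
# Peek at the training set: row count and column schema.
python -c "
import pandas as pd
df = pd.read_parquet('data/train/deepscaler_train.parquet')
print(df.shape)              # expect ~40.3k rows
print(df.columns.tolist())
"

# List the evaluation sets; a file like aime24_32.parquet (hypothetical name)
# would contain each question duplicated 32 times.
ls data/eval/
```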
For training AttnRL with the DeepSeek-R1-Distill-Qwen-1.5B backbone on 8 H100 GPUs, run:

```bash
bash recipe/attnrl/run_attnrl_r1_distill_1.5b_8k.sh
```

The evaluation scripts are the same as the training scripts: add `+trainer.val_only=True` to perform evaluation only. We recommend setting `data.max_prompt_length=2048` and `data.max_response_length=32768`; see the sketch below.
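For example, an evaluation-only run might look like the following. This is a hedged sketch: it assumes the launch script forwards extra Hydra-style overrides to the trainer, as is common in verl recipes:

```bash
# Evaluation-only run with the recommended prompt/response lengths.
bash recipe/attnrl/run_attnrl_r1_distill_1.5b_8k.sh \
    +trainer.val_only=True \
    data.max_prompt_length=2048 \
    data.max_response_length=32768
```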
If you find this work helpful, please cite our paper:
```bibtex
@article{AttnRL,
    title   = {Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models},
    author  = {Liu, Runze and Wang, Jiakang and Shi, Yuling and Xie, Zhihui and An, Chenxin and Zhang, Kaiyan and Zhao, Jian and Gu, Xiaodong and Lin, Lei and Hu, Wenping and Li, Xiu and Zhang, Fuzheng and Zhou, Guorui and Gai, Kun},
    journal = {arXiv preprint arXiv:2509.26628},
    year    = {2025}
}
```

Our code is built on verl (commit) and TreeRL. The training dataset is from DeepScaleR-Preview-Dataset, and the rule-based verifier is based on Skywork-OR1 and Archer.