Iteration 34. Multi-turn RL
18-10-2025
Goal
Implement a script to do multi-turn RL training, and test if it has a noticeable effect on model accuracy.
Motivation
In Iteration 28 I saw that the BARC induction model is not good at refining its predictions. That forces us to simply make independent predictions with the model.
But that is not efficient: we should take previous predictions into account to avoid repeating errors and to benefit from the execution feedback.
All the evolutionary test-time compute methods are based on the capability of the model to use feedback from execution.
Development
Unsloth GRPO does not support Iterable datasets
python scripts/multi-turn_rl_code_finetuning.py \
--epochs 1 \
--output-dir /mnt/hdd0/Kaggle/arc25/trainings/2025-10-18-debug-multi-turn-RL/baseline
[rank0]: NotImplementedError: Iterable datasets are not yet supported in GRPOTrainer. Please use a standard dataset instead.
After changing from Dataset to IterableDataset I got this unpleasant surprise.
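So the prompts have to be materialized into a regular map-style Dataset. A minimal sketch of how a prompt generator could be turned into one with `Dataset.from_generator` (the generator content below is just a placeholder, not the real prompt builder):

```python
from datasets import Dataset

def prompt_generator():
    # Placeholder generator: in the real script each item would hold a BARC task prompt.
    for i in range(3):
        yield {"prompt": f"Task {i}: write a python function that solves the puzzle."}

# Dataset.from_generator materializes the examples into a map-style Dataset,
# which is what GRPOTrainer expects; passing an IterableDataset raises NotImplementedError.
train_dataset = Dataset.from_generator(prompt_generator)
```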
Proof of concept with pre-generated responses
The easiest way to test the concept is to build a dataset where I generate predictions for each task and pick one that is not correct. This is exactly what I did in Iteration 28, but instead of doing it at test time, I need to do it at training time using training data.
So the best option is to take the BARC dataset and make predictions for its tasks.
Making 8 predictions for 1000 tasks takes around one hour on a single GPU. A good proof of concept will require between 10k and 20k prompts; that is roughly the scale I'm currently training on with RL before the training collapses.
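Below is a minimal sketch of how a refinement candidate could be picked from the pre-generated predictions; the field names (`train_is_correct`, `train_pixel_score`) are illustrative, not necessarily the ones used in the scripts.

```python
def pick_refinement_candidate(predictions):
    """Pick one prediction that does not solve the task, to be refined in a 2nd turn.

    `predictions` is assumed to be a list of dicts with illustrative fields:
    the generated code, `train_is_correct` (does it solve all training examples?)
    and `train_pixel_score` (how close its outputs are to the correct grids).
    """
    wrong = [p for p in predictions if not p["train_is_correct"]]
    if not wrong:
        return None  # every prediction already solves the task, nothing to refine
    # Keep the best failing prediction, e.g. the one with the highest pixel score
    return max(wrong, key=lambda p: p.get("train_pixel_score", 0.0))
```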
Inference Dataset preparation
To prepare the dataset for inference I'm going to reuse the notebook notebooks/016_prepare_BARC_data_for_training.ipynb.
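A rough sketch of how the 10k-task parts could be written to json.gz files (the actual preparation happens in the notebook, so this is only illustrative):

```python
import gzip
import json
import random

def save_dataset_parts(tasks, output_folder, part_size=10_000, n_parts=2, seed=7):
    """Shuffle the BARC tasks and save `n_parts` disjoint chunks as json.gz files."""
    random.Random(seed).shuffle(tasks)
    for part in range(1, n_parts + 1):
        chunk = tasks[(part - 1) * part_size: part * part_size]
        filepath = f"{output_folder}/dataset_10k_part{part}.json.gz"
        with gzip.open(filepath, "wt") as f:
            json.dump(chunk, f)
```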
Inference
export PART=1
python scripts/inference_with_BARC.py \
--n-predictions 8 \
--dataset-path /mnt/hdd0/Kaggle/arc25/data/200k_HEAVY_gpt4o-description-gpt4omini-code_generated_problems/dataset_10k_part${PART}.json.gz \
--use-data-augmentation \
--output-folder /mnt/hdd0/Kaggle/arc25/predictions/2025-10-18-barc-inference/part${PART}
export PART=2
python scripts/inference_with_BARC.py \
--n-predictions 8 \
--dataset-path /mnt/hdd0/Kaggle/arc25/data/200k_HEAVY_gpt4o-description-gpt4omini-code_generated_problems/dataset_10k_part${PART}.json.gz \
--use-data-augmentation \
--output-folder /mnt/hdd0/Kaggle/arc25/predictions/2025-10-18-barc-inference/part${PART}
Dataset for 2nd turn conversation preparation
I have done this work in the already existing notebook notebooks/014_refine_solutions.ipynb.
The maximum prompt length is 8511 tokens, so I can keep the training parameters as they were.
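Conceptually, each 2nd-turn example is the original prompt, the failed attempt and the execution feedback, followed by a request to fix the code. A hedged sketch of the message structure (the exact wording used in the notebook may differ):

```python
def build_refinement_conversation(task_prompt, previous_code, execution_feedback):
    """Build a 2-turn chat where the model is asked to refine its failed attempt.

    All arguments are illustrative: the real notebook builds the feedback from
    the outputs obtained when executing the predicted code on the task inputs.
    """
    return [
        {"role": "user", "content": task_prompt},
        {"role": "assistant", "content": previous_code},
        {"role": "user", "content": (
            "The previous program does not solve the task.\n"
            f"Execution feedback:\n{execution_feedback}\n"
            "Write an improved program that fixes these errors."
        )},
    ]
```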
Cluster experiments
export BETA=0.02
export MAX_GRAD_NORM=0.05
export REPETITION_PENALTY=1.02
export FOLDER=2025-10-19-multi-turn-rl
export LEARNING_RATE=4e-6
export NUM_GENERATIONS=32
export ACUM_STEPS=4
export N_CPUS=20
export LORA_R=1
export EPOCHS=1
export REWARD_NAME=arc-v2-no-pixel-score
export EXPERIMENT_NAME=${LORA_R}lora_lr${LEARNING_RATE}_${MAX_GRAD_NORM}max-grad-norm_${REWARD_NAME}_${NUM_GENERATIONS}gen_${ACUM_STEPS}accum-steps_repetition-penalty-${REPETITION_PENALTY}_masked-truncate_unquantized_beta${BETA}
condor_submit train.condor command="
python /mnt/scratch/users/gbarbadillo/arc25/arc25/scripts/multi-turn_rl_code_finetuning.py \
--lora_r ${LORA_R} \
--beta ${BETA} \
--max-grad-norm ${MAX_GRAD_NORM} \
--no-load-in-4bit \
--reward-name ${REWARD_NAME} \
--num-generations ${NUM_GENERATIONS} \
--gradient-accumulation-steps ${ACUM_STEPS} \
--learning-rate ${LEARNING_RATE} \
--repetition-penalty ${REPETITION_PENALTY} \
--epochs ${EPOCHS} \
--mask-truncated-completions \
--scale-rewards batch \
--gpu_memory_utilization 0.3 \
--warmup-ratio 0.01 \
--max-seq-length 9700 \
--max-completion-length 1024 \
--n-jobs ${N_CPUS} \
--save-steps 200 \
--model-path /mnt/scratch/users/gbarbadillo/arc25/models/Llama-3.1-ARC-Potpourri-Induction-8B \
--dataset-path /mnt/scratch/users/gbarbadillo/arc25/data/barc/refine_dataset.json.gz \
--output-dir /mnt/scratch/users/gbarbadillo/arc25/trainings/${FOLDER}/${EXPERIMENT_NAME}" -append request_gpus=1 -append request_cpus=${N_CPUS} -append request_memory=128G -append 'requirements = (TARGET.Machine == "calculon21.das-nano.com")'
# 245114.0
rsync -aPv -m --include='*/' --exclude *.pt --include='checkpoint-19699/***' --exclude='*' calculon01:/mnt/scratch/users/gbarbadillo/arc25/trainings/2025-10-19-multi-turn-rl /mnt/data/MEGA/TEMP/
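For reference, the main hyperparameters above map roughly to trl's GRPOConfig as in the sketch below (assuming a recent trl version; this is not the exact configuration code of the training script):

```python
from trl import GRPOConfig

# Sketch of how the exported hyperparameters could map to a GRPOConfig.
# Field availability depends on the trl version; the actual script may differ.
config = GRPOConfig(
    output_dir="trainings/2025-10-19-multi-turn-rl",
    learning_rate=4e-6,
    beta=0.02,                      # KL penalty coefficient
    max_grad_norm=0.05,
    num_generations=32,             # completions sampled per prompt
    gradient_accumulation_steps=4,
    max_completion_length=1024,
    warmup_ratio=0.01,
    save_steps=200,
    num_train_epochs=1,
)
```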
Results
If we compare the base model making 128 independent predictions against making 64 predictions and then using the fine-tuned model to refine the best ones that do not solve the task, we see a small improvement in the metrics when evaluating on the ARC-AGI-1 evaluation set.
| initial predictions | refinement predictions | valid code | valid outputs | unique outputs | train_pixel_score | train_correct_grids | train_pass_rate | train_is_correct | test_pixel_score | test_correct_grids | test_pass_rate | test_is_correct | is_correct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 128 | 0 | 99.9% | 71.7% | 49.8% | 42.1% | 2.4% | 1.6% | 16.3% | 40.9% | 2.0% | 2.0% | 23.0% | 16.3% |
| 64 | 64 | 93.7% | 76.5% | 42.5% | 48.7% | 3.1% | 1.6% | 18.5% | 47.2% | 2.5% | 2.4% | 24.5% | 17.8% |
The improvement is clearer if we compare against the previous refinement experiment.
| fine-tuned model | valid code | valid outputs | unique outputs | train_pixel_score | train_correct_grids | train_pass_rate | train_is_correct | test_pixel_score | test_correct_grids | test_pass_rate | test_is_correct | is_correct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| no | 99.7% | 74.0% | 43.7% | 45.8% | 2.1% | 1.1% | 16.5% | 44.4% | 1.7% | 1.6% | 21.5% | 16.0% |
| yes | 93.7% | 76.5% | 42.5% | 48.7% | 3.1% | 1.6% | 18.5% | 47.2% | 2.5% | 2.4% | 24.5% | 17.8% |
The fine-tuning metrics look healthy, although towards the end of the training the model starts generating long completions.
Conclusion
We have observed a small improvement (16.3% -> 17.8% is_correct) when doing prediction refinement with a model fine-tuned with RL for that purpose. With stable RL training and enough time and compute, this small improvement could perhaps be made bigger.
Next steps
- Is RL the best way to teach the model to refine its predictions? Maybe we should use supervised learning first, which has a stronger learning signal.
TODO
- As a first step I have to modify the current RL script to train on a generator
- Create two datasets of 10k tasks from BARC
- Generate predictions to create a 2nd turn dataset for RL
- Prepare the dataset for training
- Train 2nd turn RL
- Evaluate using the same setup from Iteration 28
