# Iteration 8. Improve HER
10-05-2025
## Goal
Can I optimize Hindsight Experience Replay (HER) to draw a chick?
## Motivation
I have already shown that HER enables a model trained to solve tasks with up to 5 objects to solve a task with 25 squares, but can it solve arbitrary images?
The idea is to optimize the algorithm and its parameters so it is able to draw the chick from the image above. If I can do that, I will be confident moving to the next step, which will be to extend the DSL. So far I have been unable to solve the chick task: an accuracy of around 97-98% is reached, but perfect accuracy eludes me.
## Development
I will be working on the notebook `006_HER_v2`.
### Train a more powerful base model
#### Improve GPU usage
I believe I can speed up the training just by using a bigger batch size per device.
```bash
python finetuning.py --output-dir /mnt/hdd0/Kaggle/arc25/trainings/20250511_optimize_GPU_sage/random_seed_5_no_dora --device-map auto --random-seed 5 --max-steps 50 --n-gpus 1 --per-device-train-batch-size 1 --batch-size 16 --max-seq-len 4096
python finetuning.py --output-dir /mnt/hdd0/Kaggle/arc25/trainings/20250511_optimize_GPU_sage/per-device-batch-size-8 --device-map auto --random-seed 5 --max-steps 20 --n-gpus 1 --per-device-train-batch-size 8 --batch-size 16 --max-seq-len 4096
python finetuning.py --output-dir /mnt/hdd0/Kaggle/arc25/trainings/20250511_optimize_GPU_sage/per-device-batch-size-4_2gpus --device-map auto --random-seed 5 --max-steps 20 --n-gpus 2 --per-device-train-batch-size 8 --batch-size 16 --max-seq-len 4096
```
```bash
# https://ironbar.github.io/arc24/modeling/Iteration_50_last_trainings/#steps-to-train-the-model
accelerate launch --num_processes 2 --num_machines 1 --mixed_precision bf16 --multi_gpu \
    finetuning.py --output-dir /mnt/hdd0/Kaggle/arc25/trainings/20250511_optimize_GPU_sage/per-device-batch-size-8_2gpus_accelerate --device-map None --random-seed 5 --max-steps 40 --n-gpus 2 --per-device-train-batch-size 8 --batch-size 16 --max-seq-len 512
```
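Since `--batch-size 16` is the effective batch size, the remainder has to come from gradient accumulation. Assuming `finetuning.py` derives the accumulation steps the standard way (this is an assumption, not confirmed from the script), the relationship is:

```python
def gradient_accumulation_steps(batch_size: int,
                                per_device_batch_size: int,
                                n_gpus: int) -> int:
    """Accumulation steps so that
    per_device_batch_size * n_gpus * accumulation_steps == batch_size."""
    assert batch_size % (per_device_batch_size * n_gpus) == 0
    return batch_size // (per_device_batch_size * n_gpus)

# The commands above, all with --batch-size 16:
print(gradient_accumulation_steps(16, 1, 1))  # 16 accumulation steps
print(gradient_accumulation_steps(16, 8, 1))  # 2
print(gradient_accumulation_steps(16, 8, 2))  # 1
```

A larger per-device batch means fewer accumulation steps per optimizer update, which is where the throughput gain should come from.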
Results with one GPU:

| per-device batch size | train samples/s |
|---|---|
| 1 | 6.15 |
| 2 | 10.1 |
| 4 | 16 |
| 8 | 17 |

Two GPUs are not working yet.
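A quick sketch of what the measurements above imply in relative terms, using the table's numbers:

```python
# Measured throughput on one GPU (train samples/s), from the table above
throughput = {1: 6.15, 2: 10.1, 4: 16.0, 8: 17.0}

for batch, samples_per_s in throughput.items():
    speedup = samples_per_s / throughput[1]
    print(f"per-device batch {batch}: {speedup:.1f}x over batch size 1")
```

The scaling is close to linear up to batch size 4 and then flattens (2.6x at 4 vs 2.8x at 8), suggesting the GPU is near saturation at a per-device batch size of 8.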
#### Training command
Previously I trained the longest model for 6k steps; with the new setup that would take just 1h15. So 16k steps would take around 3 hours and 32k steps around 6 hours.
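The estimates follow from simple linear scaling of the measured 1h15 for 6k steps:

```python
hours_for_6k = 1.25  # 1h15 measured with the new setup
hours_per_step = hours_for_6k / 6000

for steps in (16000, 32000):
    print(f"{steps} steps ~ {steps * hours_per_step:.1f} hours")  # 3.3 and 6.7 hours
```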
I'm using the same LoRA configuration as in the previous trainings.
```bash
export CUDA_VISIBLE_DEVICES=0
python finetuning.py --output-dir /mnt/hdd0/Kaggle/arc25/trainings/20250511_longer_trainings/32k_steps \
--device-map auto \
--max-steps 32000 \
--n-gpus 1 \
--per-device-train-batch-size 8 \
--batch-size 16 \
--max-seq-len 512 \
--logging-steps 100 \
--save-steps 1000 \
--lora-r 32 \
--use-dora \
--use-rslora
```
### Improvements to the algorithm
- When multiple new tasks share the same output, choose the task with the shortest code
- Do not train multiple times on the same tasks
- Add wandb logging with images
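The first two improvements could be sketched as follows. The names (`select_new_tasks`, the `(output, code)` pairs) are illustrative, not the actual implementation:

```python
def select_new_tasks(candidates):
    """Keep one task per unique output, preferring the shortest code.

    Using the output as the dictionary key also guarantees we do not
    train multiple times on the same task.
    """
    best = {}
    for output, code in candidates:
        if output not in best or len(code) < len(best[output]):
            best[output] = code
    return best

candidates = [
    ("img_a", "draw_square(0, 0, 5); draw_square(1, 1, 3)"),
    ("img_a", "draw_square(0, 0, 5)"),  # same output, shorter code wins
    ("img_b", "draw_line(2, 2, 8)"),
]
print(select_new_tasks(candidates))
```

Preferring the shortest program per output biases the replay buffer toward simpler solutions, which should make the tasks added by HER easier to learn from.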
## Results
The updated implementation consistently solves all the tasks from iteration 5. For example, it solves the 25-squares task in 10-12 minutes, whereas the previous implementation took 15 minutes and, I believe, was not as consistent.
Moreover, I have been able to solve the chick task, but not consistently. I believe I might need a better model to solve it consistently, because the task does not require more lines of code, but higher precision.
### Using a model trained for 32k steps
Can a model trained for 32k steps (instead of 6k steps) solve the chick task consistently?
The table below shows the number of epochs needed to solve each task.
| task | 6k steps model | 32k steps model |
|---|---|---|
| 9-vertical-lines | 2 | 2 |
| 12-squares | 4 | 2 |
| 16-squares | 7 | 4 |
| 20-squares | 7 | 7 |
| 25-squares | 10 | 11 |
In general, the model trained for longer solves the tasks in fewer epochs, as expected.
However, it does not solve the chick task consistently. After inspecting the best predictions for each epoch, I see that it got every detail right at some point, but never all in the same epoch.
### Exploration and exploitation
Maybe I have to combine a high temperature for exploration with a low temperature for exploitation.
```python
inference_params = [
    InferenceParams(num_return_sequences=8, temperature=0.1),
    InferenceParams(num_return_sequences=128, temperature=0.9),
]
```
When using just the high temperature, the success rate on the chick task was 2/9; after combining exploration and exploitation it increased to 4/5.
Thus we could say that we have achieved the goal of solving the chick task consistently.
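A minimal sketch of how the two sampling configurations could drive one epoch of generation. The `generate` callable and its interface are assumptions for illustration, not the real model call:

```python
from dataclasses import dataclass

@dataclass
class InferenceParams:
    num_return_sequences: int
    temperature: float

def sample_epoch(generate, prompt, inference_params):
    """Collect generations from every sampling configuration in one epoch."""
    generations = []
    for params in inference_params:
        generations.extend(generate(prompt, params))
    return generations

# Low temperature exploits what the model already knows;
# high temperature explores alternative programs.
inference_params = [
    InferenceParams(num_return_sequences=8, temperature=0.1),
    InferenceParams(num_return_sequences=128, temperature=0.9),
]

def fake_generate(prompt, params):  # stand-in for the real model call
    return [f"code@T={params.temperature}"] * params.num_return_sequences

print(len(sample_epoch(fake_generate, "draw the chick", inference_params)))  # 136
```

Per epoch this yields 8 + 128 = 136 candidate programs, with the cheap low-temperature samples locking in the details the model is already confident about.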
### Number of generations
It might be better to use an even smaller number of generations, because that would make the policy change more smoothly. My current implementation is not efficient for small batch sizes, but a future implementation might be.

The optimal number of generations per epoch depends on the complexity of the task. It seems that 16 could be a good choice because it works well in all cases: it requires between 2 and 5 times fewer generations than generating 128 predictions per epoch.

We might think that the best option would be a single prediction and generation per epoch, but that does not seem to be the case, considering that 4 generations per epoch needs more total generations than 16 generations per epoch. Or maybe it is a problem with my implementation.

The problem is that my current implementation is much more efficient with large batch sizes, so despite these plots it is faster to make 128 generations per epoch.
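The trade-off can be made explicit: what matters is the total sampling budget, epochs times generations per epoch. The epoch counts below are hypothetical, chosen only to mirror the qualitative behavior described above, not measured values:

```python
def total_generations(generations_per_epoch: int, epochs_needed: int) -> int:
    """Total sampling budget spent before the task is solved."""
    return generations_per_epoch * epochs_needed

# Hypothetical epoch counts, for illustration only: a smaller batch per
# epoch only wins if it does not need proportionally more epochs.
for per_epoch, epochs in [(4, 60), (16, 10), (128, 3)]:
    print(per_epoch, "per epoch ->", total_generations(per_epoch, epochs), "generations")
```

With these numbers 16 per epoch spends the least total budget, 128 per epoch overspends, and 4 per epoch loses because the extra epochs outweigh the smaller batches, which matches the pattern in the plots.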
### Visualization of solving the chick task
## Conclusion
In this iteration I have improved the HER algorithm and solved the chick task consistently. I have seen that combining low and high temperatures when sampling helps with consistency: that is the old exploration-exploitation dilemma.
## Next steps
- Start working with ARC tasks
- Make the script work with accelerate
- There might be a problem with the train dataset: the generator function is called 4-5 times. This might require setting the random seed to None.



