# Iteration 10. Try to solve real ARC tasks
22-05-2025
## Goal
Can I solve real ARC tasks with code and HER?
## Motivation
I have already seen that HER enables generalization to novel toy tasks; now I need to check whether it can solve real ARC tasks.
I know that the primitive functions I defined for ARC24 solved 285 training tasks, so the easiest path is probably to review those transformations, add them, and modify them where needed.
## Development
### Add safety and determinism checks
Inspired by *Absolute Zero: Reinforced Self-play Reasoning with Zero Data*, I'm going to add safety and determinism checks.
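As a rough sketch of what these checks could look like (all names here are hypothetical, not the actual implementation): run the candidate code in a subprocess with a timeout, and consider a transformation deterministic only if repeated runs on the same input produce identical outputs.

```python
import multiprocessing as mp

def _run_transform(code, grid, queue):
    # Execute the candidate code with restricted builtins and call its transform().
    safe_globals = {"__builtins__": {"range": range, "len": len, "min": min, "max": max}}
    exec(code, safe_globals)  # the candidate code is expected to define transform(grid)
    queue.put(safe_globals["transform"](grid))

def run_sandboxed(code, grid, timeout=2.0):
    """Safety check: run candidate code in a subprocess and kill it on timeout."""
    queue = mp.Queue()
    process = mp.Process(target=_run_transform, args=(code, grid, queue))
    process.start()
    process.join(timeout)
    if process.is_alive():
        process.terminate()
        raise TimeoutError("candidate code timed out")
    return queue.get(timeout=1.0)  # raises queue.Empty if the code crashed

def is_deterministic(code, grid, n_runs=2):
    """Determinism check: repeated runs on the same input must give identical outputs."""
    outputs = [run_sandboxed(code, grid) for _ in range(n_runs)]
    return all(output == outputs[0] for output in outputs[1:])
```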
### Generation functions
LLMs are incredibly useful for writing generation functions. For example, I asked o3 to write a function that creates ARC images with random objects, and it worked perfectly.
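For illustration, a minimal sketch of this kind of generator (not the actual o3-written function) that places random rectangles on an ARC-style grid:

```python
import random
import numpy as np

def create_random_image(max_side=30, max_objects=6, n_colors=9):
    """Create an ARC-like grid with random rectangular objects on a black background."""
    height, width = random.randint(5, max_side), random.randint(5, max_side)
    img = np.zeros((height, width), dtype=np.uint8)
    for _ in range(random.randint(1, max_objects)):
        color = random.randint(1, n_colors)
        obj_h, obj_w = random.randint(1, 4), random.randint(1, 4)
        row = random.randint(0, height - obj_h)
        col = random.randint(0, width - obj_w)
        img[row:row + obj_h, col:col + obj_w] = color
    return img
```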
### Stats about current implementation
```
Found 23 training tasks
There are 17 DSL functions defined in arc25.dsl:
DSL functions used in 1000 tasks:
detect_objects 621 times
draw_object 425 times
create_img 395 times
crop 123 times
draw_rectangle 122 times
draw_vertical_line 111 times
draw_horizontal_line 98 times
draw_line 94 times
draw_pixel 91 times
mode 49 times
apply_colormap 46 times
downscale 33 times
rotate_90 32 times
flip 28 times
upscale 27 times
pad 17 times
trim 15 times
There are 13 DSL attributes defined in arc25.dsl:
DSL attributes used in 1000 tasks:
change_color 380 times (Object)
area 89 times (Object)
height 72 times (BoundingBox, Object)
width 66 times (BoundingBox, Object)
is_horizontal_line 48 times (Object)
is_rectangle 46 times (Object)
move 45 times (Object)
is_square 42 times (Object)
is_vertical_line 40 times (Object)
center 38 times (Object)
is_line 32 times (Object)
is_point 31 times (Object)
copy 0 times (Object)
```
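These counts can be gathered, for instance, by walking the AST of each task's solution code and counting occurrences of DSL names; a sketch (the surrounding data structures and the way DSL names are obtained are assumptions):

```python
import ast
from collections import Counter

def count_dsl_usage(task_sources, dsl_names):
    """Count occurrences of DSL names (as plain names or attribute accesses) in task code."""
    counts = Counter()
    for source in task_sources:
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Name) and node.id in dsl_names:
                counts[node.id] += 1
            elif isinstance(node, ast.Attribute) and node.attr in dsl_names:
                counts[node.attr] += 1
    return counts

# usage sketch: dsl_names would come from introspecting arc25.dsl
# counts = count_dsl_usage([task.code for task in tasks], dsl_names)
```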
This is clearly not enough, but I want to train a model on these tasks and see if it can solve any of the ARC training tasks.
## Training
### Local experiments
```bash
export N_GPUS=2
export PARAMETERS=0.5B
export STEPS=10
export MAXSEQLEN=3072
accelerate launch --num_processes ${N_GPUS} --num_machines 1 --mixed_precision bf16 --multi_gpu \
scripts/finetuning.py \
--model_path /home/gbarbadillo/models/Qwen2.5-Coder-${PARAMETERS}-Instruct/ \
--output-dir /mnt/hdd0/Kaggle/arc25/trainings/2025-06-10-first-real-trainings/3090-GPUS${N_GPUS}-Qwen2.5-Coder-${PARAMETERS}-${STEPS}steps-${MAXSEQLEN}msl \
--device-map None \
--max-steps ${STEPS} \
--n-gpus ${N_GPUS} \
--per-device-train-batch-size 2 \
--per-device-eval-batch-size 4 \
--batch-size 32 \
--max-seq-len ${MAXSEQLEN} \
--logging-steps 100 \
--eval-steps 0 \
--save-steps 1000 \
--lora-r 32 \
--use-dora \
--use-rslora \
--no-resume_from_checkpoint
export MAXSEQLEN=8192
accelerate launch --num_processes ${N_GPUS} --num_machines 1 --mixed_precision bf16 --multi_gpu \
scripts/finetuning.py \
--model_path /home/gbarbadillo/models/Qwen2.5-Coder-${PARAMETERS}-Instruct/ \
--output-dir /mnt/hdd0/Kaggle/arc25/trainings/2025-06-10-first-real-trainings/3090-GPUS${N_GPUS}-Qwen2.5-Coder-${PARAMETERS}-${STEPS}steps-${MAXSEQLEN}msl \
--device-map None \
--max-steps ${STEPS} \
--n-gpus ${N_GPUS} \
--per-device-train-batch-size 1 \
--per-device-eval-batch-size 2 \
--batch-size 32 \
--max-seq-len ${MAXSEQLEN} \
--logging-steps 100 \
--eval-steps 0 \
--save-steps 1000 \
--lora-r 32 \
--use-dora \
--use-rslora
export N_GPUS=1
export CUDA_VISIBLE_DEVICES=0
export MAXSEQLEN=8192
python scripts/finetuning.py \
--model_path /home/gbarbadillo/models/Qwen2.5-Coder-${PARAMETERS}-Instruct/ \
--output-dir /mnt/hdd0/Kaggle/arc25/trainings/2025-06-10-first-real-trainings/3090-GPUS${N_GPUS}-Qwen2.5-Coder-${PARAMETERS}-${STEPS}steps-${MAXSEQLEN}msl \
--device-map None \
--max-steps ${STEPS} \
--n-gpus ${N_GPUS} \
--per-device-train-batch-size 1 \
--per-device-eval-batch-size 2 \
--batch-size 32 \
--max-seq-len ${MAXSEQLEN} \
--logging-steps 100 \
--eval-steps 0 \
--save-steps 1000 \
--lora-r 32 \
--use-dora \
--use-rslora
```
It is training at around 2.66 s/it; task generation does not seem to be the bottleneck. At the beginning of the training multiple cores were used for sampling, but afterwards CPU usage was low. Probably the queue was already filled.
```
# --max-seq-len 3072
{'train_runtime': 298.5646, 'train_samples_per_second': 10.718, 'train_steps_per_second': 0.335, 'train_loss': 0.24244340896606445, 'epoch': 1.0}
# export MAXSEQLEN=6144
{'train_runtime': 363.6891, 'train_samples_per_second': 8.799, 'train_steps_per_second': 0.275, 'train_loss': 0.23476869583129883, 'epoch': 1.0}
# export MAXSEQLEN=8192
2025-06-11 06:32:08,621 - __main__ - INFO - log_prompt_length_percentiles - train number of prompts: 1000, max number of tokens : 5108, percentiles: {50: 1249, 75: 1567, 90: 1960, 95: 2069, 97: 2139}
{'train_runtime': 374.4698, 'train_samples_per_second': 8.545, 'train_steps_per_second': 0.267, 'train_loss': 0.23707759857177735, 'epoch': 1.0}
export N_GPUS=1
export CUDA_VISIBLE_DEVICES=0
export MAXSEQLEN=8192
{'train_runtime': 671.3956, 'train_samples_per_second': 4.766, 'train_steps_per_second': 0.149, 'train_loss': 0.23653331756591797, 'epoch': 1.0}
```
To be safe I should probably use max-seq-len=8192, otherwise we will be missing some training tasks.
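The percentile log above is what drives this decision; a minimal sketch of how such statistics can be computed (not the project's actual implementation):

```python
import numpy as np
from transformers import AutoTokenizer

def log_prompt_length_percentiles(prompts, model_path, percentiles=(50, 75, 90, 95, 97)):
    """Tokenize all prompts and report length percentiles to help choose max-seq-len."""
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    lengths = [len(tokenizer(prompt)["input_ids"]) for prompt in prompts]
    stats = {p: int(np.percentile(lengths, p)) for p in percentiles}
    print(f"number of prompts: {len(lengths)}, max tokens: {max(lengths)}, percentiles: {stats}")
```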
### Cluster
```bash
export N_GPUS=2
export PARAMETERS=0.5B
export LEARNING_RATE=2e-4
export STEPS=32000; condor_submit train.condor command="
accelerate launch --num_processes ${N_GPUS} --num_machines 1 --mixed_precision bf16 --multi_gpu \
/mnt/scratch/users/gbarbadillo/arc25/arc25/scripts/finetuning.py \
--model_path /mnt/scratch/users/gbarbadillo/arc25/models/Qwen2.5-Coder-${PARAMETERS}-Instruct/ \
--output-dir /mnt/scratch/users/gbarbadillo/arc25/trainings/2025-06-13-first-real-trainings/${N_GPUS}xA6000-Qwen2.5-Coder-${PARAMETERS}-${STEPS}steps-${LEARNING_RATE}lr \
--device-map None \
--max-steps ${STEPS} \
--n-gpus ${N_GPUS} \
--learning-rate ${LEARNING_RATE} \
--per-device-train-batch-size 2 \
--per-device-eval-batch-size 4 \
--batch-size 32 \
--max-seq-len 8192 \
--logging-steps 10 \
--eval-steps 50 \
--save-steps 200 \
--lora-r 32 \
--use-dora \
--use-rslora" -append request_gpus=${N_GPUS} -append request_cpus=12
export N_GPUS=2
export PARAMETERS=0.5B
export STEPS=32000
export LEARNING_RATE=4e-5; condor_submit train.condor command="
accelerate launch --num_processes ${N_GPUS} --num_machines 1 --mixed_precision bf16 --multi_gpu \
/mnt/scratch/users/gbarbadillo/arc25/arc25/scripts/finetuning.py \
--model_path /mnt/scratch/users/gbarbadillo/arc25/models/Qwen2.5-Coder-${PARAMETERS}-Instruct/ \
--output-dir /mnt/scratch/users/gbarbadillo/arc25/trainings/2025-06-14-full-fine-tuning/${N_GPUS}xA6000-Qwen2.5-Coder-${PARAMETERS}-${STEPS}steps-${LEARNING_RATE}lr \
--device-map None \
--max-steps ${STEPS} \
--n-gpus ${N_GPUS} \
--learning-rate ${LEARNING_RATE} \
--per-device-train-batch-size 2 \
--per-device-eval-batch-size 4 \
--batch-size 32 \
--max-seq-len 6144 \
--logging-steps 10 \
--eval-steps 50 \
--save-steps 200 \
--no-use-lora" -append request_gpus=${N_GPUS} -append request_cpus=12
export N_GPUS=1
export PARAMETERS=0.5B
export STEPS=1000
export LEARNING_RATE=4e-5; condor_submit train.condor command="
python \
/mnt/scratch/users/gbarbadillo/arc25/arc25/scripts/finetuning.py \
--model_path /mnt/scratch/users/gbarbadillo/arc25/models/Qwen2.5-Coder-${PARAMETERS}-Instruct/ \
--output-dir /mnt/scratch/users/gbarbadillo/arc25/trainings/2025-06-13-first-real-trainings/${N_GPUS}xA6000-Qwen2.5-Coder-${PARAMETERS}-${STEPS}steps-${LEARNING_RATE}lr \
--device-map None \
--max-steps ${STEPS} \
--n-gpus ${N_GPUS} \
--learning-rate ${LEARNING_RATE} \
--per-device-train-batch-size 2 \
--per-device-eval-batch-size 4 \
--batch-size 32 \
--max-seq-len 8192 \
--logging-steps 10 \
--eval-steps 50 \
--save-steps 1000 \
--lora-r 32 \
--use-dora \
--use-rslora" -append request_gpus=${N_GPUS}
# Copy the results to MEGA
rsync -P -r calculon01:/mnt/scratch/users/gbarbadillo/arc25/trainings/2025-06-13-first-real-trainings /mnt/data/MEGA/TEMP --exclude wandb/*
```
### Debugging
```bash
export N_GPUS=2
export PARAMETERS=0.5B
export STEPS=10
export MAXSEQLEN=8192
accelerate launch --num_processes ${N_GPUS} --num_machines 1 --mixed_precision bf16 --multi_gpu \
scripts/finetuning.py \
--model_path /home/gbarbadillo/models/Qwen2.5-Coder-${PARAMETERS}-Instruct/ \
--output-dir /mnt/hdd0/Kaggle/arc25/trainings/2025-06-10-first-real-trainings/3090-GPUS${N_GPUS}-Qwen2.5-Coder-${PARAMETERS}-${STEPS}steps-${MAXSEQLEN}msl \
--device-map None \
--max-steps ${STEPS} \
--n-gpus ${N_GPUS} \
--per-device-train-batch-size 1 \
--per-device-eval-batch-size 2 \
--batch-size 32 \
--max-seq-len ${MAXSEQLEN} \
--logging-steps 100 \
--eval-steps 0 \
--save-steps 1000 \
--lora-r 32 \
--use-dora \
--use-rslora \
--no-resume_from_checkpoint
export N_GPUS=1
export PARAMETERS=0.5B
export STEPS=10
export MAXSEQLEN=8192
python \
scripts/finetuning.py \
--model_path /home/gbarbadillo/models/Qwen2.5-Coder-${PARAMETERS}-Instruct/ \
--output-dir /mnt/hdd0/Kaggle/arc25/trainings/2025-06-10-first-real-trainings/3090-GPUS${N_GPUS}-Qwen2.5-Coder-${PARAMETERS}-${STEPS}steps-${MAXSEQLEN}msl \
--device-map None \
--max-steps ${STEPS} \
--n-gpus ${N_GPUS} \
--per-device-train-batch-size 1 \
--per-device-eval-batch-size 2 \
--batch-size 32 \
--max-seq-len ${MAXSEQLEN} \
--logging-steps 100 \
--eval-steps 0 \
--save-steps 1000 \
--lora-r 32 \
--use-dora \
--use-rslora \
--no-resume_from_checkpoint \
--dataloader-num-workers 0
# this works: --dataloader-num-workers 0
# this does not: --dataloader-num-workers 1
# it is unrelated to: os.environ['TOKENIZERS_PARALLELISM'] = 'true'
```
I'm seeing a new error on the cluster.
```
AttributeError: Can't pickle local object 'SFTTrainer._prepare_dataset.<locals>.add_eos'
```
```
# local libraries
accelerate 1.6.0 pypi_0 pypi
torch 2.6.0 pypi_0 pypi
transformers 4.51.3 pypi_0 pypi
datasets 3.5.1 pypi_0 pypi
trl 0.18.0.dev0
# cluster libraries (experiments run on docker)
accelerate==1.7.0
torch==2.6.0
transformers==4.52.4
datasets==3.6.0
trl==0.18.1
# local experiments updating library versions
trl==0.18.0 -> works
trl==0.18.1 -> works
accelerate==1.7.0 -> works
transformers==4.52.4 -> works
datasets==3.6.0 -> works
```
```
# adding this line at the start of the script reproduces the problem locally
import multiprocessing as mp
mp.set_start_method("spawn", force=True)
> [rank0]: AttributeError: Can't pickle local object 'SFTTrainer._prepare_dataset.<locals>.add_eos'
# adding this other line to see what it is printed
import multiprocessing as mp, os
print(">>> multiprocessing start-method:", mp.get_start_method(), "PID:", os.getpid())
# local response
>>> multiprocessing start-method: fork PID: 19840
>>> multiprocessing start-method: fork PID: 19841
# cluster response
>>> multiprocessing start-method: fork PID: 57
>>> multiprocessing start-method: fork PID: 58
# adding this line at the start does not solve the problem in the cluster
mp.set_start_method("fork", force=True)
- https://github.com/pytorch/pytorch/blob/v2.7.0/torch/utils/data/dataloader.py#L173
- https://docs.pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
- https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods
- https://wandb.ai/guillermobarbadillo/2025-05-21-model-size/runs/3g8opphj/files/requirements.txt: on the last successful training in the cluster I used trl==0.17.0
Tried trl==0.17.0 on job 204192. No success.
What is the problem? It seems that pickle cannot serialize functions defined inside other functions:
```
AttributeError: Can't pickle local object 'SFTTrainer._prepare_dataset.<locals>.add_eos'
AttributeError: Can't pickle local object 'truncate_dataset.<locals>.truncate'
```
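The failure is easy to reproduce in isolation: pickle can only serialize functions that are importable by name from module top level, so any function defined inside another function fails.

```python
import pickle

def outer():
    def add_eos(x):  # local function, analogous to SFTTrainer._prepare_dataset.<locals>.add_eos
        return x + "<eos>"
    return add_eos

try:
    pickle.dumps(outer())
except AttributeError as error:
    print(error)  # Can't pickle local object 'outer.<locals>.add_eos'
```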
So I have changed how the dataset generator works to avoid entering that `SFTTrainer._prepare_dataset` code path. I have also patched torch's `DataLoader` to use the `fork` multiprocessing context by default, replacing `multiprocessing_context=None,` with `multiprocessing_context='fork',` in `dataloader.py`:

```bash
# local environment
sed -i.bak "0,/multiprocessing_context[[:space:]]*=[[:space:]]*None,/s//multiprocessing_context='fork',/" \
    /home/gbarbadillo/miniconda3/envs/arc25-clone/lib/python3.10/site-packages/torch/utils/data/dataloader.py
# cluster environment
sed -i.bak "0,/multiprocessing_context[[:space:]]*=[[:space:]]*None,/s//multiprocessing_context='fork',/" \
    /mnt/scratch/users/gbarbadillo/arc25/cached-environments/venv_07bdecf0b823319f4d2fcbe9cdc354d9/lib/python3.10/site-packages/torch/utils/data/dataloader.py
```
## Results
### Training Hyperparameters
For a batch size of 32 and a LoRA rank of 32, a learning rate of 2e-4 seems to work well. 1e-3 is too high; 4e-4 also works, but for longer trainings 2e-4 might be the better option.
Since I only have 23 training tasks, I don't expect to see relevant improvements by using a batch size bigger than 32. So I'm not going to do experiments with the batch size.
Wandb experiment, filter by `1000steps`.
For longer trainings 1e-4 might be a better option.
### Fine-tuning model capacity
The following table shows the validation loss for trainings of different lengths.
| steps | LoRA 32 | full fine-tuning |
|---|---|---|
| 8k | 0.0198 | 0.0178 |
| 16k | 0.0165 | 0.0155 |
| 32k | 0.0149 | 0.0123 |
Full fine-tuning achieves a lower validation loss while also being faster: training for 32k steps took 24h with full fine-tuning and 30h with LoRA.
I have seen that learning stalls at the end of training, but there is no quick fix for this because the schedulers don't support a minimum learning rate value.
Finally, full fine-tuning required a slightly lower learning rate (4e-5 vs. 1e-4).
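Regarding the scheduler limitation above, one possible workaround (a sketch, not something I used in these experiments) is a cosine schedule with a learning-rate floor built on `torch.optim.lr_scheduler.LambdaLR`:

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

def cosine_with_min_lr(optimizer, total_steps, min_lr_ratio=0.1, warmup_steps=0):
    """Cosine decay that never goes below min_lr_ratio * base_lr."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        cosine = 0.5 * (1 + math.cos(math.pi * min(1.0, progress)))
        return min_lr_ratio + (1 - min_lr_ratio) * cosine
    return LambdaLR(optimizer, lr_lambda)

# usage
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
scheduler = cosine_with_min_lr(optimizer, total_steps=32000, min_lr_ratio=0.1)
```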
### Trying to solve real ARC tasks
- `08ed6ac`: does not understand that the task is about sorting and changing colors.
- `0b148d6`: understands that the task is about detecting objects and cropping, but does not know to use the color.
- `0ca9ddb`: seems to understand that the task is about drawing pixels, but it does not use the center as a reference. Nor does it use the color to select certain objects.
- `0d3d703e`: the model does not recognize that the task is about `apply_colormap`. I should create more tasks showing how to change colors, not always changing all the colors.
- `178fcbfb`: does understand that the task is about drawing horizontal and vertical lines, but does not know to use the center as a reference.
- `1bfc4729`: understands that it needs to draw some pattern, but does not have a way to make a different drawing for each image.
- `1c786137`: does not understand that the task is about selecting the object.
It seems that the main problem is that the model does not have good intuitions about how to solve the tasks. In some cases it simply does not use the correct primitive function; in other cases it is missing examples of how to use it. If the initial direction is not correct, HER cannot help to reach the desired goal.
I might introduce diversity in the generations by suggesting the use of some DSL primitive functions, as sketched below.
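A simple way to do this (a hypothetical sketch) would be to sample a few primitive names and append them as a hint to the generation prompt:

```python
import random

DSL_FUNCTIONS = ["detect_objects", "draw_object", "create_img", "crop", "apply_colormap",
                 "downscale", "rotate_90", "flip", "upscale", "pad", "trim"]

def add_primitive_hint(prompt, n_suggestions=3):
    """Append a hint with randomly sampled DSL primitives to diversify generations."""
    suggested = random.sample(DSL_FUNCTIONS, n_suggestions)
    return prompt + "\nHint: consider using these DSL functions: " + ", ".join(suggested)
```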
## Conclusion
I have run the first experiments trying to solve real ARC tasks. So far no task has been solved, but I have evaluated only a tiny subset of the ARC training tasks. In the next iteration I would like to solve at least the 7 tasks analyzed here.
## Next steps
- Hypothesis: If I implement a DSL that covers the whole training and evaluation set, it should generalize to the test set.
## TODO
- Add safety and determinism checks
- Add more primitive functions and training tasks to learn to use them
- I would like to have a list of all the primitive functions from the DSL and how many times they are used in the training tasks. A correlation plot would also be nice, to see which connections are missing.
- Is the sampling speed enough?
- Stats about the input token distribution: what should the max-seq-len be?
- Optimize learning rate and batch size for 2 GPUs.
- Create a notebook to evaluate the trained models on real ARC tasks
- I need a way to do evaluation at scale, using multiple GPUs, and saving all the generated tasks when searching for a solution.
- If possible I should use Kaggle compute for evaluation. It is almost free and is a good way to store and visualize results.
- When studying how the method works on real ARC tasks, I believe I should reuse the DSL analysis from the training tasks. That way I can see if it is using the right abstractions and if it is combining them correctly.
- Compositionality: can the model solve a task that selects the biggest object, then crops and trims it? That would be a good example of compositionality, because those functions were not used together in the dataset.
- Sequential solving. Try also solving tasks in multiple steps, not just in one. It could help with compositionality; see the sketch below.
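A rough sketch of what sequential solving could look like (the `generate_code` and `run_sandboxed` helpers are hypothetical): predict a transformation, apply it, feed the intermediate grids back as the new inputs, and repeat until the training outputs match or the step budget is exhausted.

```python
def solve_sequentially(model, task, max_steps=4):
    """Try to solve a task in multiple steps, feeding intermediate grids back as inputs."""
    grids = [example["input"] for example in task["train"]]
    targets = [example["output"] for example in task["train"]]
    code_steps = []
    for _ in range(max_steps):
        code = generate_code(model, inputs=grids, outputs=targets)  # hypothetical helper
        grids = [run_sandboxed(code, grid) for grid in grids]       # hypothetical helper
        code_steps.append(code)
        if grids == targets:  # all training pairs solved
            return code_steps
    return None  # no solution found within the step budget
```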