Iteration 12. Solve a few ARC tasks
17-06-2025
Goal
Probe whether I can solve a few selected ARC tasks by using an LLM to write code.
Motivation
In the previous iteration (Iteration 10) I tried to solve a few ARC tasks without
success: 08ed6ac7, 0b148d64, 0ca9ddb6, 0d3d703e, 178fcbfb, 1bfc4729, 1c786137. The goal of this iteration
is to solve all those tasks by implementing new training tasks and/or improving the solving algorithm.
I should avoid creating training tasks that are clones of the real ARC tasks; otherwise I cannot measure the generalization capability of the model. My goal should be to write training tasks that teach the core knowledge needed for ARC.
Development
New tasks to implement
- Sort objects and do something to them based on the order. I can sort objects based on: area, x, y. I can move the objects or change their colors. This requires more control over the input images (see the sketch after this list).
- Learn to use the color of the object. Let's focus on monochrome objects for now. Based on the color of the object, something is done (move, change color, crop).
- Aggregate properties and use them to select, e.g. most/least popular color/area/shape...
- Learn to draw using object center as a reference, points, lines (also vertical and horizontal), rectangles...
- Create more tasks with apply_colormap
- Learn to draw using color of the objects as a reference
- More tasks about selecting an object that has some unique or extreme property
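As an illustration of the first item (sorting objects and acting on them based on the order), here is a minimal sketch of such a training task, assuming grids are numpy arrays with background 0. The helper name, the use of `scipy.ndimage` and the palette are assumptions for illustration, not part of the existing arc25 DSL.

```python
# Hypothetical training task: recolor each object according to its rank when
# the objects are sorted by area (largest first). Not the actual arc25 code.
import numpy as np
from scipy import ndimage


def recolor_objects_by_area_rank(grid: np.ndarray, palette=(1, 2, 3, 4)) -> np.ndarray:
    """Return a copy of `grid` where the largest object gets palette[0],
    the second largest palette[1], and so on. Background is assumed to be 0."""
    labeled, n_objects = ndimage.label(grid != 0)
    areas = ndimage.sum(grid != 0, labeled, index=range(1, n_objects + 1))
    order = np.argsort(areas)[::-1]  # object labels sorted by decreasing area
    output = np.zeros_like(grid)
    for rank, label_idx in enumerate(order):
        output[labeled == label_idx + 1] = palette[rank % len(palette)]
    return output
```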
Training
export N_GPUS=2
export PARAMETERS=0.5B
export LEARNING_RATE=1e-4
export STEPS=2000; condor_submit train.condor command="
accelerate launch --num_processes ${N_GPUS} --num_machines 1 --mixed_precision bf16 --multi_gpu \
/mnt/scratch/users/gbarbadillo/arc25/arc25/scripts/finetuning.py \
--model_path /mnt/scratch/users/gbarbadillo/arc25/models/Qwen2.5-Coder-${PARAMETERS}-Instruct/ \
--output-dir /mnt/scratch/users/gbarbadillo/arc25/trainings/2025-06-18-more-training-tasks/${N_GPUS}xA6000-Qwen2.5-Coder-${PARAMETERS}-${STEPS}steps-${LEARNING_RATE}lr \
--device-map None \
--max-steps ${STEPS} \
--n-gpus ${N_GPUS} \
--learning-rate ${LEARNING_RATE} \
--per-device-train-batch-size 2 \
--per-device-eval-batch-size 4 \
--batch-size 32 \
--max-seq-len 6144 \
--logging-steps 10 \
--eval-steps 50 \
--save-steps 200 \
--lora-r 32 \
--use-dora \
--use-rslora" -append request_gpus=${N_GPUS} -append request_cpus=8
export N_GPUS=2
export PARAMETERS=1.5B
export LEARNING_RATE=1e-4
export STEPS=16000; condor_submit train.condor command="
accelerate launch --num_processes ${N_GPUS} --num_machines 1 --mixed_precision bf16 --multi_gpu \
/mnt/scratch/users/gbarbadillo/arc25/arc25/scripts/finetuning.py \
--model_path /mnt/scratch/users/gbarbadillo/arc25/models/Qwen2.5-Coder-${PARAMETERS}-Instruct/ \
--output-dir /mnt/scratch/users/gbarbadillo/arc25/trainings/2025-06-18-more-training-tasks/${N_GPUS}xA6000-Qwen2.5-Coder-${PARAMETERS}-${STEPS}steps-${LEARNING_RATE}lr \
--device-map None \
--max-steps ${STEPS} \
--n-gpus ${N_GPUS} \
--learning-rate ${LEARNING_RATE} \
--per-device-train-batch-size 1 \
--per-device-eval-batch-size 2 \
--batch-size 32 \
--max-seq-len 6144 \
--logging-steps 10 \
--eval-steps 50 \
--save-steps 200 \
--lora-r 32 \
--use-dora \
--use-rslora" -append request_gpus=${N_GPUS} -append request_cpus=8
export N_GPUS=2
export PARAMETERS=3B
export LEARNING_RATE=1e-4
export STEPS=16000; condor_submit train.condor command="
accelerate launch --num_processes ${N_GPUS} --num_machines 1 --mixed_precision bf16 --multi_gpu \
/mnt/scratch/users/gbarbadillo/arc25/arc25/scripts/finetuning.py \
--model_path /mnt/scratch/users/gbarbadillo/arc25/models/Qwen2.5-Coder-${PARAMETERS}-Instruct/ \
--output-dir /mnt/scratch/users/gbarbadillo/arc25/trainings/2025-06-18-more-training-tasks/${N_GPUS}xA6000-Qwen2.5-Coder-${PARAMETERS}-${STEPS}steps-${LEARNING_RATE}lr \
--device-map None \
--max-steps ${STEPS} \
--n-gpus ${N_GPUS} \
--learning-rate ${LEARNING_RATE} \
--per-device-train-batch-size 1 \
--per-device-eval-batch-size 2 \
--batch-size 32 \
--max-seq-len 6144 \
--logging-steps 10 \
--eval-steps 50 \
--save-steps 200 \
--lora-r 32 \
--use-dora \
--use-liger-kernel \
--use-rslora" -append request_gpus=${N_GPUS} -append request_cpus=8
export N_GPUS=2
export PARAMETERS=7B
export LEARNING_RATE=1e-4
export STEPS=16000; condor_submit train.condor command="
accelerate launch --num_processes ${N_GPUS} --num_machines 1 --mixed_precision bf16 --multi_gpu \
/mnt/scratch/users/gbarbadillo/arc25/arc25/scripts/finetuning.py \
--model_path /mnt/scratch/users/gbarbadillo/arc25/models/Qwen2.5-Coder-${PARAMETERS}-Instruct/ \
--output-dir /mnt/scratch/users/gbarbadillo/arc25/trainings/2025-06-18-more-training-tasks/${N_GPUS}xA6000-Qwen2.5-Coder-${PARAMETERS}-${STEPS}steps-${LEARNING_RATE}lr \
--device-map None \
--max-steps ${STEPS} \
--n-gpus ${N_GPUS} \
--learning-rate ${LEARNING_RATE} \
--per-device-train-batch-size 1 \
--per-device-eval-batch-size 2 \
--batch-size 32 \
--max-seq-len 5120 \
--logging-steps 10 \
--eval-steps 50 \
--save-steps 200 \
--lora-r 32 \
--use-dora \
--use-liger-kernel \
--use-rslora" -append request_gpus=${N_GPUS} -append request_cpus=8
rsync -P -r calculon01:/mnt/scratch/users/gbarbadillo/arc25/trainings/2025-06-18-more-training-tasks /mnt/data/MEGA/TEMP --exclude wandb/* --exclude *.pt
Results
Influence of training steps and diversity of predictions
When making predictions with a model trained for 8k steps, I was surprised to see that it produced only one unique prediction (the rest were repetitions). The table below shows the number of unique and valid predictions for the Qwen2.5-Coder-0.5B model as a function of training steps. The total number of predictions was 136.

| task \ steps (k) | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| 1bfc4729 | 2 | 16 | 10 | 1 | 14 |
| 0ca9ddb6 | 69 | 31 | 20 | 17 | 11 |
| 178fcbfb | 5 | 10 | 5 | 8 | 6 |
| 0d3d703e | 121 | 120 | 115 | 14 | 88 |

The relation is unclear and inconsistent across tasks, so for now it does not seem that training for longer consistently reduces prediction diversity.
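For reference, this is a minimal sketch of how those counts can be computed, assuming each prediction is stored as the output grid produced by a generated program (or None when the program failed); the storage format is an assumption, not the actual arc25 code.

```python
# Minimal sketch: count unique and valid predictions among the 136 generated ones.
import numpy as np


def count_unique_valid(predictions: list) -> int:
    seen = set()
    for grid in predictions:
        if grid is None:  # invalid: the generated program crashed or returned nothing
            continue
        # Convert to an immutable representation so grids can be deduplicated in a set.
        seen.add(tuple(map(tuple, np.asarray(grid).tolist())))
    return len(seen)
```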
Analysis of trying to solve the tasks
| task \ model | 0.5B@1k steps | 1.5B@1k steps | 1.5B@16k steps | 3B@1k steps | 7B@1k steps |
|---|---|---|---|---|---|
| 08ed6ac7 | does not understand that the task is about changing colors, sorting the objects by area | does not understand that the task is about changing colors, sorting the objects by area | | | |
| 0b148d64 | the most successful approach is downscaling instead of selecting and cropping | OOM | | | |
| 0ca9ddb6 | draws 3 points, tries to use area for color. Tried an attempt to use the color as input | draws 2 points, tries to use area for color | draws 2 points, tries to use area for color | draws 4 points, but doesn't understand that color depends on the object color, tries to use the area | draws 3 points, then starts to draw lines |
| 0d3d703e | does not understand that is about colormaps | does not understand that is about colormaps | Solved at epoch 6 | Solved at epoch 2 | Solved at epoch 3 |
| 178fcbfb | draws vertical or horizontal lines, but not both | draws vertical and horizontal lines, but does not understand there is a condition | only vertical lines, very low diversity | draws vertical and horizontal lines, but does not understand there is a condition | draws vertical and horizontal lines, but does not understand there is a condition |
| 1bfc4729 | only horizontal lines | only horizontal lines | does not understand the task, draws horizontal lines on the points and the rest is garbage | low diversity in predictions, does not improve over horizontal lines | many different predictions, but not in the correct direction |
| 1c786137 | chooses the object using height instead of area, maybe another property is needed. Probably color should be used | does not understand the task | OOM | | |
Thoughts
- I have the feeling that bigger models do better
- I have solved the first real ARC task; although it was very simple, it required adaptation with HER
- But the lack of generalization is worrying; maybe the training data generation strategy is not the best
- Lack of creativity: the model only does what it has learned to do during training
- HER works, but needs a model with diverse predictions and good intuition
If the model is in the right direction, I believe it's very likely that HER will help to reach the correct solution. However, so far the model lacks the ability to understand the tasks and use the appropriate DSL primitives to solve the problem.
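To make the HER idea concrete, here is a rough sketch of the hindsight relabelling step, with simplified, hypothetical data structures (not the actual arc25 implementation): a program that produces a wrong output is still a correct program for the output it actually produced, so the attempt can be relabelled as a new supervised example.

```python
# Sketch of hindsight relabelling for program induction. All names are placeholders.
from dataclasses import dataclass


@dataclass
class Attempt:
    code: str        # program written by the model
    inputs: list     # training input grids of the task
    predicted: list  # grids that the program actually produced (None if it crashed)
    targets: list    # grids that the task expects


def hindsight_relabel(attempts: list) -> list:
    """Turn failed attempts into valid supervised examples for fine-tuning."""
    examples = []
    for attempt in attempts:
        if attempt.predicted is None:  # nothing to learn from a crashed program
            continue
        if attempt.predicted == attempt.targets:
            continue  # already correct, handled by the normal training path
        examples.append({
            "inputs": attempt.inputs,
            "outputs": attempt.predicted,  # pretend this was the intended task
            "code": attempt.code,
        })
    return examples
```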
Another problem is the low diversity in the proposed solutions. For some task-model combinations it is as low as proposing the same solution over and over. Reinforcement learning requires exploration to solve a problem, and in many cases the solution space is not being explored correctly.
Deep learning works when the training set densely covers the space. That is not the case for the current training tasks. It was the case for the toy drawing problem, because the space was small. However, as the DSL grows, that becomes more and more difficult.
Conclusion
In this iteration I have prepared new sample tasks to teach the model how to use the DSL. Despite this work, only one real ARC task was solved (and it simply required applying a colormap).
I have to rethink the approach, because the current implementation does not explore the solution space correctly: it only explores a small fraction of it and repeats the same errors over and over.
Next steps
- Better sampling strategy. I could play with temperature, top_k and top_p to create more diverse samples (see the sketch after this list). https://huggingface.co/docs/transformers/v4.52.3/en/main_classes/text_generation#transformers.GenerationConfig.temperature
- Better training objective. label_smoothing_factor might be used to preserve entropy. https://huggingface.co/docs/transformers/v4.52.3/en/main_classes/trainer#transformers.TrainingArguments.label_smoothing_factor
- Use already solved ARC tasks for validation. That way I could better measure the effect of the training tasks.
- Reread the transduction and induction paper, and its code.
- What if I give hints about how to solve the problem? Is the model capable in that case?
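A sketch of the two knobs mentioned above, using the standard transformers APIs; the concrete values are placeholders to be tuned, not recommendations.

```python
# Sampling and label-smoothing knobs, using the documented transformers parameters.
from transformers import GenerationConfig, TrainingArguments

# More diverse sampling when searching for solutions.
generation_config = GenerationConfig(
    do_sample=True,
    temperature=1.0,   # values above 1.0 increase diversity
    top_k=50,
    top_p=0.95,
    max_new_tokens=1024,
)

# Preserve some output entropy during fine-tuning.
training_args = TrainingArguments(
    output_dir="tmp",
    label_smoothing_factor=0.1,  # >0 keeps the output distribution from collapsing
)
```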
TODO
- Write new training tasks to solve the current knowledge gaps of the model
- I need a way to do evaluation at scale, using multiple GPUs, and saving all the generated tasks when searching for a solution.
- If possible I should use Kaggle compute for evaluation. It is almost free and is a good way to store and visualize results.
- Compositionality: can the model solve a task that selects the biggest object, crops it and trims it? That would be a good example of compositionality because those functions were not used together in the dataset.
- Sequential solving. Try also solving the tasks in multiple steps, not just once. It could help with compositionality (see the sketch after this list).
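A possible shape for the sequential-solving idea, sketched with placeholder names; the step-proposal function and the stopping criterion are assumptions, not decisions already made.

```python
# Sketch of sequential solving: instead of asking the model for one full program,
# ask it for one step at a time and feed the intermediate grid back as the new input.
from typing import Callable
import numpy as np

Step = Callable[[np.ndarray], np.ndarray]


def solve_sequentially(grid: np.ndarray,
                       propose_step: Callable[[np.ndarray], Step],
                       max_steps: int = 4) -> np.ndarray:
    """Apply model-proposed steps one at a time until the grid stops changing."""
    for _ in range(max_steps):
        step = propose_step(grid)  # e.g. the LLM writes a small one-step program
        new_grid = step(grid)
        if np.array_equal(new_grid, grid):  # no progress: stop
            break
        grid = new_grid
    return grid
```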