Iteration 37. Optimize code generation
10-10-2024
Goal
We have verified that we can solve ARC tasks by generating python code. Let's try to understand the dynamics of this new training and optimize the hyperparameters.
Motivation
Solving tasks with code works, but can I optimize and improve the accuracy of the model?
Development
Results
How do the training steps affect the accuracy?
We don't see a clear relation between the number of training steps and model accuracy for this version of omni-arc.
Is it helpful to learn to do other tasks?
The mean pass_n of the experiments that only learn one task is 3.1%, while for the experiments that learn multiple tasks it is 4.1%. So in this experiment learning multiple tasks is clearly beneficial.
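As a note on the metric, assuming pass_n is the usual pass@k estimate (a task counts as solved if at least one of the n generated programs is correct), a minimal sketch of how the mean value could be computed with the standard unbiased estimator; the sample counts below are hypothetical and only for illustration:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimator: probability that at least one of
    k samples, drawn without replacement from n generations of which c are
    correct, solves the task."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: 32 programs generated per task, and the per-task
# counts of correct programs (assumption, not the real experiment data).
correct_counts = [0, 0, 1, 0, 3, 0, 0, 0, 2, 0]
n, k = 32, 8
mean_pass_k = sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
print(f"mean pass_{k}: {mean_pass_k:.1%}")
```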
Is it helpful to use a temperature different than 0?
There is a lot of uncertainty in the results, so the best way to study the trend is to compute the mean value across all the experiments.
The improvement is not huge, but on average we get better results when using a temperature of 0.7.
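For context, a minimal sketch of what the two decoding settings look like, assuming a Hugging Face transformers generation setup (the actual inference stack is not specified here, and the model name is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder checkpoint, not necessarily the one used here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Write a python function that solves the following ARC task: ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# "Temperature 0": greedy decoding, a single deterministic candidate.
greedy = model.generate(**inputs, do_sample=False, max_new_tokens=256)

# Temperature 0.7: sampling produces diverse candidates, which is what
# makes generating several attempts per task (and thus pass_n) worthwhile.
sampled = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.7,
    num_return_sequences=8,
    max_new_tokens=256,
)
```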
Is there any difference between prompts?
| prompt_version | experiment 1 pass_n | experiment 2 pass_n |
|---|---|---|
| 0 | 0.035 | 0.0475 |
| 1 | 0.0425 | 0.0425 |
| 2 | 0.0325 | 0.0425 |
There isn't a clear winner.
What if I train on omni-arc just to create the output grid?
| experiment | pass_n | vote_2 | vote_1 |
|---|---|---|---|
| code-from-examples | 5.00% | 5.00% | 5.00% |
| output-from-examples | 9.25% | 7.50% | 6.13% |
In the best case we solve 5% of the evaluation tasks with the code approach. If we train on the same data but predict the grids directly, we get almost double the pass_n, although vote_1 is much closer.
So maybe the current performance is about as good as we can expect given the amount and quality of the data we have.
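Assuming vote_k means that a task counts as solved when the correct grid is among the k most-voted predictions, a minimal sketch of that aggregation (the function name is hypothetical):

```python
from collections import Counter

def solved_by_vote_k(predictions: list[str], correct: str, k: int) -> bool:
    """Return True if the correct grid is among the k most common predictions.

    `predictions` are serialized output grids (e.g. strings), one per
    generated program or per direct grid prediction.
    """
    most_voted = [grid for grid, _ in Counter(predictions).most_common(k)]
    return correct in most_voted

# Hypothetical example: 6 attempts, grid "A" is predicted 3 times.
preds = ["A", "B", "A", "C", "A", "B"]
print(solved_by_vote_k(preds, correct="A", k=1))  # True: "A" is the top vote
print(solved_by_vote_k(preds, correct="B", k=1))  # False: "B" is only second
print(solved_by_vote_k(preds, correct="B", k=2))  # True: "B" is within the top 2
```

Under this reading, vote_1 is more demanding than pass_n: the correct grid must not only appear among the predictions, it must be the most frequent one, which would explain why the gap between the two approaches shrinks on vote_1.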
Conclusion
We have not been able to improve over the previous iteration: we only solve 5% of the evaluation tasks.
However, we have gained the following learnings:
- It is beneficial to train on multiple tasks, not just code-from-examples
- Using a temperature of 0.7 is beneficial
- There isn't a clear relation between training steps and model accuracy
- The choice of prompt does not seem to matter much
Next steps
- Try bigger models. If test-time fine-tuning is not necessary we might benefit from using bigger or coding-specialized models, e.g. (see the loading sketch after this list):
- https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct
- https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct
- Improve the omni-arc dataset:
- Add more tasks to increase coverage
- Add more training inputs to have more variability (can I reuse re-arc for this?)
- Add task variations
- Add tasks to learn to use the primitives
- Does test-time fine-tuning help to generate better code?
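For the bigger-model next step above, a minimal sketch of loading one of the Qwen2.5-Coder checkpoints with transformers; this is untested here, and the real setup may need quantization or LoRA to fit the 7B model in memory:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Coder-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user",
     "content": "Write a python function that rotates a grid 90 degrees clockwise."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```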
TODO
- How do the training steps affect the accuracy? -> Run trainings with different training lengths, just using code data
- What is the best prompt? Is there any difference?
- Is it helpful to learn to do other tasks?
- What if I train on omni-arc just on the default task?