Iteration 37. Optimize code generation
10-10-2024
Goal
We have verified that we can solve ARC tasks by generating python code. Let's try to understand the dynamics of this new training and optimize the hyperparameters.
Motivation
Solving tasks with code works, but can I optimize and improve the accuracy of the model?
Development
Results
How do the training steps affect the accuracy?
We don't see a clear relation between the number of training steps and model accuracy for this version of omni-arc.
Is it helpful to learn to do other tasks?
The mean pass_n of the experiments that only learn one task is 3.1%, while for the experiments that learn multiple tasks it is 4.1%. So in this experiment learning multiple tasks is clearly beneficial.
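As a note on the metric, assuming pass_n is the usual pass@k estimate (a task counts as solved if at least one of the n generated programs is correct), a minimal sketch of how the mean value could be computed with the standard unbiased estimator; the sample counts below are hypothetical and only for illustration:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimator: probability that at least one of
    k samples, drawn without replacement from n generations of which c are
    correct, solves the task."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: 32 programs generated per task, and the per-task
# counts of correct programs (assumption, not the real experiment data).
correct_counts = [0, 0, 1, 0, 3, 0, 0, 0, 2, 0]
n, k = 32, 8
mean_pass_k = sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
print(f"mean pass_{k}: {mean_pass_k:.1%}")
```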
Is it helpful to use a temperature different than 0?
There is a lot of uncertainty in the results, so the best way to study the trend is to compute the mean value across all the experiments.
The improvement is not huge, but on average we get better results when using a temperature of 0.7.
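For context, a minimal sketch of what the two decoding settings look like, assuming a Hugging Face transformers generation setup (the actual inference stack is not specified here, and the model name is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder checkpoint, not necessarily the one used here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Write a python function that solves the following ARC task: ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# "Temperature 0": greedy decoding, a single deterministic candidate.
greedy = model.generate(**inputs, do_sample=False, max_new_tokens=256)

# Temperature 0.7: sampling produces diverse candidates, which is what
# makes generating several attempts per task (and thus pass_n) worthwhile.
sampled = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.7,
    num_return_sequences=8,
    max_new_tokens=256,
)
```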
Is there any difference between prompts?
| prompt_version | experiment 1 pass_n | experiment 2 pass_n |
|---|---|---|
| 0 | 0.035 | 0.0475 |
| 1 | 0.0425 | 0.0425 |
| 2 | 0.0325 | 0.0425 |
There isn't a clear winner.
What if I train on omni-arc just to create the output grid?
| experiment | pass_n | vote_2 | vote_1 |
|---|---|---|---|
| code-from-examples | 5.00% | 5.00% | 5.00% |
| output-from-examples | 9.25% | 7.50% | 6.13% |
In the best case we solve 5% of the evaluation tasks with the code approach. If we train on the same data but predict the grids directly, we get almost double the pass_n, although vote_1 is much closer.
So maybe the current performance is about as good as we can expect given the amount and quality of the data we have.
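Assuming vote_k means that a task counts as solved when the correct grid is among the k most-voted predictions, a minimal sketch of that aggregation (the function name is hypothetical):

```python
from collections import Counter

def solved_by_vote_k(predictions: list[str], correct: str, k: int) -> bool:
    """Return True if the correct grid is among the k most common predictions.

    `predictions` are serialized output grids (e.g. strings), one per
    generated program or per direct grid prediction.
    """
    most_voted = [grid for grid, _ in Counter(predictions).most_common(k)]
    return correct in most_voted

# Hypothetical example: 6 attempts, grid "A" is predicted 3 times.
preds = ["A", "B", "A", "C", "A", "B"]
print(solved_by_vote_k(preds, correct="A", k=1))  # True: "A" is the top vote
print(solved_by_vote_k(preds, correct="B", k=1))  # False: "B" is only second
print(solved_by_vote_k(preds, correct="B", k=2))  # True: "B" is within the top 2
```

Under this reading, vote_1 is more demanding than pass_n: the correct grid must not only appear among the predictions, it must be the most frequent one, which would explain why the gap between the two approaches shrinks on vote_1.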
Conclusion
We have not been able to improve over the previous iteration: we only solve 5% of the evaluation tasks.
However, we have gained the following learnings:
- It is beneficial to train on multiple tasks, not just code-from-examples
- Using a temperature of 0.7 is beneficial
- There isn't a clear relation between training steps and model accuracy
- The choice of prompt does not seem to matter much
Next steps
- Try bigger models. If test-time fine-tuning is not necessary we might benefit from using bigger or coding-specialized models, e.g. (see the loading sketch after this list):
- https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct
- https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct
- Improve the omni-arc dataset:
- Add more tasks to increase coverage
- Add more training inputs to have more variability (can I reuse re-arc for this?)
- Add task variations
- Add tasks to learn to use the primitives
- Does test-time fine-tuning help to generate better code?
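For the bigger-model next step above, a minimal sketch of loading one of the Qwen2.5-Coder checkpoints with transformers; this is untested here, and the real setup may need quantization or LoRA to fit the 7B model in memory:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Coder-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user",
     "content": "Write a python function that rotates a grid 90 degrees clockwise."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```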
TODO
- How do the training steps affect the accuracy? -> Run trainings with different training lengths, just using code data
- What is the best prompt? Is there any difference?
- Is it helpful to learn to do other tasks?
- What if I train on omni-arc just on the default task?