
Iteration 36. Solving evaluation tasks with code

09-10-2024

Goal

Can we solve the evaluation tasks by predicting code that implements the tasks?

Motivation

In Iteration 34 I trained models on omni-arc tasks. It was unclear whether the output-from-examples approach benefited from training the model on multiple tasks.

However, if I can predict Python code, that could be a game-changer, because the code can be verified against the train samples.
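As an illustration, the verification could be as simple as the sketch below (the function signature and sample format are assumptions, not the actual omni-arc interface): run the predicted function on every train sample and keep the code only if it reproduces every train output.

from typing import Callable, List

Grid = List[List[int]]

def is_valid_solution(predicted_fn: Callable[[Grid], Grid], train_samples: List[dict]) -> bool:
    """Keep the predicted code only if it reproduces every train output."""
    for sample in train_samples:
        try:
            if predicted_fn(sample['input']) != sample['output']:
                return False
        except Exception:
            # any runtime error in the generated code counts as a wrong prediction
            return False
    return True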

Development

The first models were trained with 100 training tasks, the second model with close to 150. Coverage of the training dataset is important because it is likely correlated with coverage of the evaluation and test datasets.

First steps with inference

Click to see bash commands
# baseline
python inference.py \
--model_path  /mnt/hdd0/Kaggle/arc24/models/20241006_omniarc_validation/02_omni-arc-400-code-from-examples-Qwen2.5-0.5B-Instruct_lr5e-5_14000steps_2gpus_8192msl/checkpoint-14000 \
--prompt_version code-from-examples-v0 \
--dataset_path /mnt/hdd0/Kaggle/arc24/data/arc-agi_evaluation_challenges.json \
--predictions_per_task 8 \
--output_filepath /mnt/hdd0/Kaggle/arc24/debug/first_predictions/checkpoint-14000/inference_evaluation_x008.json \
--verbose

python inference.py \
--model_path  /mnt/hdd0/Kaggle/arc24/models/20241006_omniarc_validation/02_omni-arc-400-code-from-examples-Qwen2.5-0.5B-Instruct_lr5e-5_14000steps_2gpus_8192msl/checkpoint-14000 \
--prompt_version code-from-examples-v0 \
--dataset_path /mnt/hdd0/Kaggle/arc24/data/arc-agi_evaluation_challenges.json \
--predictions_per_task 32 \
--output_filepath /mnt/hdd0/Kaggle/arc24/debug/first_predictions/checkpoint-14000/inference_evaluation_x032.json

python merge_lora.py --base_model_path /home/gbarbadillo/data/Qwen2.5-0.5B-Instruct --lora_path /mnt/hdd0/MEGA/projects/temp/20241006_omniarc_validation/05_omni-arc-400-code-from-examples-v1-Qwen2.5-0.5B-Instruct_lora128_lr1e-4_bs32_7000steps_2gpus_8192msl/checkpoint-7000 --output_path /home/gbarbadillo/data/Qwen2.5-0.5B-Instruct-omni-arc

python inference.py \
--model_path  /home/gbarbadillo/data/Qwen2.5-0.5B-Instruct-omni-arc \
--prompt_version code-from-examples-v1 \
--dataset_path /mnt/hdd0/Kaggle/arc24/data/arc-agi_evaluation_challenges.json \
--predictions_per_task 8 \
--output_filepath /mnt/hdd0/Kaggle/arc24/debug/second_model/checkpoint-7000/inference_evaluation_x008.json \
--verbose

python inference.py \
--model_path  /home/gbarbadillo/data/Qwen2.5-0.5B-Instruct-omni-arc \
--prompt_version code-from-examples-v1 \
--dataset_path /mnt/hdd0/Kaggle/arc24/data/arc-agi_evaluation_challenges.json \
--predictions_per_task 32 \
--output_filepath /mnt/hdd0/Kaggle/arc24/debug/second_model/checkpoint-7000/inference_evaluation_x032.json

python inference.py \
--model_path  /home/gbarbadillo/data/Qwen2.5-0.5B-Instruct-omni-arc \
--prompt_version code-from-examples-v1 \
--dataset_path /mnt/hdd0/Kaggle/arc24/data/arc-agi_evaluation_challenges.json \
--predictions_per_task 32 \
--temperature 0.5 \
--output_filepath /mnt/hdd0/Kaggle/arc24/debug/second_model/checkpoint-7000/inference_evaluation_x032_t5e-1.json

python inference.py \
--model_path  /home/gbarbadillo/data/Qwen2.5-0.5B-Instruct-omni-arc \
--prompt_version code-from-examples-v1 \
--dataset_path /mnt/hdd0/Kaggle/arc24/data/arc-agi_evaluation_challenges.json \
--predictions_per_task 32 \
--temperature 0.7 \
--output_filepath /mnt/hdd0/Kaggle/arc24/debug/second_model/checkpoint-7000/inference_evaluation_x032_t7e-1.json

python inference.py \
--model_path  /home/gbarbadillo/data/Qwen2.5-0.5B-Instruct-omni-arc \
--prompt_version code-from-examples-v1 \
--dataset_path /mnt/hdd0/Kaggle/arc24/data/arc-agi_evaluation_challenges.json \
--predictions_per_task 32 \
--temperature 0.9 \
--output_filepath /mnt/hdd0/Kaggle/arc24/debug/second_model/checkpoint-7000/inference_evaluation_x032_t9e-1.json

python inference.py \
--model_path  /home/gbarbadillo/data/Qwen2.5-0.5B-Instruct-omni-arc \
--prompt_version code-from-examples-v1 \
--dataset_path /mnt/hdd0/Kaggle/arc24/data/arc-agi_evaluation_challenges.json \
--predictions_per_task 32 \
--temperature 1 \
--output_filepath /mnt/hdd0/Kaggle/arc24/debug/second_model/checkpoint-7000/inference_evaluation_x032_t1.json

python inference.py \
--model_path  /home/gbarbadillo/data/Qwen2.5-0.5B-Instruct-omni-arc \
--prompt_version code-from-examples-v1 \
--dataset_path /mnt/hdd0/Kaggle/arc24/data/arc-agi_evaluation_challenges.json \
--predictions_per_task 128 \
--temperature 0.7 \
--output_filepath /mnt/hdd0/Kaggle/arc24/debug/second_model/checkpoint-7000/inference_evaluation_x128_t7e-1.json


python inference.py \
--model_path  /mnt/hdd0/MEGA/projects/temp/20241006_omniarc_validation/03_omni-arc-800-all-code-Qwen2.5-0.5B-Instruct_lr5e-5_26000steps_2gpus_8192msl/checkpoint-26000 \
--prompt_version code-from-examples-v0 \
--dataset_path /mnt/hdd0/Kaggle/arc24/data/arc-agi_evaluation_challenges.json \
--predictions_per_task 8 \
--temperature 0.7 \
--output_filepath /mnt/hdd0/Kaggle/arc24/debug/third_model/checkpoint-26000/inference_evaluation_x008_t7e-1.json

The model is generating valid Python code. I have to improve the inference script to check that the code is correct and to create the output, adding timeouts for safety.
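A possible sketch of that safe execution (not the final inference script; it assumes the model defines a function called `task`, which may not match the real omni-arc naming): run the generated code in a separate process and kill it if it exceeds a timeout.

import multiprocessing as mp

def _run_candidate(code, grid, queue):
    """Exec the generated code and apply its transformation to the input grid."""
    namespace = {}
    try:
        exec(code, namespace)                  # assumed: the model defines `task(grid)`
        queue.put(namespace['task'](grid))
    except Exception:
        queue.put(None)

def safe_execution(code, grid, timeout=5.0):
    """Run untrusted generated code in a subprocess with a hard timeout."""
    queue = mp.Queue()
    process = mp.Process(target=_run_candidate, args=(code, grid, queue))
    process.start()
    process.join(timeout)
    if process.is_alive():
        process.terminate()                    # infinite loop or too slow: discard the prediction
        process.join()
        return None
    return queue.get() if not queue.empty() else None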

Results

Preliminary results

I solve around 4% of the tasks from the evaluation dataset. All the returned predictions seem to be correct because they are validated against the train samples. When using temperature 0 there does not seem to be any favorable scaling law.

I made up to 132 predictions, but the accuracy improves very slowly. The output-from-examples approach had a very different dynamic.

| model | accuracy | pass_n | vote_1 | unanswered |
|-------|----------|--------|--------|------------|
| 1     | 2.41%    | 3.75%  | 3.75%  | 97.59%     |
| 2     | 3.00%    | 4.00%  | 4.00%  | 96.94%     |
| 3     | 2.98%    | 4.75%  | 4.75%  | 97.02%     |

In the best case we are able to solve close to 5% of the evaluation tasks. The most relevant aspect is that all the returned predictions are correct.
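For reference, the sketch below shows how metrics like these could be computed from per-task predictions, assuming one test output per task for simplicity. The definitions are my reading of the metrics (pass_n: at least one prediction is correct, vote_1: the most voted prediction is correct, unanswered: fraction of empty predictions); the actual evaluation script may define them differently.

from collections import Counter

def summarize(predictions, ground_truth):
    """predictions maps task_id -> list of predicted grids (None or empty for no answer)."""
    n_tasks = len(ground_truth)
    pass_n = vote_1 = correct = answered = total = 0
    for task_id, truth in ground_truth.items():
        all_preds = predictions.get(task_id, [])
        valid_preds = [p for p in all_preds if p]
        total += len(all_preds)
        answered += len(valid_preds)
        correct += sum(1 for p in valid_preds if p == truth)
        if any(p == truth for p in valid_preds):
            pass_n += 1
        if valid_preds:
            most_voted, _ = Counter(str(p) for p in valid_preds).most_common(1)[0]
            if most_voted == str(truth):
                vote_1 += 1
    return dict(
        accuracy=correct / max(total, 1),
        pass_n=pass_n / n_tasks,
        vote_1=vote_1 / n_tasks,
        unanswered=1 - answered / max(total, 1),
    )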

Token distribution of omni-arc code

[figure: token distribution of the omni-arc task implementations]

We can see that the code is much shorter than predicting the whole grid, which can take up to around 1000 tokens; the code is at most around 200 tokens, roughly 5 times smaller.
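If we wanted to reproduce this measurement, something like the following sketch would work, assuming the task implementations are available as Python source strings and using the same Qwen tokenizer as the trained models (the example implementation below is a placeholder, not an actual omni-arc task):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B-Instruct')

# assumed: one Python source string per omni-arc task implementation
task_implementations = [
    'def task(grid):\n    return [row[::-1] for row in grid]\n',
]
token_counts = [len(tokenizer(code)['input_ids']) for code in task_implementations]
print(f'min={min(token_counts)}, max={max(token_counts)}, '
      f'mean={sum(token_counts) / len(token_counts):.1f}')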

How does the method scale with the number of predictions?

[figure: bad scaling with the number of predictions]

These first models do not scale well with the number of predictions; the improvement is very slow.

As a reference, we can compare it to an experiment from Iteration 30.

[figure: 80000 training steps experiment from Iteration 30]

My hypothesis is that the dataset is small and the model has not learned correctly yet.
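One way to quantify the scaling more precisely would be to compute pass@k for different values of k from the sampled programs, using the standard unbiased pass@k estimator; a sketch below, assuming we know for each task how many of the n sampled programs are correct:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n sampled programs with c correct ones."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# example: a task where 2 of the 32 sampled programs are correct
print([round(pass_at_k(n=32, c=2, k=k), 3) for k in (1, 2, 8, 32)])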

First submissions

I have made a first submission with the model 20241006_omniarc_validation/05_omni-arc-400-code-from-examples-v1-Qwen2.5-0.5B-Instruct_lora128_lr1e-4_bs32_7000steps_2gpus_8192msl/checkpoint-7000 and it solved 1 of the private test set tasks. It's a humble beginning, but if I can make it work, this could be game-changing.

I have also tried using test-time fine-tuning, but then it did not solve any of the tasks.

Conclusion

We have been able to solve new tasks by generating Python code. The first evaluation solves close to 5% of the evaluation tasks. The most relevant thing is that all the predictions are either correct or empty, which makes this approach very well suited for ensembling.
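As an illustration, the ensembling could look like the sketch below (the merge logic and file paths are assumptions, not the actual submission pipeline): prefer the code-based predictions because they are verified against the train samples, and fall back to the output-from-examples predictions when no code passed verification.

import json

def ensemble(code_predictions_path, fallback_predictions_path, output_path):
    """Prefer verified code-based predictions, fill the gaps with the fallback model."""
    with open(code_predictions_path) as f:
        code_predictions = json.load(f)
    with open(fallback_predictions_path) as f:
        fallback_predictions = json.load(f)
    merged = {}
    for task_id, fallback in fallback_predictions.items():
        candidate = code_predictions.get(task_id)
        merged[task_id] = candidate if candidate else fallback
    with open(output_path, 'w') as f:
        json.dump(merged, f)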

Next steps

  • If the code approach does not pay off, we could try to train a model to verify the solutions: given two candidate solutions to a problem, select the correct one.

TODO

  • How to execute the code safely and with timeouts? Check the AIMO competition. This should be added to the omni-arc repo, because all the DSL functions are there.
  • How does the method scale with compute? Validation should allow it to scale well.
  • Verify that everything is working fine by looking at the predictions. -> It is working fine, but the model is pretty bad at guessing the transformation. Probably more variability in the inputs is needed.
  • What is the token distribution of the functions that implement the training tasks?
  • Fix problems with evaluation metrics
  • Fix problem with inference returning an error code