
Iteration 34. Developing Omni-ARC

29-09-2024

Goal

Implement a first version of Omni-ARC and validate whether the approach is promising:

  • Can we improve the generalization of the current approach by learning to predict code that solves the tasks, similar to the improvement we got when learning the input distribution?

Motivation

Development

The right level of abstraction

With the right level of abstraction, writing code to solve the training tasks is very easy and fast. I have implemented almost 100 training tasks in less than 2 days. With just a few very basic primitive functions, like detecting and drawing objects, it is possible to solve a lot of tasks.
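
As an illustration, a task implementation at this level of abstraction looks roughly like the sketch below. The primitives (a detect_objects helper built on scipy.ndimage.label) and the example task (recolor the largest object) are hypothetical, not the actual omni-arc code:

# Hypothetical sketch of an omni-arc style task solution.
# detect_objects and task_recolor_largest_object are illustrative, not the real API.
import numpy as np
from scipy.ndimage import label

def detect_objects(grid: np.ndarray) -> list[np.ndarray]:
    """Return a boolean mask for each connected group of non-zero cells."""
    labeled, n_objects = label(grid > 0)
    return [labeled == i for i in range(1, n_objects + 1)]

def task_recolor_largest_object(grid: np.ndarray) -> np.ndarray:
    """Example task: paint the largest object with color 2."""
    output = grid.copy()
    largest = max(detect_objects(grid), key=lambda mask: mask.sum())
    output[largest] = 2
    return output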

Repeated or very similar tasks

In the training set I have detected some tasks that are exactly the same, and others that are just variations of the same task. Maybe the dataset is not as big as I thought.

First version of Omni-ARC

I have implemented nearly 100 training tasks (25% of the tasks). I believe this is enough to run a training and see the effect it has on the model.

Local experiments

Let's do some quick training runs to verify that I can train using the omni-arc dataset.

Bash commands for the local experiments:
# baseline
python fine-tuning.py \
--model_path /home/gbarbadillo/data/Qwen2.5-0.5B \
--lora_r 32 \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20241005_omni-arc/01_baseline \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/train_rs7.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/val_rs7.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--max_steps 1000 \
--logging_steps 10 \
--random_seed 7 \
--batch_size 5 \
--learning_rate 4e-5 \
--verbose

# use omni-arc with output-from-examples-v1
python fine-tuning.py \
--model_path /home/gbarbadillo/data/Qwen2.5-0.5B \
--lora_r 32 \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20241005_omni-arc/02_omni-arc_output-from-examples-v1 \
--train_datasets omni-arc-100 output-from-examples-v1 \
--val_dataset omni-arc-100 output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--max_steps 1000 \
--logging_steps 10 \
--random_seed 7 \
--batch_size 5 \
--learning_rate 4e-5 \
--verbose

# code from examples
python fine-tuning.py \
--model_path /home/gbarbadillo/data/Qwen2.5-0.5B \
--lora_r 32 \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20241005_omni-arc/02_omni-arc_code-from-examples-v0 \
--train_datasets omni-arc-100 code-from-examples-v0 \
--val_dataset omni-arc-100 code-from-examples-v0 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--max_steps 1000 \
--logging_steps 10 \
--random_seed 7 \
--batch_size 5 \
--learning_rate 4e-5 \
--verbose

# output from code
python fine-tuning.py \
--model_path /home/gbarbadillo/data/Qwen2.5-0.5B \
--lora_r 32 \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20241005_omni-arc/02_omni-arc_output-from-code-v0 \
--train_datasets omni-arc-100 output-from-code-v0 \
--val_dataset omni-arc-100 output-from-code-v0 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--max_steps 1000 \
--logging_steps 10 \
--random_seed 7 \
--batch_size 5 \
--learning_rate 4e-5 \
--verbose

# all together
python fine-tuning.py \
--model_path /home/gbarbadillo/data/Qwen2.5-0.5B \
--lora_r 32 \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20241005_omni-arc/02_omni-arc_all \
--train_datasets omni-arc-100 output-from-code-v0 \
--train_datasets omni-arc-100 code-from-examples-v0 \
--train_datasets omni-arc-100 output-from-examples-v1 \
--train_datasets omni-arc-100 input-from-inputs-v0 \
--val_dataset omni-arc-100 output-from-code-v0 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--max_steps 1000 \
--logging_steps 10 \
--random_seed 7 \
--batch_size 5 \
--learning_rate 4e-5 \
--verbose

# prompt refinement
python fine-tuning.py \
--model_path /home/gbarbadillo/data/Qwen2.5-0.5B \
--lora_r 32 \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20241005_omni-arc/03_omni-arc_code-from-examples-v1 \
--train_datasets omni-arc-100 code-from-examples-v1 \
--val_dataset omni-arc-100 code-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--max_steps 10 \
--logging_steps 10 \
--random_seed 7 \
--batch_size 5 \
--learning_rate 4e-5 \
--verbose
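
For reference, each dataset task name above corresponds to a different prompt format. The sketches below only convey what each task asks the model to do; the exact prompt templates in the repository are different:

# Rough sketch of the prompt formats behind each training task.
# The wording is illustrative, not the real templates from the repository.
PROMPT_SKETCHES = {
    # original task: predict the output grid from input/output examples
    "output-from-examples": "Given these input/output examples and a new input, predict the output grid.",
    # new omni-arc task: write code that implements the transformation
    "code-from-examples": "Given these input/output examples, write python code that transforms the input grid into the output grid.",
    # new omni-arc task: execute code mentally
    "output-from-code": "Given this python code and an input grid, predict the output grid.",
    # task used to learn the distribution of the inputs
    "input-from-inputs": "Given these input grids from the same task, generate a new input grid.",
}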

Experiment design

We have two dimensions to test:

  1. Which tasks are useful?
  2. How should we weight the omni-arc dataset against all the other datasets?

The training duration should be increased proportionally to the number of new tasks.

The baseline training data has 1200 tasks.
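
As a back-of-the-envelope sketch of this scaling (the 1200 and 100 task counts come from above; the 10k baseline steps are an assumption for illustration, not the final training config):

# Back-of-the-envelope scaling of the training duration when adding omni-arc.
baseline_tasks = 1200   # tasks in the baseline training datasets
omni_arc_tasks = 100    # tasks currently implemented in omni-arc
baseline_steps = 10000  # assumed baseline training duration

scaled_steps = round(baseline_steps * (baseline_tasks + omni_arc_tasks) / baseline_tasks)
print(scaled_steps)  # 10833, i.e. roughly 8% more steps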

Problem with non-instruct models

I thought that I had solved the problem of never-ending predictions from non-instruct models by modifying the tokenizer at the beginning of the training. However, that is not the case, and I had to relaunch all the trainings using an instruct version of Qwen. I might revisit this in the future.
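
For context, a sketch of the kind of tokenizer modification in question: making sure every training sample ends with the eos token and using it for padding, so that the base model learns when to stop. This is an assumption about the approach, not the exact code from fine-tuning.py:

# Sketch of the kind of tokenizer tweak tried for the non-instruct (base) model.
# Not the exact code from fine-tuning.py.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/home/gbarbadillo/data/Qwen2.5-0.5B")
if tokenizer.pad_token is None:
    # base models sometimes lack a dedicated padding token
    tokenizer.pad_token = tokenizer.eos_token

def format_sample(prompt: str, response: str) -> str:
    # appending eos teaches the model to end its predictions
    return prompt + response + tokenizer.eos_token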

Had to update transformers to 4.45.1

I had to add scipy to the Docker image, and that updated the transformers library. When doing inference on my PC there was an error when loading the model, so I had to update transformers to 4.45.1.

It's possible that I will have to update the Kaggle environment as well.

Verified that the Kaggle and Github datasets are identical

Notebook
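
A minimal sketch of how such a check can be done, assuming both versions are single json files (the file names below are placeholders, not the real paths from the notebook):

# Compare the Kaggle and Github copies of the dataset by hashing the files.
import hashlib

def file_md5(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

print(file_md5("arc_dataset_kaggle.json"))
print(file_md5("arc_dataset_github.json"))
# identical hashes -> identical files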

Results

Local experiments training metrics

[Figure: training metrics from the local experiments]

I have only trained for 1k steps. It seems that the difficulty of the tasks, ordered from easier to harder, is:

  • code from examples
  • output from examples
  • output from code

Training speed was not affected by using omni-arc. In fact it was faster, but this could be due to the omni-arc tasks being smaller.

Validation results

[Figure: validation results]

It is unclear if training on new tasks like code-from-examples and output-from-code has a positive effect on the initial task of output-from-examples. We added an additional 100 training tasks to the initial 1200 tasks. The weight indicates how frequently we sample omni-arc versus the other datasets.
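
As an illustration of what the weight means, a conceptual sketch of weighted sampling between datasets (not the actual implementation in fine-tuning.py):

# Conceptual sketch of sampling training data according to a dataset weight.
import random

def sample_training_prompt(omni_arc_samples, other_samples, omni_arc_weight):
    """With probability omni_arc_weight draw a sample from omni-arc,
    otherwise draw from the rest of the training datasets."""
    if random.random() < omni_arc_weight:
        return random.choice(omni_arc_samples)
    return random.choice(other_samples)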

Conclusion

We have implemented a first version of omni-arc and trained multiple models with it. The effect on the initial task of output-from-examples is unclear.

Next steps

  • Solve the problem with never-ending predictions from non-instruct models. I would like to use the Qwen2.5 base model for the challenge.
  • Can we solve some of the evaluation tasks using generated code?

TODO

  • Implement first version of omni-arc
  • Add new prompt templates
  • Update fine-tuning script to support omni-arc dataset
  • Is training speed affected by using omni-arc? I believe generation is fast enough to be done in real time
  • Clone omni-arc repo in the cluster and add the path to the PYTHONPATH
  • Refine the prompts using ChatGPT
  • Experiment to see if learning three tasks is better than learning two. The baseline learns output-from-examples and input-from-inputs; the new experiment also learns code-from-examples. 10k steps per task.