Iteration 3. Ideal test-time training setup

07-04-2025

Goal

Update the architects' code so that training and inference can be run independently for each task.

Motivation

ARC tasks are independent, so when doing test-time training it is better to focus on each task individually instead of training on all the tasks at the same time. Knowledge transfer between different tasks should be very small, so fine-tuning a custom model for each task should be the best strategy.

I don't believe ARC can be solved with last year's ARC24 solution, but being able to do test-time training efficiently for each task is very likely part of this year's solution.

Development

What the ideal solution looks like

  1. We run 4 or 8 training processes in parallel, with batch size 1. Each training process would pick one remaining task, reset the PEFT model, train and save to disk.
  2. We run 8 inference processes. Each inference process would pick one remaining task, load the PEFT, run inference and save the results to disk. Ideally, each process would make as many predictions as possible in the available time.

The unknown is how to load and unload the PEFT model efficiently. Every delay associated with swapping the PEFT will be multiplied by 15 or 30 (depending on whether I use 8 or 4 processes). Loading the model from disk, compiling... I need to find a way to do this really fast, or better yet, avoid having to do it at all.
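
A minimal sketch of what one of these training workers could look like (all helper functions here are hypothetical placeholders, not the architects' actual code):

```python
# Minimal sketch of a training worker: claim a remaining task, train a fresh
# LoRA on it and save the adapter so an inference worker can pick it up.
# All helper functions are hypothetical placeholders, not the actual code.
from pathlib import Path

def training_worker(tasks_dir: Path, lora_dir: Path) -> None:
    base = load_base_model()                    # hypothetical: load the base model once
    for task_file in sorted(tasks_dir.glob("*.json")):
        adapter_dir = lora_dir / task_file.stem
        if adapter_dir.exists():
            continue                            # another process already claimed this task
        model = attach_fresh_lora(base)         # hypothetical: reset the PEFT adapter
        fine_tune_on_task(model, task_file, batch_size=1)  # hypothetical training loop
        save_adapter(model, adapter_dir)        # inference workers load this from disk
```

The inference workers would follow the same pattern, except that they load the saved adapter and write predictions to disk instead of training.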

Time per task

We have 12 hours to solve 240 tasks (the submission is evaluated on both test sets). If we parallelize the system with 4 runs, that gives us 12 minutes per task. So if doing inference for each task introduces an overhead of 1 minute per task, that still leaves 11 minutes per task. Even an inefficient solution that wastes 1 minute per task loading and compiling the model will have most of the time available for compute.
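
The arithmetic of that budget, just to make it explicit:

```python
# Time budget per task: 12 hours, 240 tasks, 4 parallel runs.
total_minutes = 12 * 60              # 720 minutes of submission time
tasks_per_run = 240 / 4              # 60 tasks handled by each parallel run
minutes_per_task = total_minutes / tasks_per_run
print(minutes_per_task)              # 12.0 minutes per task
overhead_minutes = 1                 # assumed cost of loading and compiling the model
print(minutes_per_task - overhead_minutes)  # 11.0 minutes left for compute
```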

Implementation

In this notebook I have prepared an implementation that uses locks to select the GPU and the task.
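
The core idea is roughly the following (a sketch using the filelock package; the number of GPUs, slots per GPU and the paths are illustrative, not the exact values from the notebook):

```python
# Sketch of lock-based GPU slot selection: each GPU exposes a few slots and a
# process grabs the first free one. Counts and paths are illustrative.
import os
import time
from filelock import FileLock, Timeout

def acquire_gpu_slot(n_gpus: int = 2, slots_per_gpu: int = 2,
                     lock_dir: str = "/tmp/gpu_locks") -> FileLock:
    os.makedirs(lock_dir, exist_ok=True)
    while True:
        for gpu in range(n_gpus):
            for slot in range(slots_per_gpu):
                lock = FileLock(os.path.join(lock_dir, f"gpu{gpu}_slot{slot}.lock"))
                try:
                    lock.acquire(timeout=0)      # non-blocking: skip if the slot is busy
                except Timeout:
                    continue
                # Must be set before CUDA is initialized in this process.
                os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu)
                return lock                      # keep the lock while training/inferring
        time.sleep(5)                            # all slots busy, retry later

# lock = acquire_gpu_slot()
# ... train or run inference on the selected GPU ...
# lock.release()
```

The same mechanism can be used to claim tasks: a process only works on a task if it manages to acquire the corresponding lock file.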

Loading the model for training could take around 20 s, and for inference around 14 s. In total that is a delay of around 30 s per task, or around 30 minutes over a 12 hour submission; we can afford that.

Base model in /dev/shm

I have tried copying the model to /dev/shm but did not observe any speedup. Probably the model is already cached after it is read for the first time. The model is slightly less than 4 GB. Notebook with experiments
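
The experiment itself is simple (a sketch; the model path is an example, not the exact one used):

```python
# Sketch of the /dev/shm experiment: copy the ~4 GB base model to shared memory
# and compare load times. The source path is an example.
import shutil
import time
from transformers import AutoModelForCausalLM

src = "/kaggle/input/base-model"   # example path to the base model
dst = "/dev/shm/base-model"
shutil.copytree(src, dst)

for path in (src, dst):
    start = time.time()
    model = AutoModelForCausalLM.from_pretrained(path)
    print(path, f"{time.time() - start:.1f}s")
    del model
# No speedup observed: after the first load the files are most likely already
# in the OS page cache, which is as fast as /dev/shm anyway.
```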

Batch size and training speed

| experiment | batch_size=1 | batch_size=2 | batch_size=4 |
|---|---|---|---|
| 2 shortest tasks, 4 epochs | 72s | 63s | 58s |
| 2 longest tasks, 1 epoch | 125s | 122s | 136s |

Clearly it pays to use a batch size of 1 if the gradient has enough information, because that allows updating the model more times.
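
The trade-off in numbers (a toy calculation; 8 samples per task is just an illustrative figure):

```python
# Toy calculation: optimizer updates per epoch as a function of batch size,
# for an illustrative task with 8 training samples.
import math

n_samples = 8  # illustrative, the real number depends on the task and the augmentations
for batch_size in (1, 2, 4):
    updates = math.ceil(n_samples / batch_size)
    print(f"batch_size={batch_size}: {updates} updates per epoch")
# batch_size=1 is somewhat slower per epoch (see table above) but performs
# 2-4x more model updates on the same data.
```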

Comparison with my solution for ARC24 challenge

In my ARC24 solution I could do 320 training steps for each task, using a model of just 0.5B parameters versus the current 7B parameters. Now, if I use 6 epochs, that would be just 48 training steps, so training is roughly 7 times shorter.

Increasing GPU usage

After looking at the plots of GPU usage, I have noticed that I could increase the number of slots per GPU for both training and inference.

| train GPU slots | inference GPU slots | mean GPU usage | max VRAM | training time (s) | inference time (s) |
|---|---|---|---|---|---|
| 1 | 2 | 89.5% | 51.4% | 7087 | 7360 |
| 2 | 2 | 93.9% | 50.2% | 6864 | 6748 |
| 2 | 3 | 95.0% | 76.6% | 6962 | 6902 |

The most reliable metric is mean GPU usage: we already know that inference time is not reliable, and there is some variability in training times due to the random assignment of tasks. Using 2 slots per GPU for training and 3 for inference should give a speedup of around 6%, which is 43 minutes of a 12 hour run. Not game-changing, but very welcome.

Link to full results

Results

Evaluation vs test set

In this notebook I have run the exact same setup that scored 10.17 on the leaderboard and took 9 hours to run.

If I run the exact same configuration on the evaluation set, it only takes 4 hours and scores 10.6 (I'm not sure what the architects' prints mean, because according to them the score is 8.7).

The difference in speed comes from the fact that during a submission the system is evaluated against both partitions of the test set, i.e. 240 tasks instead of 120. So I don't have to worry about my system timing out on the private test set, because it has already made predictions for it.

Training epochs

[Figure: effect of the number of training epochs on the score]

It seems that a small number of training epochs (6) is bad, but once we reach a certain number of epochs (8-10), increasing the training length further is not beneficial. Maybe I have to lower the learning rate when using a larger number of epochs?

Train Max Sequence Length

[Figure: effect of the maximum training sequence length on the score]

The tendency is not very clear, but the best results are obtained with 8192, which is the maximum training sequence length available for the current model.

Submission time increases slightly.

LoRA rank

[Figure: effect of the LoRA rank on the score]

It seems that using a bigger LoRA rank might be beneficial.

Uncertainty on the LB results

Let's submit the same configuration 5 times, just changing the random seed.

[Figure: LB scores of 5 submissions with the same configuration and different random seeds]

This was very surprising, because I wasn't expecting this level of variability. We can see a variation of up to 3.5 points for the same configuration just by changing the random seed.

This probably invalidates all the previous conclusions, because the differences in scores between experiments were not large enough to be conclusive.

Learning rate

Could I get better results with a lower learning rate and longer training?

[Figure: effect of the learning rate on the score]

Clearly the learning rate has a great influence on the score; I believe I should do a deeper study in a following iteration.

Inference parameters

I have done a few experiments with the number of predictions (n) and with min_prob, without conclusive results.

This experiment was done with an earlier notebook that used 8 folds for splitting the data; I changed the number of predictions from 8 to 16 with very little variation in the score.

| train epochs | n | min_prob | lora_rank | LB score |
|---|---|---|---|---|
| 6 | 2 | 0.17 | 32 | 6.67 |
| 6 | 1 | 0.17 | 32 | 6.25 |

This other experiment was done with the single-task fine-tuning setup. We modified min_prob, but the effect is unclear.

| train epochs | n | min_prob | lora_rank | LB score |
|---|---|---|---|---|
| 10 | 1 | 0.17 | 16 | 11.94 |
| 10 | 1 | 0.13 | 16 | 7.92 |
| 10 | 1 | 0.17 | 32 | 11.1 |
| 10 | 1 | 0.13 | 32 | 11.1 |

Finally, I have also done a sweep over min_prob when evaluating on the evaluation set.

| min_prob | eval score | runtime (h) |
|---|---|---|
| None | 11.4 | 5.5 |
| 0.35 | 10.1 | 3.33 |
| 0.25 | 9.3 | 3.66 |
| 0.17 | 10.1 | 4.1 |
| 0.10 | 10.7 | 5.5 |

Conclusion

In this iteration I have improved the LB score to 11.94 and was the first team to break the 10% barrier. However, I have noticed that LB scores vary by up to 3.5 points between submissions of the same configuration. Thus I believe I should stop making submissions and only come back when I have made progress on the evaluation set.

Next steps

TODO

  • How does GPU usage look when using batch size 1?
  • What if I copy the base model to /dev/shm?
  • Tune the submission hyperparameters. My intuition is that I should train as long as possible, and make just 8 predictions per task.
    • Lora rank
    • Number of training epochs (better to change epochs than the learning rate when possible)
    • Inference parameters (n and min_prob)
    • Learning rate
    • Uncertainty on the results (what if I change the random seed?)
    • Are the training samples correctly sorted? Maybe they are not optimal for single task training. The order is random.
  • Check the evaluation prints of the architects. They are different from normal scoring.
  • Make more evaluations on the evaluation set and compare them to the test set. I want to see the correlation between runtime and score.
  • What if I use 2 GPU slots for training? Currently just 40% of GPU memory is used.