Iteration 7. Optimize TTT on the evaluation set

10-05-2025

Goal

Optimize the hyperparameters of TTT (test-time training) on the evaluation set, and then make submissions to the test set.

Motivation

In previous iterations I have seen that there is variability in LB scores: differences of up to 3.5 points between submissions of the same configuration. My current best LB score is probably due to luck.

Previously I had the problem that evaluations took around 5 hours, and Kaggle only allows 15 hours per week on the 4xL4 machine. So at most I could run 3 evaluations per week, which is very limiting.

By analyzing all the evaluation-set runs I have found that only 22 tasks were ever solved. Thus I can evaluate on just those tasks instead of the whole 120, which should cut an evaluation from around 5 hours to roughly 1 hour. Moreover, I have linked Google Colab Plus to Kaggle and now I have 22 hours per week. This means I can now run around 22 evaluations per week, which opens the door to optimizing the hyperparameters on the evaluation set.
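As a sketch of the idea (the file name and task ids below are placeholders, not the real ones), the reduced evaluation set can be built by filtering the evaluation JSON down to the ever-solved tasks:

```python
import json

# Placeholder file name and task ids, for illustration only.
EVAL_PATH = "arc-agi_evaluation_challenges.json"
SOLVED_TASK_IDS = ["0a1d4ef5", "0b17323b"]  # ... up to the 22 ever-solved ids

with open(EVAL_PATH) as f:
    eval_tasks = json.load(f)

# Keep only the tasks that were solved at least once in past runs,
# cutting the evaluation from 120 tasks down to 22.
subset = {task_id: eval_tasks[task_id] for task_id in SOLVED_TASK_IDS}

with open("evaluation_subset.json", "w") as f:
    json.dump(subset, f)
```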

Development

Results

LoRA rank and learning rate

Google sheet

[Figure: evaluation scores for different LoRA ranks and learning rates]

The best learning rate seems to be around 2e-4, and there is no evidence that LoRA rank 32 is better than the smaller ranks.
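For reference, here is a minimal sketch of where these two hyperparameters enter the fine-tuning setup, assuming a Hugging Face causal LM with the peft library (the model name is a placeholder, not necessarily the one used in these experiments):

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder model name, for illustration only.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct", torch_dtype=torch.bfloat16)

# r is the LoRA rank swept in these experiments (4, 16, 32).
lora_config = LoraConfig(
    r=4,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# The swept learning rate (1e-4 vs 2e-4) goes into the optimizer.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
```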

Number of predictions

Let's study whether increasing the number of predictions has a significant effect on accuracy.

[Figure: evaluation scores for different numbers of predictions]

There is no evidence suggesting that using more than 8 predictions is beneficial. Using 2 or 4 predictions is clearly not enough. 8 seems to be the sweet spot.
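The exact aggregation code is not shown here; a minimal sketch, assuming the n sampled predictions per task are combined by voting over hashable grids:

```python
from collections import Counter

def most_voted_prediction(predictions):
    """Pick the most frequent grid among the sampled predictions.

    `predictions` is a list of grids encoded as tuples of tuples so they
    are hashable; ties are broken by first occurrence.
    """
    return Counter(predictions).most_common(1)[0][0]

# Illustrative only: with 8 samples the vote is usually stable, while
# with 2 or 4 samples a single bad sample can win.
samples = [((1, 0), (0, 1))] * 5 + [((0, 0), (0, 0))] * 3
print(most_voted_prediction(samples))  # ((1, 0), (0, 1))
```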

min_prob

[Figure: evaluation scores for different min_prob values]

For the first time we are scoring above 12 on the evaluation set. Despite the randomness of the scores, we see a clear trend of improvement when using lower values of min_prob. The drawback is that lower values require more execution time.

[Figure: runtime for different min_prob values]
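A sketch of the mechanism I assume min_prob controls: a depth-first search over generations that keeps every branch whose cumulative sequence probability stays above the threshold, which would explain both the extra answers found and the extra runtime at lower values. This is an assumption about the pipeline, not a verbatim excerpt from it:

```python
import torch

@torch.no_grad()
def dfs_sample(model, prompt_ids, min_prob=0.1, eos_id=2, max_len=512):
    """Depth-first generation keeping every branch whose cumulative
    probability stays above min_prob. Lower thresholds explore more
    branches: more candidate answers, at a higher runtime cost.

    Returns a list of (token_ids, sequence_prob) completed answers.
    """
    answers = []

    def expand(ids, prob):
        if ids[-1] == eos_id or len(ids) >= max_len:
            answers.append((ids, prob))
            return
        logits = model(torch.tensor([ids])).logits[0, -1]
        probs = torch.softmax(logits.float(), dim=-1)
        # Only follow tokens that keep the cumulative probability
        # above the threshold.
        for token_id in (probs * prob >= min_prob).nonzero().flatten().tolist():
            expand(ids + [token_id], prob * probs[token_id].item())

    expand(list(prompt_ids), 1.0)
    return answers
```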

max_seq_length

[Figure: evaluation scores for different max_seq_length values]

It seems that a max_seq_length that is too small is hurtful; above that, there is no clear relation between the parameter and the score. Thus I believe I should use the maximum possible value, to be able to solve tasks with big grids.
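A back-of-envelope calculation of why big grids need a large max_seq_length, assuming roughly one token per grid cell plus a separator per row (the real tokenization may differ, but the order of magnitude holds):

```python
def approx_task_tokens(n_train_pairs=4, rows=30, cols=30):
    """Rough prompt length in tokens for a single ARC task."""
    grid_tokens = rows * (cols + 1)            # cells plus a row separator
    pair_tokens = 2 * grid_tokens              # input grid + output grid
    return (n_train_pairs + 1) * pair_tokens   # train pairs + test pair

# A task with 4 train pairs of 30x30 grids needs ~9300 tokens, so a small
# max_seq_length truncates exactly the tasks with big grids.
print(approx_task_tokens())
```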

Repeatability

| configuration | mean eval score | n_experiments |
|---|---|---|
| epochs=10, n=1, lr=1e-4, r=16, min_prob=0.17 | 9.8 ± 0.7 | 13 |
| epochs=10, n=1, lr=2e-4, r=4, min_prob=0.17 | 10.4 ± 0.8 | 11 |
| epochs=10, n=1, lr=2e-4, r=4, min_prob=0.1 | 10.6 ± 0.7 | 11 |

The evidence is not very strong, but it seems that we have found a better configuration for the evaluation set than the previous one.

However, I have made submissions with the new configuration and I don't see a difference on the test set.

| configuration | mean test score | n_experiments |
|---|---|---|
| epochs=10, n=1, lr=1e-4, min_prob=0.17, r=32 | 10.5 ± 1.3 | 5 |
| epochs=10, n=1, lr=1e-4, min_prob=0.17, r=16 | 9.9 ± 1.7 | 5 |
| epochs=10, n=1, lr=2e-4, min_prob=0.17, r=4 | 10.3 ± 1.3 | 5 |

Link to Google Sheet
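The ± column in the tables above is presumably the uncertainty over the n experiments; a sketch of the computation, assuming it is the standard error of the mean (it could also be the standard deviation):

```python
import statistics

def summarize(scores):
    """Mean ± standard error of the mean over repeated evaluations."""
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / len(scores) ** 0.5
    return f"{mean:.1f} ± {sem:.1f}"

# Illustrative numbers only, not the real experiment scores.
print(summarize([9.5, 10.8, 10.1, 11.2, 10.4, 10.9, 9.8, 11.0, 10.3, 10.6]))
```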

Conclusion

We have optimized the hyperparameters for the evaluation set, but the improvements didn't transfer to the test set.

Next steps

  • It might be helpful to initialize the LoRA by pretraining it on the new ARC25 tasks. Last year I observed that it was better to start test-time training from a pretrained LoRA than from a freshly initialized one. A sketch of the difference is shown below.
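A sketch of the two options, assuming the peft library (the model name and adapter path are placeholders):

```python
from peft import LoraConfig, PeftModel, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder model name and adapter path, for illustration only.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Option A (current pipeline): a freshly initialized LoRA for each task.
model = get_peft_model(base, LoraConfig(r=4, task_type="CAUSAL_LM"))

# Option B (proposed): start test-time training from a LoRA that was first
# fine-tuned on the ARC25 training tasks, and keep training it.
# model = PeftModel.from_pretrained(
#     base, "lora-pretrained-on-arc25", is_trainable=True)
```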