Iteration 7. Optimize TTT on the evaluation set
10-05-2025
Goal
Optimize the hyperparameters of TTT on the evaluation set, and then make submissions to the test set.
Motivation
In previous iterations I have seen that LB scores are noisy: the same configuration can change by up to 3.5 points between submissions. My current best LB score is probably luck.
Previously, evaluations took around 5 hours and Kaggle only allows 15 hours per week on the 4xL4 machine, so I could do at most 3 evaluations per week, which is very limiting.
By analyzing all the evaluation-set runs I have found that only 22 tasks were ever solved. Thus I can evaluate on just those tasks instead of the whole 120 to speed up evaluation. Moreover, I have linked Google Colab Plus to Kaggle and now I have 22 hours per week. Together these changes mean I can now do around 22 evaluations per week, which opens the door to optimizing the hyperparameters on the evaluation set.
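Building the reduced subset takes only a few lines (a minimal sketch; the result-file layout is hypothetical, any per-task solved/unsolved record would do):

```python
# Minimal sketch: collect the tasks that were solved in at least one experiment.
# Assumes hypothetical result files, one JSON per experiment mapping task_id -> solved.
import json
from glob import glob

solved_at_least_once = set()
for path in glob("results/eval_*.json"):  # hypothetical layout
    with open(path) as f:
        results = json.load(f)            # e.g. {"0a1b2c3d": true, ...}
    solved_at_least_once.update(task_id for task_id, solved in results.items() if solved)

print(f"{len(solved_at_least_once)} tasks solved at least once")
with open("eval_subset.json", "w") as f:
    json.dump(sorted(solved_at_least_once), f)
```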
Development
Results
LoRA rank and learning rate
The best learning rate seems to be around 2e-4, and there is no evidence that LoRA rank 32 is better than smaller ranks.
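For reference, this is the kind of configuration being swept (a minimal sketch with the peft library; `target_modules` and `lora_alpha` are assumptions, not necessarily the exact values I use):

```python
# Sketch of the swept TTT fine-tuning knobs.
from peft import LoraConfig

lora_config = LoraConfig(
    r=4,                    # ranks tried in the sweep: 4, 16, 32
    lora_alpha=16,          # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
learning_rate = 2e-4        # the best value found
```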
Number of predictions
Let's study whether increasing the number of predictions has a significant effect on accuracy.
There is no evidence suggesting that using more than 8 predictions is beneficial. Using 2 or 4 predictions is clearly not enough. 8 seems to be the sweet spot.
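To illustrate why n matters, here is a frequency-voting selection of the 2 ARC attempts from n sampled predictions (a sketch; my actual selection may rank candidates by probability instead, but the intuition is the same: with too few samples the choice is noisy):

```python
# Pick the 2 most frequent predictions among n samples as the ARC attempts.
from collections import Counter

def select_attempts(predictions: list[str], n_attempts: int = 2) -> list[str]:
    """predictions: n decoded grids, serialized as strings for hashing."""
    return [grid for grid, _ in Counter(predictions).most_common(n_attempts)]

samples = ["A", "B", "A", "A", "C", "B", "A", "B"]  # toy stand-ins for grids
print(select_attempts(samples))  # ['A', 'B']
```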
min_prob
For the first time we are scoring above 12 on the evaluation set. Despite the randomness of the scores, there is a clear trend of improvement when using lower values of min_prob. The drawback is that lower values require more execution time.
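To make the trade-off concrete, here is how I think about min_prob (a minimal sketch, assuming generation explores every branch whose probability stays above the threshold; `next_token_probs` is a hypothetical stand-in for a model forward pass):

```python
# Depth-first exploration of completions, pruning branches below min_prob.
# Lowering min_prob keeps more branches alive: better coverage, more compute.
def dfs_generate(prefix, next_token_probs, min_prob, prob=1.0, eos_id=0):
    """Yield (tokens, probability) for every completion above min_prob."""
    for token, p in next_token_probs(prefix).items():
        branch_prob = prob * p
        if branch_prob < min_prob:  # this branch can only get less likely, prune it
            continue
        if token == eos_id:
            yield prefix + [token], branch_prob
        else:
            yield from dfs_generate(prefix + [token], next_token_probs,
                                    min_prob, branch_prob, eos_id)
```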
max_seq_length
It seems that a too-small max_seq_length is harmful; beyond that, there is no clear relation between max_seq_length and the score. Thus I believe I should use the maximum possible value, to be able to solve tasks with big grids.
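A back-of-the-envelope token count shows why a small max_seq_length truncates tasks with big grids (a sketch; the one-token-per-cell encoding is an assumption, real tokenizer overhead differs):

```python
# Rough estimate of the prompt length needed by an ARC task.
def estimate_task_tokens(task: dict) -> int:
    """task: ARC-style dict with 'train' and 'test' lists of {input, output} grids."""
    def grid_tokens(grid):
        rows, cols = len(grid), len(grid[0])
        return rows * cols + rows  # one token per cell plus one per row break (assumption)

    pairs = task["train"] + task["test"]
    return sum(grid_tokens(pair["input"]) + grid_tokens(pair.get("output", pair["input"]))
               for pair in pairs)

# A single 30x30 grid is already ~930 tokens under this encoding, so a few
# demonstration pairs easily exceed a small max_seq_length.
```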
Repeatability
| configuration | mean eval score | n_experiments |
|---|---|---|
| epochs=10, n=1, lr=1e-4, r=16, min_prob=0.17 | 9.8 ± 0.7 | 13 |
| epochs=10, n=1, lr=2e-4, r=4, min_prob=0.17 | 10.4 ± 0.8 | 11 |
| epochs=10, n=1, lr=2e-4, r=4, min_prob=0.1 | 10.6 ± 0.7 | 11 |
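If the ± column is read as the standard error of the mean (an assumption; it could also be the standard deviation), it can be reproduced like this:

```python
# Mean and standard error of the mean over repeated evaluation runs.
import statistics

def mean_sem(scores):
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / len(scores) ** 0.5  # assumes the ± is the SEM
    return mean, sem
```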
The evidence is not very strong, but it seems that we have found a better configuration for the evaluation set than the previous one.
However, I have made submissions with the new configuration and I don't see a difference on the test set.
| configuration | mean test score | n_experiments |
|---|---|---|
| epochs=10, n=1, lr=1e-4, min_prob=0.17, r=32 | 10.5 ± 1.3 | 5 |
| epochs=10, n=1, lr=1e-4, min_prob=0.17, r=16 | 9.9 ± 1.7 | 5 |
| epochs=10, n=1, lr=2e-4, min_prob=0.17, r=4 | 10.3 ± 1.3 | 5 |
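A quick significance check on the summary statistics is consistent with this; e.g. Welch's t-test between the r=32 and r=4 rows (a sketch; assuming the ± values are standard errors, so std = sem * sqrt(n)):

```python
# Welch's t-test between the r=32 and r=4 test-set rows, from summary stats.
from math import sqrt
from scipy.stats import ttest_ind_from_stats

n = 5
result = ttest_ind_from_stats(
    mean1=10.5, std1=1.3 * sqrt(n), nobs1=n,  # lr=1e-4, r=32 row
    mean2=10.3, std2=1.3 * sqrt(n), nobs2=n,  # lr=2e-4, r=4 row
    equal_var=False,                          # Welch's variant
)
print(result.pvalue)  # a large p-value: no evidence of a real difference
```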
Conclusion
We have optimized the hyperparameters for the evaluation set, but the improvements didn't transfer to the test set.
Next steps
- It might be helpful to initialize the LoRA with an adapter pretrained on the new ARC25 tasks (see the sketch below). Last year I observed that starting from a pretrained LoRA was better than starting from a fresh one.
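A minimal sketch of what that initialization could look like with peft (the base model name and checkpoint path are assumptions, for illustration only):

```python
# Start TTT from a LoRA adapter pretrained on ARC25 tasks instead of a fresh one.
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # assumption
model = PeftModel.from_pretrained(
    base_model,
    "checkpoints/lora-pretrained-on-arc25",  # hypothetical path
    is_trainable=True,                       # keep training the adapter during TTT
)
```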




