
Iteration 28. Optimal train duration and capacity

22-09-2024

Goal

What is the optimal train duration and capacity to improve the validation score? And for the leaderboard score?

Motivation

Previous experiments suggest that training for longer might improve generalization. I want to validate that on the evaluation set.

Results

Optimal train duration for full fine-tune

[Figure: metrics vs. number of training steps]

All metrics improve when training for longer; we don't see signs of a plateau in this experiment. Continual training might be a good option for this challenge.
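Since no plateau is visible, one practical way to continue training is to resume an existing run from its last checkpoint instead of restarting from scratch. Below is a minimal sketch of that idea, assuming a Hugging Face Trainer-style setup (illustrative only; the actual training script, paths and values may differ):

```python
# Sketch: extend a finished run by resuming from its last checkpoint.
# Assumes a Hugging Face Trainer-style setup; paths and values are illustrative.
from transformers import Trainer, TrainingArguments

def extend_training(model, train_dataset, output_dir: str, new_max_steps: int):
    """Resume from the latest checkpoint in output_dir and train until new_max_steps."""
    args = TrainingArguments(
        output_dir=output_dir,
        max_steps=new_max_steps,          # e.g. raise the budget from 80_000 to 160_000
        per_device_train_batch_size=16,
        save_steps=10_000,
    )
    trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
    # resume_from_checkpoint=True restores optimizer and scheduler state, so the
    # extra steps behave like one longer run rather than a fresh restart.
    trainer.train(resume_from_checkpoint=True)
```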

Does LoRA achieve better generalization?

These experiments were run for 40k steps with a batch size of 16.

experiment          accuracy   pass_32   vote_2    vote_1
full fine-tuning    12.25%     31.13%    22.85%    18.69%
LoRA 32             11.10%     30.25%    22.85%    19.07%
LoRA 128            12.73%     32.25%    22.47%    19.19%
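For reference, this is how I read the metric columns (my assumption, not spelled out in this note): accuracy is the fraction of correct individual predictions, pass_32 counts a task as solved if any of the 32 sampled predictions is correct, and vote_n counts it as solved if the correct answer is among the n most-voted predictions. A minimal sketch with hypothetical helper names:

```python
# Sketch of the pass_n / vote_n metrics as I interpret them; the function names
# and the serialized-prediction representation are hypothetical, not the actual eval code.
from collections import Counter

def pass_n_correct(predictions: list[str], ground_truth: str) -> bool:
    """Task counts as solved if any of the sampled predictions matches."""
    return ground_truth in predictions

def vote_n_correct(predictions: list[str], ground_truth: str, n: int) -> bool:
    """Task counts as solved if the ground truth is among the n most-voted predictions."""
    top_n = [pred for pred, _ in Counter(predictions).most_common(n)]
    return ground_truth in top_n
```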

There is no evidence that full fine-tuning generalizes better than LoRA. Moreover, considering that using a pretrained LoRA for test-time fine-tuning (TTFT) has so far given better results, it is likely that training a LoRA will also give better results.
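For context, I'm assuming the numbers in "LoRA 32" and "LoRA 128" refer to the adapter rank r. A minimal sketch of how such an adapter could be attached with Hugging Face PEFT (the base model name, target modules and lora_alpha below are illustrative assumptions, not the exact training config):

```python
# Sketch: attach a LoRA adapter with Hugging Face PEFT. The base model name,
# target modules and lora_alpha are illustrative assumptions, not the real config.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")  # placeholder base model

lora_config = LoraConfig(
    r=128,                 # 32 for the "LoRA 32" run, 128 for "LoRA 128"
    lora_alpha=64,         # assumed value
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # LoRA trains a small fraction of the parameters vs. a full fine-tune
```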

Conclusion

We have verified that training for longer improves generalization. No sign of a plateau was observed after training for 80k steps. Thus it is likely that, when training on two tasks, we could train for more than 160k steps and still see improvements.

So far there is no evidence that full fine-tuning is better than LoRA. I need to validate this on the leaderboard.

Next steps

  • Train new models for submission, both with full fine-tuning and with LoRA, and compare the results on the leaderboard (LB)

TODO

  • Determine the optimal number of training steps for a full-model fine-tune
  • Do the same for different LoRAs
  • Does the optimal number of training steps double when also learning to predict inputs from inputs?
  • Do the results hold for the leaderboard? How to account for the extra training data?