# Iteration 20. Bigger models
04-09-2024
## Goal

Study the effect of using bigger models.
## Motivation

It seems that using bigger models gives better results; I want to dig deeper into that trend.
## Development

I can fine-tune Qwen2-7B using 2xA6000 GPUs (80GB of VRAM). I tried fine-tuning Qwen2-72B on 8xA6000 GPUs, but it gave an OOM error. I could only fine-tune it when using int4 quantization, and even then it was terribly slow, around 700 s per batch.
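
For reference, below is a minimal sketch of the kind of int4 (QLoRA-style) fine-tuning setup described above, assuming Hugging Face `transformers`, `peft` and `bitsandbytes`. The LoRA hyperparameters, target modules and model variant are illustrative assumptions, not the exact values used in this experiment.

```python
# Minimal sketch of int4 (QLoRA-style) fine-tuning, assuming Hugging Face
# transformers + peft + bitsandbytes. Hyperparameters are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2-72B-Instruct"  # assumed variant, for illustration

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to int4 (nf4)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 on top of int4 weights
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                      # shard the model across all visible GPUs
)

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only the LoRA adapters get gradients
model.print_trainable_parameters()
```

Even with the base weights in int4, the forward and backward passes through a 72B model dominate the step time, which is consistent with the ~700 s per batch observed.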
## Results

### Train on ARC tasks

| model      | train steps | pass_2 |
|------------|-------------|--------|
| Qwen2-0.5B | 6000        | 7.10%  |
| Qwen2-1.5B | 6000        | 12.20% |
| Qwen2-7B   | 6000        | 20.40% |
Clearly, the bigger models generalize better for the same number of training steps.

It seems that accuracy improves linearly with the logarithm of the number of parameters. If the trend continues, fine-tuning a model the size of GPT-4 would give a pass_2 accuracy of around 50%.
| parameters (B) | pass_2 estimation |
|----------------|-------------------|
| 0.5            | 7.10%             |
| 1.5            | 12.58%            |
| 7              | 20.27%            |
| 72             | 31.91%            |
| 2000           | 48.50%            |
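
The estimates above are consistent with a simple least-squares fit of pass_2 against the logarithm of the parameter count. A minimal sketch of that extrapolation is shown below, assuming the fit is done on the three measured points; the exact numbers in the table may come from a slightly different fit.

```python
# Sketch of the log-linear extrapolation: fit pass_2 against log10(parameters)
# on the three measured points and extrapolate to bigger models.
import numpy as np

params_b = np.array([0.5, 1.5, 7.0])      # model size in billions of parameters
pass_2 = np.array([7.10, 12.20, 20.40])   # measured pass_2 (%) at 6000 train steps

slope, intercept = np.polyfit(np.log10(params_b), pass_2, 1)

for p in [0.5, 1.5, 7, 72, 2000]:
    estimate = slope * np.log10(p) + intercept
    print(f"{p:>6}B parameters -> estimated pass_2 = {estimate:.2f}%")
```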
More detailed results:
| model      | train steps | train loss | val loss | accuracy | correct_pixels | correct_size | pass_64 | unanswered | pass_2 |
|------------|-------------|------------|----------|----------|----------------|--------------|---------|------------|--------|
| Qwen2-0.5B | 6000        | 0.053      | 0.16     | 2.80%    | 66.30%         | 84.20%       | 18.50%  | 2.80%      | 7.10%  |
| Qwen2-1.5B | 6000        | 0.0302     | 0.154    | 5.30%    | 69.60%         | 87.50%       | 26.00%  | 2.80%      | 12.20% |
| Qwen2-7B   | 6000        | 0.0135     | 0.129    | 7.30%    | 71.30%         | 87.70%       | 27.00%  | 3.20%      | 20.40% |

| model      | train steps | train loss | val loss | accuracy | correct_pixels | correct_size | pass_64 | unanswered | pass_2 |
|------------|-------------|------------|----------|----------|----------------|--------------|---------|------------|--------|
| Qwen2-0.5B | 3000        | 0.075      | 0.157    | 1.60%    | 66.30%         | 85.10%       | 11.50%  | 3.40%      | 7.10%  |
| Qwen2-1.5B | 3000        | 0.0505     | 0.138    | 3.70%    | 68.70%         | 86.50%       | 19.50%  | 3.10%      | 11.20% |
| Qwen2-7B   | 3000        | 0.026      | 0.114    | 6.70%    | 70.60%         | 87.00%       | 32.00%  | 3.50%      | 17.30% |

### Test-time fine-tuning
A comparison was already done in the grid representation iteration.
## Conclusion

## Next steps

## TODO
- Can I make a submission with Qwen2-7B? Could I do test-time fine-tuning using quantization?
- Would Qwen2-7B improve if trained for 12k or 24k steps?
- Are there competitive models in the 14B-30B range to try?