# Iteration 29. Qwen 2.5
25-09-2024
## Goal
A new release of Qwen was announced yesterday: Qwen 2.5. Does it improve the accuracy on ARC?
## Motivation
Simply swapping the model might bring improvements for free!
## Development
- https://qwenlm.github.io/blog/qwen2.5-llm/
- https://qwenlm.github.io/blog/qwen2.5-llm/#qwen25-05b15b3b-performance
- https://qwenlm.github.io/blog/qwen2.5-llm/#qwen25-05b15b-instruct-performance
The size of the pre-training dataset is expanded from 7 trillion tokens to a maximum of 18 trillion tokens.
When looking at the benchmarks we see a noticeable improvement between Qwen 2 and Qwen 2.5.
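As a reference, this is a minimal sketch of what the swap looks like with the `transformers` library. The Hugging Face model ids come from the Qwen 2.5 release; the surrounding training code is assumed to take the model id as a parameter.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Swapping the base model should only require changing the model id.
# "Qwen/Qwen2.5-0.5B-Instruct" replaces the previous "Qwen/Qwen2-0.5B-Instruct".
model_id = "Qwen/Qwen2.5-0.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
```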
## Results
Validation results when training different models for 10k steps:

| model | accuracy | pass_32 | vote_2 |
|---|---|---|---|
| Qwen2-0.5B | 8.24% | 26.50% | 15.91% |
| Qwen2-0.5B-Instruct | 8.25% | 26.75% | 15.91% |
| Qwen2.5-0.5B | 9.37% | 26.75% | 18.31% |
| Qwen2.5-0.5B-Instruct | 8.98% | 26.00% | 17.93% |
Both versions of Qwen2.5 achieve better results than Qwen2-0.5B-Instruct.
### Why are non-instruct models slower?
Inference with Qwen2.5-0.5B took 4920.6132 seconds, compared to the typical 1809.6193 seconds. Why?
Inspecting the responses, I found that the non-instruct versions repeat the prediction multiple times, e.g.:
<|im_start|>assistant
### Output
```grid shape: 6x6
1 595959
2 181818
3 959595
4 818181
5 595959
6 181818
```
Assistant
### Output
```grid shape: 6x6
1 595959
2 181818
3 959595
4 818181
5 595959
6 181818
```
Assistant
...
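Until there is a proper fix, one possible workaround (not in the current code, just a sketch) is to cut the response at the start of the first repeated block. The marker string below is an assumption based on the example above.

```python
def keep_first_prediction(response: str, marker: str = "\nAssistant") -> str:
    """Keep only the first predicted grid by cutting the response at the
    start of the first repeated block. The marker is an assumed delimiter."""
    cut = response.find(marker)
    return response if cut == -1 else response[:cut]
```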
## Conclusion
We have observed improvements when replacing Qwen2 with Qwen2.5. The most promising model is the non-instruct version, but there is a problem at inference: it does not stop predicting. Until that problem is solved I will use Qwen2.5-0.5B-Instruct.
## Next steps

## TODO
- Do the same experiment, just changing the base model, and compare the validation results
- Could I use Qwen2.5-0.5B and avoid repetitions in the prediction?
  - Check the training data
  - Check the tokenizer
  - Local experiments
  - Maybe adding a new stop word for VLLM could be a quick fix (see the sketch below)
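A hedged sketch of that quick fix, assuming generation is done with vLLM's offline `LLM` API. The model id and the exact stop strings are guesses based on the repeated output above, not verified values.

```python
from vllm import LLM, SamplingParams

# Load the non-instruct model (Hugging Face id from the Qwen 2.5 release).
llm = LLM(model="Qwen/Qwen2.5-0.5B")

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=512,
    # Stop generation when the model starts repeating the answer or emits
    # the chat end token. These stop strings are assumptions taken from the
    # repeated output shown above.
    stop=["<|im_end|>", "\nAssistant"],
)

outputs = llm.generate(["<task prompt>"], sampling_params)
print(outputs[0].outputs[0].text)
```

If the stop strings are right, the tokens spent on the repeated blocks would also disappear, which should bring the inference time of the non-instruct model back in line with the instruct one.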