# Iteration 29. Qwen 2.5
25-09-2024
## Goal
A new release of Qwen was announced yesterday: Qwen 2.5. Does it improve the accuracy on ARC?
## Motivation
Simply swapping the model might bring improvements for free!
## Development
- https://qwenlm.github.io/blog/qwen2.5-llm/
- https://qwenlm.github.io/blog/qwen2.5-llm/#qwen25-05b15b3b-performance
- https://qwenlm.github.io/blog/qwen2.5-llm/#qwen25-05b15b-instruct-performance
The size of the pre-training dataset is expanded from 7 trillion tokens to a maximum of 18 trillion tokens.
When looking at the benchmarks we see a noticeable improvement between Qwen 2 and Qwen 2.5.
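As a reference, this is a minimal sketch of what the swap looks like with the `transformers` library. The Hugging Face model ids come from the Qwen 2.5 release; the surrounding training code is assumed to take the model id as a parameter.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Swapping the base model should only require changing the model id.
# "Qwen/Qwen2.5-0.5B-Instruct" replaces the previous "Qwen/Qwen2-0.5B-Instruct".
model_id = "Qwen/Qwen2.5-0.5B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
```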
## Results
Validation results when training different models for 10k steps:

| model | accuracy | pass_32 | vote_2 |
|---|---|---|---|
| Qwen2-0.5B | 8.24% | 26.50% | 15.91% |
| Qwen2-0.5B-Instruct | 8.25% | 26.75% | 15.91% |
| Qwen2.5-0.5B | 9.37% | 26.75% | 18.31% |
| Qwen2.5-0.5B-Instruct | 8.98% | 26.00% | 17.93% |
Both versions of Qwen2.5 achieve better results than Qwen2-0.5B-Instruct.
### Why are non-instruct models slower?
Inference with Qwen2.5-0.5B took 4920.6132 seconds, compared to the typical 1809.6193 seconds. Why?
Inspecting the responses, I found that the non-instruct versions repeat the prediction multiple times, e.g.:
<|im_start|>assistant
### Output
```grid shape: 6x6
1 595959
2 181818
3 959595
4 818181
5 595959
6 181818
```
Assistant
### Output
```grid shape: 6x6
1 595959
2 181818
3 959595
4 818181
5 595959
6 181818
```
Assistant
...
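Until there is a proper fix, one possible workaround (not in the current code, just a sketch) is to cut the response at the start of the first repeated block. The marker string below is an assumption based on the example above.

```python
def keep_first_prediction(response: str, marker: str = "\nAssistant") -> str:
    """Keep only the first predicted grid by cutting the response at the
    start of the first repeated block. The marker is an assumed delimiter."""
    cut = response.find(marker)
    return response if cut == -1 else response[:cut]
```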
## Conclusion
We have observed improvements when replacing Qwen2 with Qwen2.5. The most promising model is the non-instruct version, but there is a problem at inference: it does not stop predicting. Until that problem is solved I will use Qwen2.5-0.5B-Instruct.
## Next steps

## TODO
- Do the same experiment, just changing the base model, and compare the validation results
- Could I use Qwen2.5-0.5B and avoid repetitions in the prediction?
  - Check the training data
  - Check the tokenizer
  - Local experiments
  - Maybe adding a new stop word for VLLM could be a quick fix (see the sketch below)
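A hedged sketch of that quick fix, assuming generation is done with vLLM's offline `LLM` API. The model id and the exact stop strings are guesses based on the repeated output above, not verified values.

```python
from vllm import LLM, SamplingParams

# Load the non-instruct model (Hugging Face id from the Qwen 2.5 release).
llm = LLM(model="Qwen/Qwen2.5-0.5B")

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=512,
    # Stop generation when the model starts repeating the answer or emits
    # the chat end token. These stop strings are assumptions taken from the
    # repeated output shown above.
    stop=["<|im_end|>", "\nAssistant"],
)

outputs = llm.generate(["<task prompt>"], sampling_params)
print(outputs[0].outputs[0].text)
```

If the stop strings are right, the tokens spent on the repeated blocks would also disappear, which should bring the inference time of the non-instruct model back in line with the instruct one.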