Iteration 40. Try coding models

12-10-2024

Goal

Can we improve the results of code generation by using models that are specialized in code?

Motivation

I have the intuition that using a model that has been specialized in coding can give better results than a simple instruct model.

Development

Qwen already has coding models, so it would be very convenient if I could use those models:

  • https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct
  • https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct

Analyze tokenizers

Both coder models use the same tokenizer as the Qwen non-coder models, so I don't have to use a different grid encoder. I can run the exact same training with these models. The only difference might be the VRAM requirements.
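
A quick way to verify this is to load both tokenizers and compare their vocabularies and special tokens. A minimal sketch using the transformers library (not part of the training code) could look like this:

```python
# Minimal sketch: check that the coder and non-coder Qwen tokenizers are identical,
# so the same grid encoder can be reused without changes.
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
coder = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-1.5B-Instruct")

print(base.get_vocab() == coder.get_vocab())          # same vocabulary?
print(base.all_special_tokens == coder.all_special_tokens)  # same special tokens?
print(base.encode("0123456789") == coder.encode("0123456789"))  # digits tokenized identically?
```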

VRAM requirements

When training on A6000 GPUs, which have 48GB of VRAM, I can use a max_seq_len of 6144 with the 1.5B model. If I train the 7B model on 2 GPUs, the max_seq_len has to be 4096 or I get OOM errors.

I have checked the previous trainings with Llama-3.1-8B, and there I also used a max_seq_len of 4096.

If I had access to a GPU with 80GB of VRAM, I could increase the training context length.
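
For reference, the context lengths that fit in memory in these experiments can be summarized as follows (a plain summary, not the project's actual configuration format):

```python
# Maximum context length that fit in memory on 48GB A6000 GPUs in these runs.
MAX_SEQ_LEN = {
    "Qwen/Qwen2.5-Coder-1.5B-Instruct": 6144,  # single A6000
    "Qwen/Qwen2.5-Coder-7B-Instruct": 4096,    # 2x A6000, OOM above this
    "meta-llama/Llama-3.1-8B": 4096,           # same limit used in previous trainings
}
```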

Omni-arc state

All these experiments were done with omni-arc at commit 589b6695d56ad2dbd7f37c78e9923cdec646ef54. There were around 150 implemented training tasks.

Results

Training speed

A training run of 2k steps with batch size 32 takes:

  • 16000s (4.4h) for Qwen2.5-0.5B-Instruct
  • 21000s (5.8h) for Qwen2.5-Coder-1.5B-Instruct
  • 60000s (16.7h) for Qwen2.5-Coder-7B-Instruct

The 7B model is much slower, but the 1.5B model trains at an acceptable speed. Maybe the GPU is not being used at its maximum capacity when training the 0.5B model.
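
As a quick sanity check of these timings, the per-step cost works out as follows (plain arithmetic on the numbers above):

```python
# Seconds per optimization step for a 2k-step run with batch size 32.
train_seconds = {
    "Qwen2.5-0.5B-Instruct": 16_000,
    "Qwen2.5-Coder-1.5B-Instruct": 21_000,
    "Qwen2.5-Coder-7B-Instruct": 60_000,
}
steps = 2_000
for model, seconds in train_seconds.items():
    print(f"{model}: {seconds / steps:.1f} s/step, {seconds / 3600:.1f} h total")
# Qwen2.5-0.5B-Instruct: 8.0 s/step, 4.4 h total
# Qwen2.5-Coder-1.5B-Instruct: 10.5 s/step, 5.8 h total
# Qwen2.5-Coder-7B-Instruct: 30.0 s/step, 16.7 h total
```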

Inference speed

  • 300 seconds for Qwen2.5-0.5B-Instruct
  • 504 seconds for Qwen2.5-Coder-1.5B-Instruct
  • 1311.5 seconds for Qwen2.5-Coder-7B-Instruct
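
Relative to the 0.5B baseline, the inference cost grows as follows (again just arithmetic on the measurements above):

```python
# Relative inference cost of the coder models versus the 0.5B baseline.
inference_seconds = {
    "Qwen2.5-0.5B-Instruct": 300,
    "Qwen2.5-Coder-1.5B-Instruct": 504,
    "Qwen2.5-Coder-7B-Instruct": 1311.5,
}
baseline = inference_seconds["Qwen2.5-0.5B-Instruct"]
for model, seconds in inference_seconds.items():
    print(f"{model}: {seconds / baseline:.1f}x the baseline inference time")
# Qwen2.5-0.5B-Instruct: 1.0x the baseline inference time
# Qwen2.5-Coder-1.5B-Instruct: 1.7x the baseline inference time
# Qwen2.5-Coder-7B-Instruct: 4.4x the baseline inference time
```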

Conclusion

Next steps

TODO

  • Can we beat the baseline 5% accuracy of Qwen2.5-0.5B-Instruct with coder models?
  • Inference and training speed comparison of the models
  • What if I decrease the model size or the LoRA rank?
  • How well do these models scale with the number of predictions?