
Iteration 8. Fine-tuning

13/06/2024

Goal

Can I improve the LB score by fine-tuning DeepSeekMath on MATH problems?

Motivation

The best team is getting a score of 27 on the leaderboard, while my luckiest submission only scores 22. I have already run many experiments with the DeepSeekMath-RL model, so maybe their advantage is that they have fine-tuned the model on high-school problems.

Development

Direct Preference Optimization

My idea is to use DPO to teach the model to solve the math problems better. I have evaluated the MATH dataset many times, so for each problem I already have many good and bad answers. I can use that data to fine-tune the model.

Then I will evaluate on the test set. I should see a big improvement since I will be training and evaluating on the same problems, but that will at least validate that the training has worked.

The real test will be the leaderboard. If I see improvements on the leaderboard, then the next step would be to gather new data to evaluate and later fine-tune the model.

How to train with DPO?

I have to create a dataset with the fields prompt, chosen and rejected. The prompt does not need to be repeated in the chosen and rejected answers.

Source code
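
As a reference, below is a minimal sketch of what DPO training could look like with Hugging Face TRL. The model name, file name, LoRA settings and hyperparameters are illustrative assumptions, not the exact values used in these experiments, and argument names may differ between TRL versions.

```python
# Minimal sketch of DPO fine-tuning with Hugging Face TRL (illustrative only).
# "dpo_pairs.json" and all hyperparameters are assumptions, not the real config.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "deepseek-ai/deepseek-math-7b-rl"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# The dataset must expose the three fields: prompt, chosen, rejected.
dataset = load_dataset("json", data_files="dpo_pairs.json", split="train")

peft_config = LoraConfig(r=16, lora_alpha=32,
                         target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
args = DPOConfig(output_dir="dpo-deepseekmath", beta=0.1,
                 per_device_train_batch_size=1, num_train_epochs=1,
                 learning_rate=1e-5)

trainer = DPOTrainer(model=model, args=args, train_dataset=dataset,
                     tokenizer=tokenizer, peft_config=peft_config)
trainer.train()
```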

Dataset for DPO

  • 509 MATH test level 5 problems
  • 10661 pairs of good and bad responses
  • Max prompt length: 296
  • Max length: 937
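
Below is a rough sketch of how the good and bad responses from previous evaluations could be turned into such pairs. The field names (good_responses, bad_responses) and the output file are assumptions about the data layout, not the actual format used.

```python
# Hypothetical sketch of turning previously evaluated responses into DPO pairs.
# The good_responses/bad_responses fields are assumed, not the real data format.
import itertools
import json

def build_dpo_pairs(problems, max_pairs_per_problem=25):
    pairs = []
    for problem in problems:
        # Pair every good response with every bad one, capped per problem.
        combos = itertools.product(problem["good_responses"],
                                   problem["bad_responses"])
        for chosen, rejected in itertools.islice(combos, max_pairs_per_problem):
            pairs.append({
                "prompt": problem["prompt"],
                "chosen": chosen,
                "rejected": rejected,
            })
    return pairs

if __name__ == "__main__":
    problems = [{
        "prompt": "Solve the following problem...",
        "good_responses": ["Correct solution A", "Correct solution B"],
        "bad_responses": ["Wrong solution C"],
    }]
    with open("dpo_pairs.json", "w") as f:
        json.dump(build_dpo_pairs(problems), f)
```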

Results

Validation results

I haven't been able to improve the validation accuracy consistently, despite training and evaluating on the same dataset. I have seen improvements, but they are too small considering the model was trained on those very problems.

| experiment              | runtime (min) | Accuracy maj | Accuracy pass |
|--------------------------|---------------|--------------|---------------|
| baseline                 | 134           | 46%          | 58%           |
| 01_first_steps           | 172           | 48%          | 54%           |
| 02_shuffle_train_set     | 163           | 53%          | 59%           |
| 03_4_epochs              | 190           | 47%          | 50%           |
| 04_4_epochs_constant_lr  | 183           | 52%          | 56%           |
| 06_v1_dataset            | 189           | 55%          | 59%           |
| 07_v2_dataset            | 182           | 49%          | 53%           |
  • On the v1 version of the dataset I moved the start of the Python code block to the prompt (see the sketch after this list).
  • On the v2 version I used the same number of training pairs for each problem and increased the dataset size to 25k pairs.
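
To illustrate the v1 change, here is a hypothetical pair where the prompt ends with the opening of the Python code block, so both completions start directly with code; the problem text is made up.

```python
# Hypothetical v1-style pair: the prompt already opens the code block,
# so chosen/rejected only contain the continuation (made-up example).
pair = {
    "prompt": "Problem: What is $2+2$?\nSolution:\n```python\n",
    "chosen": "result = 2 + 2\nprint(result)\n```\nThe answer is 4.",
    "rejected": "result = 2 * 2 + 1\nprint(result)\n```\nThe answer is 5.",
}
```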

To verify that the model was learning, I added a silly comment to the Python code in the training responses, but it was only generated at inference in 18% of the responses, despite being such an easy pattern to learn.

Conclusion

I haven't been able to successfully fine-tune the DeepSeekMath model. I haven't made a submission because I did not get good results on validation.

Next steps

TODO

  • Notebook to create the train dataset, which will make the training notebook shorter.
  • How to train the model using DPO and LoRA?
  • How to make inference with vLLM and LoRA?
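
For the last point, a possible starting point is vLLM's built-in LoRA support. A minimal sketch is shown below; the model name and adapter path are placeholders, not the actual files from this iteration.

```python
# Sketch of vLLM inference with a LoRA adapter (paths are placeholders).
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="deepseek-ai/deepseek-math-7b-rl", enable_lora=True)
sampling_params = SamplingParams(temperature=0.0, max_tokens=512)

outputs = llm.generate(
    ["Solve the following problem..."],
    sampling_params,
    lora_request=LoRARequest("dpo_adapter", 1, "path/to/lora_adapter"),
)
print(outputs[0].outputs[0].text)
```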

Last update: 2024-06-17