
Iteration 13. Data-centric approach around Mistral 7B

11-04-2024

Goal

What is the highest LB score I can get with Mistral 7B?

Motivation

Training with Mistral is much faster than training with Mixtral, and the leaderboard results are the same.

In this iteration I'm going to change the training data and try to get the best possible LB score.

This is a continuation of iterations 8 and 10, but focusing on Mistral and the data.

Development

Explore public datasets

| dataset | rows | unique prompts | ratio (unique/rows) | median tokens |
|---|---|---|---|---|
| dipamc77_prompts_0_500_wiki_first_para_3000_curated | 2872 | 494 | 0.17 | 215 |
| gemma_suppl_rewrite_curated | 298 | 189 | 0.63 | 171 |
| nbroad-v1_curated | 2162 | 109 | 0.05 | 1120 |
| nbroad-v2_curated | 2400 | 2400 | 1.00 | 1135 |
| winddude_70k_gemma_template_built_curated | 69487 | 61947 | 0.89 | 666 |
| aishaalmahmoud/llm_dataset_1_curated | 9842 | 839 | 0.09 | 197 |
| aishaalmahmoud/llm_dataset_20k_curated | 16620 | 860 | 0.05 | 198 |
| alexxxsem/data_gemma_0_1000_curated | 994 | 41 | 0.04 | 174 |
| galileo/gemma_v1_7b-it_curated | 26160 | 1469 | 0.06 | 427 |
| newtonbaba/gemini_data_set_prompt_recover_3_curated | 1802 | 1778 | 0.99 | 198 |
| newtonbaba/gemma_data_set_prompt_recover_1_curated | 994 | 766 | 0.77 | 418 |
| newtonbaba/gemma_data_set_prompt_recover_2_curated | 1530 | 1530 | 1.00 | 604 |
  • Except for nbroad, the rest of the datasets have a reasonable median number of tokens
  • I'm going to train on all the datasets and see what the leaderboard and validation scores are. If I group the datasets by creator, that would be 8 experiments. (The table stats can be reproduced as sketched below.)
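
The table stats can be reproduced with a short script. This is a minimal sketch, assuming each curated dataset is a CSV file with `prompt` and `rewritten_text` columns (hypothetical names) and counting tokens with the Mistral 7B tokenizer.

```python
import pandas as pd
from transformers import AutoTokenizer

# Assumptions: hypothetical 'prompt' and 'rewritten_text' columns,
# token counts from the Mistral 7B tokenizer.
tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2')

def dataset_stats(path):
    df = pd.read_csv(path)
    tokens = [len(tokenizer.encode(text)) for text in df['rewritten_text']]
    return dict(n=len(df),
                n_prompts=df['prompt'].nunique(),
                ratio=round(df['prompt'].nunique() / len(df), 2),
                median_tokens=int(pd.Series(tokens).median()))
```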

Wandb

With the code below it is possible to log different runs in the same notebook to Weights & Biases.

```python
import wandb

# reinit=True allows starting a new run after the previous one has finished,
# so multiple experiments can be logged from the same notebook
w = wandb.init(reinit=True, project='datacentric_mistral', name=experiment_name)
...
w.finish()  # close the current run before starting the next one
```
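
For example, a hypothetical loop over the dataset groups would log one run per experiment; `train_and_evaluate` is a placeholder for the actual fine-tuning code, not a function from this repo.

```python
# One wandb run per dataset group, all from the same notebook.
for experiment_name in ['newtonbaba', 'nbroad', 'galileo', 'alexxxsem']:
    w = wandb.init(reinit=True, project='datacentric_mistral', name=experiment_name)
    val_loss = train_and_evaluate(experiment_name)  # hypothetical training function
    w.log({'val_loss': val_loss})
    w.finish()
```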

Results

Train on public datasets

| dataset | rows | unique prompts | val loss | LB score |
|---|---|---|---|---|
| newtonbaba | 2796 | 2544 | 2.32 | 0.55 |
| nbroad | 4562 | 2509 | 2.4 | 0.53 |
| galileo | 26160 | 1469 | 3 | 0.52 |
| alexxxsem | 994 | 41 | 4.74 | 0.52 |
| aishaalmahmoud | 26000 | 1700 | 2.57 | 0.52 |
| winddude | 69487 | 61947 | 2.87 | 0.53 |
| dipamc77 | 2872 | 494 | 3.22 | 0.53 |
| gemma_suppl_rewrite | 298 | 189 | 3.33 | 0.60 |
| mooney_test_with_gpt4 | 359 | - | - | 0.61 |
| high_quality_dataset_v1 | 280 | - | - | 0.59 |
| high_quality_dataset_v1 + mooney_test_with_gpt4 + gemma_suppl_rewrite_curated_with_gpt4 | 1200 | - | - | 0.62 |
  • The size of the dataset seems to be irrelevant
  • All the public datasets except gemma_suppl_rewrite seem to be useless
  • Training on my own data proved to be the best solution so far
  • The training data is very important, so maybe by creating better training data I can improve my LB score

Validation loss

Figure: validation loss evolution

  • The validation loss diverged after step 50, which is 1600 train samples (batch size 32)
  • I might have to add a custom metric, as sketched after the plots below
  • There is no relation between validation loss and LB score, which is bad

Figure: validation loss vs LB score
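
Since validation loss does not track the LB score, a custom metric closer to the competition score might help. Below is a minimal sketch, assuming the score is a semantic similarity between the recovered prompt and the true prompt; the sentence-t5-base model and the cubed (sharpened) cosine similarity are assumptions, not confirmed details of the leaderboard metric.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumption: the LB score behaves like a sharpened cosine similarity
# between sentence embeddings of predicted and true prompts.
model = SentenceTransformer('sentence-transformers/sentence-t5-base')

def prompt_similarity(predicted_prompts, true_prompts):
    pred = model.encode(predicted_prompts, normalize_embeddings=True)
    true = model.encode(true_prompts, normalize_embeddings=True)
    cosine = np.sum(pred * true, axis=1)                   # cosine similarity per pair
    return float(np.mean(np.clip(cosine, 0, None) ** 3))  # sharpen and average
```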

Conclusion

All the public datasets except gemma_suppl_rewrite seem to be useless. I was able to get a LB score of 0.60 using that dataset. This is worse than the 0.62 obtained when using my own data, so I should double down on creating more high-quality data.

Next steps

  • Create more high-quality data and fine-tune Mistral.

TODO

  • Is there any useful public dataset that I can use directly? Measure text length and prompt diversity.
  • Collect useful prompts from other datasets
  • Generate samples with multi-instruction prompts (similar to leaked data)
  • New notebook for sequential training
  • Gain more control over wandb for easier inspection
  • Increase batch size and maybe context length
  • Make wandb work with sequential runs on the same notebook
