
Iteration 10. Fine-tune on high quality data

08-04-2024

Goal

Can I improve my leaderboard score using the recently created high quality data?

Motivation

I believe I have to forget about mean-prompts and bad T5 embeddings and simply focus on building the best model possible for prompt recovery.

It's time to see if the newly created dataset gives better results than previous fine-tunings.

Development

Results

Train just on completions

As a first step I ran an experiment using the previous data where the model was trained just on completions. This means the model is only trained on the recovery prompts, not on generating the original and rewritten texts.
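
A minimal sketch of what "train just on completions" could look like with the trl library, assuming a Mistral-style chat template where the recovered prompt follows the `[/INST]` marker; the model id, hyperparameters, and the tiny inline dataset are illustrative, not the exact setup used here.

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Toy example row: the target (recovered prompt) comes after [/INST]
train_dataset = Dataset.from_list([{
    "text": "[INST] Original text: ...\nRewritten text: ...\n"
            "Which prompt was used to rewrite the text? [/INST] "
            "Improve the clarity of this essay."
}])

# The collator masks everything before the response template, so the loss
# is computed only on the recovered prompt, not on the original/rewritten text.
collator = DataCollatorForCompletionOnlyLM("[/INST]", tokenizer=tokenizer)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    data_collator=collator,
    max_seq_length=1024,
)
trainer.train()
```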

| ground truth          | base model | v1     | v2     | v3 (just on completions) |
|-----------------------|------------|--------|--------|--------------------------|
| gemma prompt          | 0.6742     | 0.7095 | 0.7062 | 0.716                    |
| gpt4 recovered prompt | 0.7005     | 0.8108 | 0.8126 | 0.8245                   |
| LB                    | -          | 0.61   | 0.61   | 0.61                     |

We see an improvement in validation, but no improvement on the leaderboard.

The training dynamics changed and the model overfitted earlier. This is very likely because the training output is shorter. In previous fine-tunings the best validation epoch was around 15, while in this training it was around 6.

Thus we see an improvement in validation score, and the trainings are also faster because the best epoch is reached earlier.

First trainings on new data

| version | loss        | output type | data                                                                                   | LB score |
|---------|-------------|-------------|----------------------------------------------------------------------------------------|----------|
| 1       | Full        | CoT 1/2     | mooney_test_with_gpt4                                                                  | 0.61     |
| 2       | Full        | CoT         | mooney_test_with_gpt4                                                                  | 0.61     |
| 3       | Completions | CoT         | mooney_test_with_gpt4                                                                  | 0.61     |
| 4       | Completions | Prompt      | high_quality_dataset_v1                                                                | 0.59     |
| 5       | Completions | Prompt      | high_quality_dataset_v1, mooney_test_with_gpt4                                         | 0.62     |
| 6       | Completions | Prompt      | high_quality_dataset_v1, mooney_test_with_gpt4, gemma_suppl_rewrite_curated_with_gpt4 | 0.61     |
  • The v5 model reaches the best LB score so far for any individual model (without any mean prompt combination).
  • This proves that chain of thought (CoT) prompts were not necessary. It seemed like a good idea, but in hindsight we were just saying the same thing in the thoughts and in the prompt.
  • It might be that we need more data if we train on completions and just the prompts, because the number of tokens used for training is much smaller.

Submission with multiprompt

Since the model is just predicting a short prompt, the submission is much faster: it runs in less than 3 hours. Thus it is possible to make more than one prediction for each sample and concatenate them all together.
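
A hedged sketch of the multiprompt idea, assuming the fine-tuned `model` and `tokenizer` from the training sketch above; the generation parameters and the way predictions are joined are illustrative choices, not the exact submission code.

```python
import torch

def recover_prompts(prompt_text: str, n_predictions: int = 3) -> str:
    """Sample several recovered prompts for one test row and concatenate them."""
    inputs = tokenizer(prompt_text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=40,
            do_sample=True,               # sampling so the predictions differ
            temperature=0.7,
            num_return_sequences=n_predictions,
        )
    # Decode only the newly generated tokens for each sampled sequence
    new_token_start = inputs["input_ids"].shape[1]
    prompts = [
        tokenizer.decode(out[new_token_start:], skip_special_tokens=True)
        for out in outputs
    ]
    # Concatenate all predictions into a single submission string
    return " ".join(prompts)
```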

| model version | single prompt LB score | multiprompt LB score |
|---------------|------------------------|----------------------|
| 4             | 0.59                   | 0.61                 |
| 5             | 0.62                   | 0.64                 |
| 6             | 0.61                   | 0.62                 |

All the experiments improve when making multiple predictions. 0.64 is the best result so far with a single model. This could be a way forward, but I'm still far from 0.70.

Do not quantize Mixtral gates

I have read that quantizing Mixtral gates could be problematic. Thus I have created a notebook to see if I can avoid that quantization.

If we don't quantize the gates and the lm_head of Mixtral, the memory usage is almost the same, since they are small layers.

Just add `llm_int8_skip_modules=['gate', 'lm_head']` to the quantization configuration.
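
For reference, a sketch of where that argument goes, assuming a 4-bit NF4 setup with `BitsAndBytesConfig` from transformers; the quantization details other than `llm_int8_skip_modules` are assumptions about the setup, not taken from this log.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    llm_int8_skip_modules=["gate", "lm_head"],  # keep MoE router gates and lm_head unquantized
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
```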


To see the effect of this change I have launched two experiments:

  1. Fine-tune a model without quantizing those layers
  2. Resubmit the fork from the forum where I replaced Mistral by Mixtral.

The results on leaderboard do not change. The fine-tuned model gets 0.61 and the fork gets 0.62, exactly as before.

What if I fine-tune Mistral?

Validation loss during training is slightly higher: 0.7679 vs 0.7456

https://www.kaggle.com/datasets/ahmadsaladin/mistral-7b-it-v02

However, on the leaderboard I get exactly the same score as with Mixtral: 0.61.

This is the second piece of evidence against the use of Mixtral; the first was the forum notebook where I simply replaced Mistral with Mixtral and got the same score.

Reducing LoRA size on Mistral

| lora_r | best_val_loss | model_slug | LB   |
|--------|---------------|------------|------|
| 16     | 0.6143        | mistral_v2 | 0.62 |
| 8      | 0.6234        |            |      |
| 4      | 0.6128        |            |      |
| 2      | 0.6128        |            |      |
| 1      | 0.5916        | mistral_v3 | 0.62 |
  • The size of the LoRA weights decreases from 54.6 MB to 3.4 MB.
  • Maybe the task is easy, so we are just learning some kind of "good prompt".
  • Maybe with bigger datasets the value of r becomes relevant.
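
A minimal sketch of the kind of LoRA configuration swept in the table above, assuming the PEFT library; the target modules, alpha, and dropout are illustrative assumptions, only the `r=1` choice comes from the experiments.

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=1,                      # r=1 matched r=16 on LB with far fewer trainable weights
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative
    task_type="CAUSAL_LM",
)

# `model` is assumed to be the (quantized) base model loaded earlier
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```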

Combination of transformers

I tried making a submission with different models, but it scored worse than using the same model multiple times.

Conclusions

  • By fine-tuning on my own high quality data I was able to reach a LB score of 0.62, better than the previous 0.61.
  • Moreover by making multiple inferences with the same model I was able to improve that score to 0.64.
  • Training with LoRA r=1 gave the same or better results than r=16, suggesting that this task does not need big changes to the model.
  • There is no evidence that Mixtral gives better results than Mistral.
  • If I concatenate the model predictions with the mean prompt that scores 0.63, I'm able to reach a LB score of 0.66.

Next steps

  • Try using Llama 2 13B, https://www.kaggle.com/models/metaresearch/llama-2/pyTorch/13b-chat-hf
  • Why is Mixtral not getting better results than Mistral?

TODO

  • What if MoE does not deal correctly with quantization? Should I leave some layers as they are?
  • What if I make multiple predictions with the same model and concatenate them?
  • Or with different adapters?
  • Upload new models to: https://www.kaggle.com/models/ironbar/mixtral-prompt-recovery
  • Create new data with Newtonbaba? The prompts didn't look right.
  • Evaluate new dataset
  • New data with multiple prompt instructions.
  • What if I fine-tune Mistral instead of Mixtral?

Last update: 2024-04-15