# Iteration 12. Fine-tuning Phi-2 and Gemma
11-04-2024
## Goal
Try fine-tuning and making predictions with Phi-2 and Gemma-7b-it.
## Motivation
My experiments with Mistral, Mixtral, and Llama 2 13B show tiny differences between the models. Maybe I can get the same results using Phi-2 and Gemma-7b-it. If that is the case, an ensemble of all the models would likely score better than making multiple submissions with the same model.
Also, a new Mistral-22B model is out that is worth trying.
## Development
The idea is to use the models in the Hugging Face Transformers format so that I can reuse the fine-tuning code from previous models. I will have to look at the different prompt formats of the new models.
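As a rough illustration of that reuse, this is what the shared setup could look like with Transformers plus PEFT. This is a minimal sketch, not the exact training code: the LoRA hyperparameters are placeholders, and the target module names would need checking per architecture.

```python
# Minimal sketch of a LoRA fine-tuning setup that is reused across models by
# changing only the model id (hyperparameters here are illustrative).
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"  # or "google/gemma-7b-it", "Vezora/Mistral-22B-v0.2"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Attention projection names differ between architectures, so this list
    # may need adapting per model; q_proj/v_proj exist in both Phi-2 and Gemma.
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trained
```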
## Prompt format

### Phi-2

https://www.kaggle.com/models/Microsoft/phi/Transformers/2

```
Instruct: Write a detailed analogy between mathematics and a lighthouse.
Output: Mathematics is like a lighthouse. Just as a lighthouse guides ships safely to shore, mathematics provides a guiding light in the world of numbers and logic. It helps us navigate through complex problems and find solutions. Just as a lighthouse emits a steady beam of light, mathematics provides a consistent framework for reasoning and problem-solving. It illuminates the path to understanding and helps us make sense of the world around us.
<|endoftext|>
```
### Gemma

https://www.promptingguide.ai/models/gemma

```
<start_of_turn>user
knock knock<end_of_turn>
<start_of_turn>model
who is there<end_of_turn>
<start_of_turn>user
Gemma<end_of_turn>
<start_of_turn>model
Gemma who?<end_of_turn><eos>
```
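As an aside, the Gemma tokenizer on Hugging Face ships a chat template, so the turn markers do not need to be hard-coded. A small sketch, assuming access to `google/gemma-7b-it`:

```python
# Build the Gemma prompt from the chat template shipped with the tokenizer,
# instead of writing the <start_of_turn> markers by hand.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
messages = [{"role": "user", "content": "knock knock"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
# Expected to print something like:
# <bos><start_of_turn>user
# knock knock<end_of_turn>
# <start_of_turn>model
```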
### Mistral-22B

https://huggingface.co/Vezora/Mistral-22B-v0.2

```
<s>### System: You are a helpful assistant.
### Human: Give me the best chili recipe you can
### Assistant: Here is the best chili recipe...</s>
```
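Since Phi-2 and this Mistral-22B fine-tune do not ship chat templates as far as I know, their prompts have to be built by hand. A minimal sketch following the formats above; the helper names are mine, not from any library:

```python
# Hand-built single-turn prompts following the formats above. During
# fine-tuning `response` holds the target text; at inference it is left
# empty so the model completes it.

def phi2_prompt(instruction: str, response: str = "") -> str:
    return f"Instruct: {instruction}\nOutput: {response}"

def mistral22b_prompt(instruction: str, response: str = "",
                      system: str = "You are a helpful assistant.") -> str:
    # Assumes the tokenizer adds the <s>/</s> special tokens, so they are
    # not written into the string here.
    return (f"### System: {system}\n"
            f"### Human: {instruction}\n"
            f"### Assistant: {response}")
```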
## Results
These are the results of fine-tuning the models on the same data. Despite the different number of parameters, they all score around the same.
| model        | LB score |
|--------------|----------|
| Mistral 7B   | 0.62     |
| Llama 2 13B  | 0.61     |
| Mistral 22B  | 0.62     |
| Mixtral 8x7B | 0.61     |
I haven't made a submission with Gemma or Phi-2: given that all the other models score almost the same, it does not look promising, and I have few submissions left.
## Conclusion

Fine-tuned models ranging from 7B to 8x7B parameters all score around 0.61-0.62 on the leaderboard, so swapping the base model alone is unlikely to move the score; combining diverse models in an ensemble looks like the more promising direction.
## Next steps
## TODO
- [ ]