
Iteration 1. Biggest model

18-03-2024

Goal

What is the biggest model that can be used to make a submission?

Motivation

Scaling laws say that bigger models give better results. To be competitive we have to use the biggest model available for the challenge.

Development

Candidates study

The most popular open-source models are: Mistral, Llama, Phi and Gemma.

Since we want to use the biggest model possible, Phi is excluded: the Phi-2 model has just 2.7B parameters, and the biggest Gemma model has 7B parameters.

Mistral's releases claim that the Mistral 7B model is better than the Llama 2 13B model (and of course better than Llama 2 7B). If we trust those claims, it doesn't make sense to use the Llama 2 models.

[Figure: mistral_vs_llama benchmark comparison]

Mixtral is the best model, but it has 56B parameters; even quantized to 4 bits that is roughly 28 GB of weights, which will fit very tightly in 32 GB of VRAM. I have to test whether I can make reliable predictions using Mixtral and whether I can fine-tune it.

[Figure: mixtral_table benchmark results]

Google claims that Gemma is better than Mistral and Llama 2. The differences between Mistral 7B and Gemma 7B seem to be context-dependent: in some contexts like math and code Gemma is better, while on reasoning and real-life scenarios Mistral seems to be better.

[Figure: mixtral_is_the_best benchmark comparison]

If possible I should use Mixtral because it's the most powerful model available. If I'm unable to use Mixtral then I should go with Mistral 7B or Gemma 7B.


First steps with Mixtral

conda create -n prometeo pytest rope pylint tqdm numpy pandas scikit-learn ipython ipykernel coverage ipywidgets matplotlib python=3.10 -y
conda activate prometeo
pip install autotransformers

Downloading the model from Kaggle took around 2 hours; it is a 151 GB .tar.gz file. However, inside it the weights are stored in two different formats, so the model itself ends up weighing around 93 GB, which suggests it is saved in float16 format.

One trick was to copy the model to the SSD: there I was able to read it in less than 1 minute, compared to 12 minutes on Kaggle and 24 minutes when reading from an HDD.

Just by creating the environment with the instructions above and downloading the model from Kaggle, I was able to run the model on my PC without trouble, at a speed of 10 tokens/s.
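
For reference, this is roughly how the model can be loaded with transformers; the local path is just where I extracted the Kaggle archive, and the 4-bit quantization config is an assumption to make the model fit in VRAM, not the exact code from the Kaggle sample.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Hypothetical local path where the Kaggle archive was extracted (on the SSD)
model_path = "/mnt/ssd/mixtral-8x7b-instruct-v0.1-hf"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",  # spread the layers over the available GPUs
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,                     # 4-bit weights to fit in VRAM
        bnb_4bit_compute_dtype=torch.float16,  # compute in float16
    ),
)

prompt = "[INST] Improve this text: the cat sat in the mat. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))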

Prompt engineering with Gemma 2b

I have been playing with Gemma 2b-it because it is fast enough to make predictions with.

The problem is that the model is pretty dumb. It very frequently ignores the given instructions, so doing prompt engineering with it is challenging.

One option could be to divide the task in two:

  1. Create a list with the differences between the two texts
  2. Given the list of differences, summarize them into a prompt

This is probably the chain of thought that a person would likely follow to solve the problem.
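
A minimal sketch of how that two-step chain could look with transformers; the model id and the prompt wording are illustrative guesses, not something I have validated yet.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16)

def chat(user_message, max_new_tokens=256):
    # Wrap the message in Gemma's chat template and return only the generated answer
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": user_message}],
        tokenize=False, add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)

def recover_prompt(original_text, rewritten_text):
    # Step 1: create a list with the differences between the two texts
    differences = chat(
        "List the differences between these two texts.\n\n"
        f"Original:\n{original_text}\n\nRewritten:\n{rewritten_text}"
    )
    # Step 2: summarize the differences into the instruction that was likely used
    return chat(
        "Given this list of differences between an original text and its rewritten "
        f"version, write the instruction that was most likely used:\n\n{differences}"
    )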

Results

Mixtral can be used for inference

I have made a few submissions with tiny changes in input formatting and generation parameters that scored 0.51 and 0.52. It's a pity that an LLM scores below a simple sentence like "Improve this text", but the good thing is that now I know that it is possible to use Mixtral for inference.

Thus Mixtral should be my preferred workhorse for this challenge. Unless I'm unable to fine-tune it, I should use Mixtral until the end of the challenge.

GPU Memory

Input tokens

[Figure: memory and inference time vs number of input tokens]

Memory and inference time grow linearly with the input tokens.

Output tokens

[Figure: memory and inference time vs number of output tokens]

At the scale of tokens that we are going to generate, memory usage is roughly constant and inference time scales linearly with the number of output tokens.

Batch size

Memory grows linearly with the batch size, as expected. Unless we use a very small input size at inference, batching won't be beneficial.
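
These measurements can be reproduced with a small helper like the one below; it is a sketch of the methodology (dummy prompts of a fixed length, peak memory on the default GPU), not the exact script I used.

import time
import torch

def measure_generation(model, tokenizer, n_input_tokens, batch_size=1, max_new_tokens=32):
    # Build a dummy batch of the requested length by repeating a single token
    input_ids = torch.full(
        (batch_size, n_input_tokens), tokenizer.eos_token_id,
        dtype=torch.long, device=model.device)
    torch.cuda.reset_peak_memory_stats()
    start = time.time()
    model.generate(input_ids, max_new_tokens=max_new_tokens)
    elapsed = time.time() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1e9  # peak memory on the default GPU
    return peak_gb, elapsed

# Example: sweep the input length while keeping the other parameters fixed
# for n in [500, 1000, 2000, 4000]:
#     print(n, measure_generation(model, tokenizer, n))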

Maximum submission input tokens

When loading the model there is a device_map parameter that is set to auto in the code samples. This results in unbalanced memory usage between the GPUs, as shown in the table below.

| device_map                      | GPU 0 memory (GB) | GPU 1 memory (GB) |
|---------------------------------|-------------------|-------------------|
| auto                            | 10.3              | 13.2              |
| create_shared_device_map(16)    | 11.8              | 11.8              |
| create_intertwined_device_map() | 11.8              | 11.8              |

It seems I can do inference reliably with up to 7200 input tokens, but I needed to carefully balance the layers of the model between the 2 GPUs. With the previous auto configuration only 3500 input tokens were possible. Since I was able to make a submission with that configuration, it implies that none of the samples of the hidden test set has an input longer than 3500 tokens.

It is not clear whether create_shared_device_map is faster than create_intertwined_device_map. The first one splits the model into two halves, so GPU 0 runs the first stage of the model and GPU 1 the second stage. The intertwined strategy assigns the layers alternately to each GPU; it needs more communication between GPUs, but heat dissipation is likely to be better.
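
This is a sketch of what those two helpers can look like; the module names follow the HuggingFace Mixtral implementation (32 decoder layers), and the exact code is simplified.

def create_shared_device_map(split_layer, n_layers=32):
    # First `split_layer` decoder layers on GPU 0, the rest plus the head on GPU 1
    device_map = {"model.embed_tokens": 0, "model.norm": 1, "lm_head": 1}
    for i in range(n_layers):
        device_map[f"model.layers.{i}"] = 0 if i < split_layer else 1
    return device_map

def create_intertwined_device_map(n_layers=32):
    # Alternate the decoder layers between the two GPUs
    device_map = {"model.embed_tokens": 0, "model.norm": 1, "lm_head": 1}
    for i in range(n_layers):
        device_map[f"model.layers.{i}"] = i % 2
    return device_map

The resulting dictionary is passed as the device_map argument when loading the model, for example device_map=create_shared_device_map(16).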

Mixtral has a maximum context window of 32k tokens, so we are very far from there.
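
The context window can be checked directly from the model config; a quick sketch, assuming the same local model path as before:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("/mnt/ssd/mixtral-8x7b-instruct-v0.1-hf")
print(config.max_position_embeddings)  # 32768 for Mixtral-8x7B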

Which LLMs are fast enough to be used for inference?

As a first step I tried different models; the table below shows the speed in tokens per second.

LLM LMstudio Ubuntu LMstudio Windows Victor Windows P4 gpu GGUF 4 bits 2xP4 4 bits 1xP4
phi 2 3B q8 131 60 50
Gemma 3B it - 31 38
Llama 2 7B q8 76 37 25 7.9 10.4
mistral 7B q8 75.5 40 60 23 10 11.3
Gemma 7B it q8 - 17 30 15
Llama 2 13B 15
Mixtral 4.1

Mixtral 8x7B is about twice as slow as Mistral 7B despite having 8 times more parameters. That is the magic of a sparse mixture of experts: each token is routed through only 2 of the 8 experts in every layer, so the number of active parameters per token is far smaller than the total parameter count.

Conclusion

It is possible to make submissions using Mixtral. It is the biggest and most capable model that can be used for this challenge.

I could use an input size of up to 7200 tokens, which is around 400 lines or 4700 words. That leaves a lot of room to play with prompt engineering and few-shot prompting.
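
A prompt can be checked against that budget with the tokenizer before submitting; a minimal sketch, reusing the hypothetical local model path from above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/mnt/ssd/mixtral-8x7b-instruct-v0.1-hf")

def fits_in_budget(prompt, max_input_tokens=7200):
    # Count the tokens of the prompt and compare against the measured limit
    n_tokens = len(tokenizer(prompt).input_ids)
    return n_tokens, n_tokens <= max_input_tokens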

Next steps

TODO

  • Which LLMs are fast enough to be used for inference?
      • Llama 2
      • Mistral 7B
      • Phi-2
      • Gemma
      • Which speed can I get on my computer using LMstudio?
  • Mixtral
      • Can I make reliable inference with it?
      • Can I fine-tune it?
      • https://www.kaggle.com/code/ashishkumarak/mixtral-moe-8x7b-instruct-inference-t4-2-gpu
      • Fine-tune Mixtral-8x7B on Your Computer (QLoRA)
      • https://www.kaggle.com/models/mistral-ai/mixtral/frameworks/PyTorch/variations/8x7b-instruct-v0.1-hf/versions/1
  • Which dataset was used to fine-tune Guanaco? The QLoRA paper says it was fine-tuned in less than one day
  • How can I make a submission with a HuggingFace model?
  • How much could I improve the evaluation speed by using a more powerful GPU?
  • Which LLMs can I fine-tune and use for inference?
      • Fine-tuning de grandes modelos de lenguaje con Manuel Romero | Hackathon Somos NLP 2023. It is a bit dated because it is from a year ago, but the theory is very well explained.
      • https://github.com/somosnlp/recursos/blob/main/hackathon_2024/entrenamiento_llm_instrucciones.ipynb

Last update: 2024-03-21