Solution Summary
This was my first hands-on experience with LLMs, so I'm grateful to Kaggle and Google for giving me the incentive to try them. It was a fun competition, although I was not able to find a way to consistently improve my LB score. Almost everything I tried failed. It turned out that the black magic behind the top scores was an adversarial attack against the T5 embedding model.
Solution in a nutshell
The solution is an ensemble of versions of the following models, fine-tuned with LoRA:
| model |
| --- |
| Mistral 7B |
| Llama 2 13B |
| Mistral 22B |
| Mixtral 8x7B |
The models were fine-tuned on a small dataset of around 1,000 samples created with GPT4.
| submission | public LB | private LB |
| --- | --- | --- |
| MMML v1 | 0.64 | 0.65 |
| MMML v1 + 0.63 prompt | 0.66 | 0.67 |
The predictions of the models were concatenated, and the 0.63 public prompt was also appended.
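A minimal sketch of that final step, assuming a hypothetical `build_submission` helper; `MEAN_PROMPT` is the public 0.63 prompt quoted in the Mean prompts section below:

```python
# Hedged sketch of the final ensembling step; build_submission is a
# hypothetical helper, and MEAN_PROMPT is the public 0.63 prompt.
MEAN_PROMPT = ("Please improve this text using the writing style with "
               "maintaining the original meaning but altering the tone.")

def build_submission(predictions: list[str]) -> str:
    # One recovered prompt per fine-tuned model, plus the appended mean prompt.
    return " ".join(predictions + [MEAN_PROMPT])
```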
Links
- Web with all the work done
- Github repo
- Submission notebook
- Solution presentation slides
- Youtube video
What worked?
Making multiple predictions
Contrary to some beliefs expressed in the forum arguing that long predictions hurt the similarity score, I found that simply making n predictions and concatenating them improved the LB score.
For example, I could improve the score from 0.62 to 0.64 simply by doing 4 inferences with the v5 model.
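A minimal sketch of the trick, assuming a Hugging Face text-generation pipeline around one of the fine-tuned models; the model name, query format, and generation settings are illustrative assumptions:

```python
# Hedged sketch: sample n candidate prompts and concatenate them into one
# long prediction. Model name and generation settings are assumptions.
from transformers import pipeline

generator = pipeline("text-generation",
                     model="mistralai/Mistral-7B-Instruct-v0.2",
                     device_map="auto")

def recover_prompt(original: str, rewritten: str, n: int = 4) -> str:
    query = (f"Original text:\n{original}\n\nRewritten text:\n{rewritten}\n\n"
             "Which prompt was used for the rewrite?")
    outputs = generator(query, do_sample=True, temperature=0.8,
                        max_new_tokens=40, num_return_sequences=n,
                        return_full_text=False)
    # Concatenating the candidates lengthens the prediction, but raised the LB score.
    return " ".join(out["generated_text"].strip() for out in outputs)
```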
Fine-tuning on my own data
Fine-tuning LLMs on my own data returned an LB score of 0.62. By comparison, the best score obtained by fine-tuning on public data was 0.60.
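For reference, a minimal LoRA setup with Hugging Face peft could look like the sketch below; the rank, alpha, and target modules are illustrative assumptions, not the exact configuration I used:

```python
# Minimal LoRA sketch with peft; hyperparameters are illustrative assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2", device_map="auto")
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapter weights are trainable
# From here the adapters can be trained with a standard Trainer / SFTTrainer loop.
```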
Few-shot prompt
This worked surprisingly well. Depending on the given samples, it was possible to score up to 0.62 on the leaderboard. However, inference was slow due to the big input prompt. Fine-tuning a model resulted in much faster inference that allowed for multiple inferences.
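A few-shot prompt for this task could look like the sketch below; the worked example is invented for illustration, not one of the samples I actually used:

```python
# Hypothetical few-shot prompt for prompt recovery. The worked example is
# invented for illustration; the actual samples used are not reproduced here.
FEW_SHOT_TEMPLATE = """Given an original text and its rewritten version, \
recover the prompt that was used for the rewrite.

Original: The weather was nice, so we went for a walk.
Rewritten: Verily, the heavens smiled upon us, and so we strolled forth.
Prompt: Rewrite this text as if it were written by Shakespeare.

Original: {original}
Rewritten: {rewritten}
Prompt:"""

def build_few_shot_prompt(original: str, rewritten: str) -> str:
    return FEW_SHOT_TEMPLATE.format(original=original, rewritten=rewritten)
```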
Mean prompts
Making a submission with *Please improve this text using the writing style with maintaining the original meaning but altering the tone.* returned a score of 0.63.
This is probably a combination of a poorly chosen metric, a dumb model, and the optimization power of Kaggle, where 2k people make 5 submissions a day.
But it is possible that the mean prompt will work just as well on the hidden test set, so I believe I should make inferences both with and without the mean prompt.
What didn't work?
Validation dataset, what is the task to learn?
The lack of any validation data made it very difficult to design and improve a solution for the challenge. I could only rely on a small public dataset and leaderboard scores with rounded precision.
At the start of the challenge I tried using some datasets for validation, but did not find a clear correlation between validation results and the leaderboard.
Bigger models
At the start of the challenge I thought that Mixtral was going to make a difference. In reality, I have not seen any significant difference between Mistral, Mixtral, or Llama.
Public datasets
The best LB score obtained when fine-tuning on public data was 0.60, but most of the datasets scored around 0.53.
Trying to improve my data
I tried different strategies to improve my data:
- Multi-instruction prompts
- Imitate leaked prompts
- Prompts on more topics as suggested by GPT4
- Added hints to datasets
None of them gave improvements on the leaderboard score.
Training with similar prompts
I tried augmenting the training data using similar prompts: prompts that had the same meaning and scored high on T5 embeddings. It didn't improve the LB score.
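A sketch of how such an augmentation filter could look, assuming sentence-t5 embeddings and an illustrative 0.9 threshold:

```python
# Hedged sketch of the augmentation filter: a paraphrased prompt is kept only
# if it stays close to the original in T5 embedding space. The encoder choice
# and the 0.9 threshold are assumptions.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/sentence-t5-base")

def filter_paraphrases(prompt: str, paraphrases: list[str],
                       threshold: float = 0.9) -> list[str]:
    embeddings = encoder.encode([prompt] + paraphrases, convert_to_tensor=True)
    similarities = util.cos_sim(embeddings[0], embeddings[1:])[0]
    return [p for p, s in zip(paraphrases, similarities) if s.item() >= threshold]
```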
Scaling the data
I did not get any improvement after generating 2k new samples using GPT4.
Using GPT4 to find a better mean prompt
Maybe I did not make enough attempts (around 8), but GPT4 was unable to find a better mean prompt given the publicly available mean prompts and their scores.
Criticism
There is ambiguity in the task of recovering a prompt: multiple prompts can lead to the same output. Prompts could be generic or specific and lead to the same outcome, e.g. *Rewrite the sentence using a different word choice.* and *Rewrite the sentence replacing beautiful by cute.*
Thus, choosing similarity as the challenge metric was likely a poor choice. Perplexity could have been used instead, although at a higher computational cost.
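For context, the similarity metric being criticised can be sketched as follows; the sentence-t5 encoder and the sharpening exponent of 3 follow the publicly discussed metric and should be treated as assumptions:

```python
# Sketch of a sharpened cosine similarity between a predicted and an actual
# prompt, using sentence-t5 embeddings. Encoder name and exponent are
# assumptions based on the publicly discussed metric.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/sentence-t5-base")

def sharpened_cosine_similarity(predicted: str, actual: str, p: int = 3) -> float:
    emb = encoder.encode([predicted, actual], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() ** p
```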
Learnings
- I have learned to use and fine-tune LLMs
- I have learned that it is possible to quantize LLMs to float4 and they still work very well (see the loading sketch after this list)
- I have learned that it is possible to do an adversarial attack against a text embedding model
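A minimal 4-bit loading sketch with bitsandbytes via transformers; the quantization type and model name are illustrative:

```python
# Hedged sketch of 4-bit quantized loading. bitsandbytes offers fp4 and nf4
# 4-bit types; nf4 is shown here, and the model name is an assumption.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_config,
    device_map="auto",
)
```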