Iteration 7. Pause and analysis
03-04-2024
Goal
I have the tools (fine-tuning and few-shot prompting) but I do not know what task to learn. Let's analyze all the work done so far and think of ways to advance in the challenge.
Facts
- A simple prompt like `Improve the text to this` gets 0.60 on the leaderboard, while many few-shot prompt submissions with a big model like Mixtral score below that.
- Some teams have been able to consistently and iteratively improve their score on the LB. They have slowly climbed up to 0.70.
- The uncertainty of the LB score is around 0.02.
- The host has confirmed that the test set splits are random.
- By adding `Improve the text to this.` to the predictions of Mixtral I have seen consistent improvements of 0.02 in LB score, from 0.61 to 0.63.
- On local validation I have been able to reach a score of 0.81 when training and evaluating on prompts recovered by GPT4.
- When trying different datasets for the few-shot prompt I have observed great variability in LB scores:
    - Using my hand-made examples I get 0.61.
    - Using the `newtonbaba` dataset I get 0.62.
    - Using the `alexxxem` dataset I get 0.51.
- The sharpened cosine similarity metric rewards greater variance: if two distributions of similarities have the same mean, the one with the greater variance gets the greater sharpened score (see the first sketch after this list).
- Using a dataset of mean prompts and their LB scores we could measure how similar local datasets are to the LB (see the second sketch after this list).
- The outputs from Gemma have likely been post-processed, because public datasets show that Gemma very frequently reveals the given prompt in its response.
- When evaluating prompt variations created by GPT4 that preserved the original meaning, the score was always above 0.70.
- Many different prompts can lead to the same response. A prompt can be generic or detailed and produce the same result.
- It seems that the style of the prompt is not relevant: I have tried many prompt variations with few-shot prompting and observed almost no variation in LB score.
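
To make the variance claim above concrete, here is a minimal numpy check. It only models the final aggregation step, assuming the LB metric averages cubed cosine similarities (the "sharpening"); the real metric computes those similarities from embeddings.

```python
import numpy as np

# Two sets of per-sample cosine similarities with the same mean (0.70)
# but different spread. Cubing is convex on [0, 1], so spreading the
# values raises the mean of the cubes: a few very good predictions
# outweigh a few bad ones.
low_variance = np.array([0.70, 0.70, 0.70, 0.70])
high_variance = np.array([0.40, 0.60, 0.80, 1.00])

print(low_variance.mean(), high_variance.mean())  # both 0.70
print((low_variance ** 3).mean())   # 0.343
print((high_variance ** 3).mean())  # 0.448
```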
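
And a sketch of the dataset-similarity idea: if mean prompts that score high on the LB also score high against a local dataset, that dataset probably resembles the hidden test set. This assumes the metric is the mean cubed cosine similarity of sentence-t5-base embeddings; `mean_prompts` and `lb_scores` would come from my own submissions, and `local_prompts` are the ground-truth prompts of the candidate dataset.

```python
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/sentence-t5-base')

def sharpened_score(prediction, references):
    # Mean cubed cosine similarity of one prediction against a list of
    # reference prompts. Normalized embeddings make dot product = cosine.
    pred_emb = model.encode([prediction], normalize_embeddings=True)
    ref_embs = model.encode(references, normalize_embeddings=True)
    cosines = (ref_embs @ pred_emb.T).ravel()
    return float((cosines ** 3).mean())

def dataset_lb_correlation(mean_prompts, lb_scores, local_prompts):
    # Rank-correlate each mean prompt's local score with its LB score.
    local_scores = [sharpened_score(p, local_prompts) for p in mean_prompts]
    return spearmanr(local_scores, lb_scores).correlation
```

A high correlation would suggest the local dataset behaves like the hidden test set, at least as seen through mean prompts.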
Why is a simple baseline beating intelligent LLMs?
- Ambiguity of the task. If a generic prompt was used and a specific prompt is predicted, the similarity will be low.
- The model is not guessing the prompt correctly. However, on validation I get good scores, so this would imply that the test dataset is much harder than my validation datasets.
Motivation
Development
Results
I need actionable insights!!!
Conclusion
Next steps
My hunch is that the best solution is Mixtral fine-tuned on the right data.

- What if I just fork the Mistral notebook and replace the model with Mixtral?
- Analyze the few-shot prompt examples with GPT4.
- What if I request Mixtral to answer only when it is very sure and the prompt is evident? (Play safe)
- What if I just focus on building the best possible model and pray for the shakeup?
- Try again with perplexity/model inversion. The problem is that the output has likely been post-processed.
- I could do more experiments with few-shot prompting, e.g. selecting random samples from a dataset (see the sketch after this list).
- Could I run some mean-prompt optimization using GPT4? I have read that some people have done that but it does not transfer well to the LB.
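
A minimal sketch of the random few-shot sampling idea from the list above. It assumes the dataset is a list of dicts; the field names (`original_text`, `rewritten_text`, `rewrite_prompt`) are assumptions, not a confirmed schema.

```python
import random

def build_few_shot_prompt(dataset, original_text, rewritten_text, k=4, seed=None):
    # Sample k examples from a candidate dataset, so each submission
    # measures the dataset itself rather than a fixed hand-picked set.
    rng = random.Random(seed)
    examples = rng.sample(dataset, k)
    parts = []
    for example in examples:
        parts.append(
            f"Original text: {example['original_text']}\n"
            f"Rewritten text: {example['rewritten_text']}\n"
            f"Prompt: {example['rewrite_prompt']}\n"
        )
    # Finish with the query sample, leaving the prompt to be completed.
    parts.append(
        f"Original text: {original_text}\n"
        f"Rewritten text: {rewritten_text}\n"
        f"Prompt:"
    )
    return "\n".join(parts)
```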
Option 1. Create a high-quality hard dataset
The samples need to make sense, but at the same time be hard for my current best model. Then fine-tune a model on another dataset and measure the improvement.
This would work if we are failing on the LB because the test set is hard.
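
A hypothetical mining loop for this option, reusing `sharpened_score` from the earlier sketch. `predict` stands for my current best model and the threshold is arbitrary; both are assumptions for illustration.

```python
def mine_hard_samples(samples, predict, threshold=0.6):
    # Keep only the samples where the current best model's predicted
    # prompt scores poorly against the true prompt.
    hard = []
    for sample in samples:
        predicted = predict(sample['original_text'], sample['rewritten_text'])
        if sharpened_score(predicted, [sample['rewrite_prompt']]) < threshold:
            hard.append(sample)
    return hard
```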
TODO
- [ ]
Last update: 2024-04-05