Iteration 1. Few-shot prompting
16-07-2024
Goal
How far can we go using few-shot prompting? What is the best way to encode the grids for an LLM?
Motivation
I want to do a quick iteration where I take an LLM (e.g. Phi-3) and use few-shot prompting. I will give different solved problems as input and see how well the LLM does on both the validation and the test set.
Development
All the work has been done in this notebook.
I have tried using the Phi-3 model and few-shot prompting to solve ARC tasks. I have chosen Phi-3 because its context length of 128k tokens allows giving many ARC tasks as few-shot samples.
vLLM
Using vLLM allows a context size of 61k tokens with 2xT4 GPUs. If I use the transformers library directly I can only use a context size of 4k.
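For reference, a minimal sketch of a vLLM setup along these lines (the exact model name, context size and sampling settings are assumptions, not the notebook's actual configuration):

```python
# Minimal vLLM sketch: load Phi-3 across 2xT4 GPUs with a reduced context.
# Model name, max_model_len and sampling settings are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-mini-128k-instruct",
    max_model_len=61000,        # largest context that fits in 2xT4 memory
    tensor_parallel_size=2,     # split the model across the two GPUs
    dtype="half",               # T4 GPUs do not support bfloat16
)

prompt = "..."  # few-shot prompt built from the encoded ARC grids
sampling_params = SamplingParams(temperature=0.0, max_tokens=1024)
outputs = llm.generate([prompt], sampling_params)
prediction = outputs[0].outputs[0].text
```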
Grid encoding
Some tasks require quite a big context. E.g. imagine a task with 4 train examples whose grids are 30x30: counting one input and one output grid per example plus the test pair, we will require at least 30x30x5x2 = 9000 tokens (assuming at least one token per grid cell). Thus I believe that we should try to use the encoding that uses the least amount of tokens possible. For Phi-3 that is simply to write the numbers without spaces.
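A minimal sketch of this compact encoding (function name is illustrative; the notebook's exact format may differ):

```python
# Encode an ARC grid as digits without separators, one row per line.
def encode_grid(grid: list[list[int]]) -> str:
    return "\n".join("".join(str(cell) for cell in row) for row in grid)

print(encode_grid([[0, 1, 2],
                   [3, 4, 5]]))
# 012
# 345
```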
Results
Zero-shot baseline
As a first step I have tried a very simple baseline where I give input grids to the assistant and the assistant replies with the output for each grid. We go through all the train samples this way, then give the test input and use the response of the model as the prediction. In addition I also use data augmentations (flips and rotations) to make up to two predictions for each task. The data augmentation is also useful because sometimes the prediction of the model is invalid, so we have to make multiple predictions to get 2 valid responses.
train | evaluation | test |
---|---|---|
6.40% | 2.50% | 0% |
This approach is able to solve some of the train and evaluation tasks, but it does not solve any of the test tasks.
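A minimal sketch of the geometric augmentations mentioned above, assuming grids are handled as numpy arrays (the notebook may implement this differently). The same transform has to be applied to every grid of the task, and the inverse transform to the model's prediction:

```python
import numpy as np

def augmentations(grid: np.ndarray):
    """Yield the 8 rotated/flipped variants of a grid."""
    for k in range(4):                # 0, 90, 180 and 270 degree rotations
        rotated = np.rot90(grid, k)
        yield rotated
        yield np.fliplr(rotated)      # horizontal flip of each rotation

variants = list(augmentations(np.array([[1, 0], [0, 2]])))
```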
Few-shot results
Using samples from the evaluation dataset, I have measured the effect of few-shot prompting. In this case I have changed the prompt style: the user shows input-output pairs to the assistant and then requests the assistant to predict the output given some input.
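A minimal sketch of how such a prompt could be assembled as a chat conversation, reusing the `encode_grid` sketch from above (field names follow the ARC JSON format; this is an illustration, not the notebook's exact prompt):

```python
def task_to_messages(task: dict, include_answer: bool) -> list[dict]:
    """Turn one ARC task into a user turn, optionally followed by the answer."""
    lines = []
    for pair in task["train"]:
        lines += [f"Input:\n{encode_grid(pair['input'])}",
                  f"Output:\n{encode_grid(pair['output'])}"]
    lines += [f"Input:\n{encode_grid(task['test'][0]['input'])}", "Output:"]
    messages = [{"role": "user", "content": "\n".join(lines)}]
    if include_answer:  # solved evaluation tasks used as few-shot examples
        messages.append({"role": "assistant",
                         "content": encode_grid(task["test"][0]["output"])})
    return messages

def build_messages(few_shot_tasks: list[dict], task: dict) -> list[dict]:
    messages = []
    for shot in few_shot_tasks:
        messages += task_to_messages(shot, include_answer=True)
    messages += task_to_messages(task, include_answer=False)
    return messages
```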
n shots | accuracy | correct_pixels | correct_size | unanswered |
---|---|---|---|---|
0 | 5.80% | 55.10% | 73.50% | 17.40% |
1 | 4.50% | 44.80% | 61.00% | 23.60% |
2 | 4.80% | 37.70% | 54.40% | 29.80% |
4 | 2.50% | 22.40% | 33.10% | 33.10% |
8 | 2.30% | 23.10% | 35.50% | 36.80% |
The results show that Phi-3 does not benefit from few-shot prompting with ARC tasks. As we give more examples, the results get worse.
Add reasoning
I have manually described with text the transformation of some of the evaluation tasks. Then I repeated the few-shot experiment, adding the reasoning before generating the grid.
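As an illustration of how the assistant turn of a few-shot example changes when the reasoning is added (the reasoning text and variable names are made up for the example; `encode_grid` is the sketch from above):

```python
reasoning = "The output keeps only the largest object and recolors it to blue."
output_grid = [[0, 1], [1, 0]]

assistant_turn = {
    "role": "assistant",
    "content": f"Reasoning: {reasoning}\nOutput:\n{encode_grid(output_grid)}",
}
```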
uses reasoning | accuracy | correct_pixels | correct_size | unanswered |
---|---|---|---|---|
No | 2.50% | 22.40% | 33.10% | 33.10% |
Yes | 1.00% | 19.00% | 30.70% | 42.50% |
The model does not understand the puzzles. The examples and the reasoning are not useful.
Different models, zero-shot
Since the best results were obtained with the 0-shot setup, I could try using different models. I can make submissions without using compute time, so I can see whether any of the models is able to solve some task from the test set.
model | test |
---|---|
Phi-3 | 0 |
Mistral 7b | 0 |
Llama 3 8b | 1 |
Llama 3 is able to solve one of the tasks from the test set. To better compare the models I should evaluate them on the public data, but I don't have Kaggle compute available.
Conclusion
Few-shot or zero-shot inference with current LLMs is not the way to solve the ARC challenge. The performance is very poor.
Next steps
TODO
- What is the best way to encode the grids?
- Does using reasoning and descriptions of the grids help?