Iteration 28. Refine predictions
01-09-2025
Goal
Study whether asking the model to refine its predictions is helpful
Motivation
All the evolutionary search approaches use the model to refine their most promising solutions.
I want to explore:
- Does a GPU with 25GB of VRAM allow refining predictions with the BARC induction model?
- How much improvement do we get compared to doing independent predictions?
Development
Estimate the number of tokens
Without refinement, the longest tasks are those with 4 training samples whose inputs and outputs have shape 30x30, plus a 30x30 test input. Counting the newline token at the end of each row, the grids alone account for 30*31*(4*2+1) = 8370 tokens.
In my case, adding the prompt increases the token count to 8650.
When we refine, the prompt additionally has to include:
- Code generated by the model: 1000 tokens max
- Outputs of the training samples: 3720 tokens max (4 grids of 930 tokens each)
Thus, without counting any message in the prompt, that is 13090 tokens. Being conservative, we could request 13500 tokens for the refinement prompt, and a total sequence length of 14500 tokens given that we allow the model to predict 1000 tokens.
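As a sanity check, here is the arithmetic in one place (a minimal sketch; the per-grid cost assumes one token per cell plus one newline token per row, as above):

```python
# Back-of-the-envelope token budget for refinement.
ROWS, COLS = 30, 30
tokens_per_grid = ROWS * (COLS + 1)                 # 930: one token per cell + newline per row

n_train = 4
task_tokens = (n_train * 2 + 1) * tokens_per_grid   # 8370: train pairs + test input
code_tokens = 1000                                  # model-generated code, upper bound
output_tokens = n_train * tokens_per_grid           # 3720: code outputs on the train inputs

refine_prompt = task_tokens + code_tokens + output_tokens
print(refine_prompt)   # 13090; rounded up to 13500 to be conservative
print(13500 + 1000)    # 14500 total sequence length, allowing 1000 generated tokens
```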
How much VRAM is needed for a 14500-token sequence length?
When using Unsloth I need 0.75 of the 3090's VRAM to make those predictions; with vLLM, 0.5 is enough.
If I don't quantize the model to 4-bit, then I need at least 0.8 of the memory with vLLM.
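For reference, a minimal sketch of how the vLLM engine could be configured for this budget. The model id and the bitsandbytes settings are assumptions, not the exact setup used here:

```python
from vllm import LLM, SamplingParams

# Sketch of a vLLM setup for 14500-token sequences on a 24GB card.
# Model id is assumed (the public BARC induction checkpoint);
# quantization/load_format use vLLM's bitsandbytes 4-bit path.
llm = LLM(
    model="barc0/Llama-3.1-ARC-Potpourri-Induction-8B",
    max_model_len=14500,             # ~13500 prompt + 1000 generated tokens
    gpu_memory_utilization=0.5,      # 0.8 if the model is not 4-bit quantized
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)
sampling_params = SamplingParams(max_tokens=1000, temperature=0.8)
```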
Experiment design
The easiest experiment is to create a notebook that only does solution refinement. This implies the solutions already need to be generated and saved to disk. Probably the easiest way is to reuse predictions from the search-and-learn experiments.
I could select n random unsolved predictions for each task and compare the accuracy against a baseline that does not use prediction refinement, as in the sketch below.
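A minimal sketch of that selection step; `predictions_by_task` and the `train_is_correct` key are hypothetical stand-ins for whatever the search-and-learn runs actually save:

```python
import random

def select_predictions_to_refine(predictions_by_task, n=8, seed=7):
    """Pick up to n random unsolved predictions per task.

    predictions_by_task: dict mapping task_id -> list of prediction dicts
    with a (hypothetical) boolean 'train_is_correct' field.
    """
    rng = random.Random(seed)
    selected = {}
    for task_id, preds in predictions_by_task.items():
        unsolved = [p for p in preds if not p["train_is_correct"]]
        selected[task_id] = rng.sample(unsolved, min(n, len(unsolved)))
    return selected
```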
Generate predictions to refine
```bash
export N_PREDICTIONS=8; python scripts/search_and_learn_with_unsloth.py \
    --output-dir /mnt/hdd0/Kaggle/arc25/trainings/2025-10-08-generate-predictions-to-refine/${N_PREDICTIONS}i \
    --initial-predictions ${N_PREDICTIONS}
```
Results
| initial predictions | refinement predictions | valid code | valid outputs | unique outputs | train_pixel_score | train_correct_grids | train_pass_rate | train_is_correct | test_pixel_score | test_correct_grids | test_pass_rate | test_is_correct | is_correct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 128 | 0 | 99.9% | 71.7% | 49.8% | 42.1% | 2.4% | 1.6% | 16.3% | 40.9% | 2.0% | 2.0% | 23.0% | 16.3% |
| 64 | 64 | 99.7% | 74.0% | 43.7% | 45.8% | 2.1% | 1.1% | 16.5% | 44.4% | 1.7% | 1.6% | 21.5% | 16.0% |
The baseline makes 128 predictions per task; the contender makes 64 initial predictions, selects the most promising ones (those that didn't solve the train set), and refines them.
The table shows no clear difference between the approaches: both solve almost the same share of tasks, around 16%.
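For context, "most promising" can be read as ranking the unsolved predictions by train pixel score and refining the top ones; a hedged sketch (the field names mirror the table columns but are hypothetical as code):

```python
def rank_for_refinement(preds, k=64):
    """Keep the k unsolved predictions with the highest train pixel score.

    Assumes each prediction dict has 'train_is_correct' and
    'train_pixel_score' fields, matching the metrics in the table above.
    """
    unsolved = [p for p in preds if not p["train_is_correct"]]
    return sorted(unsolved, key=lambda p: p["train_pixel_score"], reverse=True)[:k]
```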
Conclusion
I have tried to refine predictions with the BARC induction model but results did not improve over just making independent predictions.
| experiment | pass@128 |
|---|---|
| baseline (no refinement) | 16.3% |
| refine predictions | 16.0% |
Frontier models benefit from refining their predictions, but this 8B model does not. The model was fine-tuned only to make predictions, not to refine them. That ability could very likely be developed with reinforcement learning.
Next steps
Focus on RL and search-and-learn. There is no more time for refinement.
TODO
- How much memory is needed to do refinement? Estimate the number of necessary tokens and try with vLLM
- ~~Collect predictions from previous experiments~~ I found that I wasn't saving all the required information.
- Modify search-and-learn to save the required information
- Create a notebook to experiment with solution refinement