Iteration 28. Refine predictions
01-09-2025
Goal
Study whether asking the model to refine its predictions is helpful
Motivation
All the evolutionary search approaches use the model to refine their most promising solutions.
I want to explore:
- Does a GPU with 25GB of VRAM allow refining predictions with the BARC induction model?
- How much improvement do we get compared to doing independent predictions?
Development
Estimate the number of tokens
Without refinement, the longest tasks are those with 4 training samples whose inputs and outputs have shape 30x30, plus a 30x30 test input. Counting the newline token at the end of each row, the grids alone account for 30*31*(4*2+1) = 8370 tokens.
In my case, adding the prompt increases the token count to 8650.
When we refine, the prompt additionally has to include:
- Code generated by the model: 1000 tokens max
- Outputs of the training samples: 3720 tokens max (4 grids of 930 tokens each)
Thus, without counting any message in the prompt, that is 13090 tokens. Being conservative, we could request 13500 tokens for the refinement prompt, and a total sequence length of 14500 tokens given that we allow the model to predict 1000 tokens.
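As a sanity check, here is the arithmetic in one place (a minimal sketch; the per-grid cost assumes one token per cell plus one newline token per row, as above):

```python
# Back-of-the-envelope token budget for refinement.
ROWS, COLS = 30, 30
tokens_per_grid = ROWS * (COLS + 1)                 # 930: one token per cell + newline per row

n_train = 4
task_tokens = (n_train * 2 + 1) * tokens_per_grid   # 8370: train pairs + test input
code_tokens = 1000                                  # model-generated code, upper bound
output_tokens = n_train * tokens_per_grid           # 3720: code outputs on the train inputs

refine_prompt = task_tokens + code_tokens + output_tokens
print(refine_prompt)   # 13090; rounded up to 13500 to be conservative
print(13500 + 1000)    # 14500 total sequence length, allowing 1000 generated tokens
```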
How much VRAM is needed for a 14500-token sequence length?
When using Unsloth I need 0.75 of the 3090's VRAM to make those predictions; with vLLM, 0.5 is enough.
If I don't quantize the model to 4-bit, then I need at least 0.8 of the memory with vLLM.
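For reference, a minimal sketch of how the vLLM engine could be configured for this budget. The model id and the bitsandbytes settings are assumptions, not the exact setup used here:

```python
from vllm import LLM, SamplingParams

# Sketch of a vLLM setup for 14500-token sequences on a 24GB card.
# Model id is assumed (the public BARC induction checkpoint);
# quantization/load_format use vLLM's bitsandbytes 4-bit path.
llm = LLM(
    model="barc0/Llama-3.1-ARC-Potpourri-Induction-8B",
    max_model_len=14500,             # ~13500 prompt + 1000 generated tokens
    gpu_memory_utilization=0.5,      # 0.8 if the model is not 4-bit quantized
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)
sampling_params = SamplingParams(max_tokens=1000, temperature=0.8)
```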
Experiment design
The easiest experiment is to create a notebook that only does solution refinement. This implies the solutions already need to be generated and saved to disk. Probably the easiest way is to reuse predictions from the search-and-learn experiments.
I could select n random unsolved predictions for each task and compare the accuracy against a baseline that does not use prediction refinement, as in the sketch below.
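A minimal sketch of that selection step; `predictions_by_task` and the `train_is_correct` key are hypothetical stand-ins for whatever the search-and-learn runs actually save:

```python
import random

def select_predictions_to_refine(predictions_by_task, n=8, seed=7):
    """Pick up to n random unsolved predictions per task.

    predictions_by_task: dict mapping task_id -> list of prediction dicts
    with a (hypothetical) boolean 'train_is_correct' field.
    """
    rng = random.Random(seed)
    selected = {}
    for task_id, preds in predictions_by_task.items():
        unsolved = [p for p in preds if not p["train_is_correct"]]
        selected[task_id] = rng.sample(unsolved, min(n, len(unsolved)))
    return selected
```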
Generate predictions to refine
```bash
export N_PREDICTIONS=8; python scripts/search_and_learn_with_unsloth.py \
    --output-dir /mnt/hdd0/Kaggle/arc25/trainings/2025-10-08-generate-predictions-to-refine/${N_PREDICTIONS}i \
    --initial-predictions ${N_PREDICTIONS}
```
Results
| initial predictions | refinement predictions | valid code | valid outputs | unique outputs | train_pixel_score | train_correct_grids | train_pass_rate | train_is_correct | test_pixel_score | test_correct_grids | test_pass_rate | test_is_correct | is_correct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 128 | 0 | 99.9% | 71.7% | 49.8% | 42.1% | 2.4% | 1.6% | 16.3% | 40.9% | 2.0% | 2.0% | 23.0% | 16.3% |
| 64 | 64 | 99.7% | 74.0% | 43.7% | 45.8% | 2.1% | 1.1% | 16.5% | 44.4% | 1.7% | 1.6% | 21.5% | 16.0% |
The baseline makes 128 predictions per task; the contender makes 64 initial predictions, selects the most promising ones (those that didn't solve the train set), and refines them.
The table shows no clear difference between the approaches: both solve almost the same share of tasks, around 16%.
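For context, "most promising" can be read as ranking the unsolved predictions by train pixel score and refining the top ones; a hedged sketch (the field names mirror the table columns but are hypothetical as code):

```python
def rank_for_refinement(preds, k=64):
    """Keep the k unsolved predictions with the highest train pixel score.

    Assumes each prediction dict has 'train_is_correct' and
    'train_pixel_score' fields, matching the metrics in the table above.
    """
    unsolved = [p for p in preds if not p["train_is_correct"]]
    return sorted(unsolved, key=lambda p: p["train_pixel_score"], reverse=True)[:k]
```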
Conclusion
I have tried to refine predictions with the BARC induction model but results did not improve over just making independent predictions.
| experiment | pass@128 |
|---|---|
| baseline (no refinement) | 16.3% |
| refine predictions | 16.0% |
Frontier models benefit from refining their predictions, but this 8B model does not. The model was fine-tuned only to make predictions, not to refine them. That ability could very likely be developed with reinforcement learning.
Next steps
Focus on RL and search-and-learn. There is no more time for refinement.
TODO
- How much memory is needed to do refinement? Estimate the number of necessary tokens and try with vLLM
- ~~Collect predictions from previous experiments~~ I found that I wasn't saving all the required information.
- Modify search-and-learn to save the required information
- Create a notebook to experiment with solution refinement