
Iteration 17. Increase search diversity

30-07-2025

Goal

Can I increase search diversity by introducing variations in the prompt?

Motivation

In the previous iteration I tried to increase diversity by feeding previously generated functions to the model in the prompt and asking for new approaches. The effect was the opposite: diversity was reduced. LLMs (at least small LLMs) struggle with negation.

In this iteration I will take a different approach and instead add variation to the prompt (see the sketch after this list):

  • Add hints about which DSL functions to use
  • Shuffle the order of the training samples
  • Prompt variations (I could prepare different prompts with LLMs)
  • Data augmentation
  • Temperature or other sampling parameters
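
As a rough illustration, here is a minimal sketch of how some of these variations could be combined when building prompts. Everything in it is hypothetical: `format_grid`, the `DSL_HINTS` list, and the prompt template are placeholders, not the actual code from the notebook.

```python
import random

# Hypothetical DSL function names used as hints (placeholders, not the real DSL)
DSL_HINTS = ["rot90", "flip_horizontal", "crop_background", "replace_color"]

def format_grid(grid):
    # Render a grid as rows of digits, one row per line
    return "\n".join("".join(str(cell) for cell in row) for row in grid)

def build_prompt(train_samples, rng, use_hints=True, shuffle=True):
    samples = list(train_samples)
    if shuffle:
        rng.shuffle(samples)  # vary the order of the training samples
    parts = []
    if use_hints:
        # Suggest a random subset of DSL functions to nudge the model
        hints = rng.sample(DSL_HINTS, k=2)
        parts.append(f"Hint: consider using {', '.join(hints)}.")
    for inp, out in samples:
        parts.append(f"Input:\n{format_grid(inp)}\nOutput:\n{format_grid(out)}")
    parts.append("Write a python function that implements the transformation.")
    return "\n\n".join(parts)

rng = random.Random(0)
train = [([[1, 0], [0, 1]], [[0, 1], [1, 0]])]
print(build_prompt(train, rng))
```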

The goal is to maximize diversity while also measuring generation speed, because generating multiple predictions with the same prompt is more efficient than generating each prediction with a different prompt.
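
For context on that efficiency gap: with an engine like vLLM (an assumption on my side; the notebook may use a different inference stack), sampling n completions from one prompt pays the prompt prefill cost once, while n different prompts pay it n times.

```python
from vllm import LLM, SamplingParams

# Assuming vLLM as the inference engine (the actual stack may differ).
# Sampling n completions from a single prompt reuses the prompt's KV cache,
# so only one prefill is needed; n distinct prompts require n prefills.
llm = LLM(model="Qwen/Qwen2.5-0.5B")  # hypothetical model choice

# 1 prefill, 8 samples sharing it
same_prompt = llm.generate(
    ["<prompt>"], SamplingParams(n=8, temperature=1.0, max_tokens=512))

# 8 prefills, 1 sample each
varied_prompts = llm.generate(
    [f"<prompt variant {i}>" for i in range(8)],
    SamplingParams(n=1, temperature=1.0, max_tokens=512))
```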

Development

The work is done in the notebook 009_search_with_base_models.

Results

I made 8 predictions for each of the 400 training tasks from ARC-AGI-1. The metric of interest is the number of unique outputs, which is the best way to measure diversity.
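
As a reference, here is a minimal sketch of how this metric can be computed. The data layout (`predictions` mapping task ids to lists of predicted grids) is an assumption for illustration, not the notebook's actual structure.

```python
def unique_output_ratio(predictions):
    """Fraction of valid predicted grids that are distinct within each task."""
    total, unique = 0, 0
    for task_id, grids in predictions.items():
        valid = [g for g in grids if g is not None]
        total += len(valid)
        # Grids are unhashable lists of lists, so serialize them to tuples
        unique += len({tuple(map(tuple, g)) for g in valid})
    return unique / total if total else 0.0

preds = {"task1": [[[1, 0]], [[1, 0]], [[0, 1]]]}  # 3 predictions, 2 unique
print(unique_output_ratio(preds))  # 0.666...
```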

| experiment | valid code | valid outputs | unique outputs | dsl usage | pixel similarity | correct grids | solved tasks | inference time (s) | unique ratio |
|---|---|---|---|---|---|---|---|---|---|
| baseline | 99.7% | 78.6% | 50.8% | 56.3% | 53.0% | 1.9% | 3.0% | 667 | 64.64% |
| shuffle train samples | 98.5% | 74.1% | 48.4% | 56.8% | 53.2% | 2.0% | 3.3% | 1591 | 65.29% |
| prompt variations (8) | 98.8% | 73.7% | 55.4% | 41.1% | 51.5% | 1.9% | 3.0% | 2162 | 75.19% |
| dsl suggestions | 98.9% | 71.7% | 40.8% | 70.3% | 54.0% | 1.5% | 2.3% | 1706 | 56.91% |
| data augmentation | 98.2% | 71.5% | 46.3% | 57.3% | 52.4% | 1.8% | 3.3% | 1699 | 64.80% |
| data augmentation + shuffle train samples | 98.2% | 72.0% | 45.4% | 57.8% | 52.9% | 1.7% | 3.0% | 1674 | 63.09% |
| shuffle train samples + remove last train sample | 98.5% | 76.4% | 46.8% | 61.3% | 54.1% | 1.8% | 3.0% | 1336 | 61.33% |
| 2 functions per prediction | 99.1% | 68.5% | 46.2% | 55.9% | 52.2% | 1.7% | 3.3% | 1058 | 67.44% |
| 4 functions per prediction | 98.0% | 66.4% | 45.9% | 57.8% | 50.1% | 1.2% | 1.5% | 683 | 69.11% |
| 8 functions per prediction | 97.6% | 73.5% | 33.8% | 36.5% | 46.4% | 0.8% | 0.5% | 535 | 45.92% |
  • None of the experiments yielded an improvement in output diversity. With prompt variations the unique outputs rise by about 5 points (50.8% → 55.4%), but at the cost of roughly 15 points less DSL usage (56.3% → 41.1%).
  • Making multiple inferences with the same prompt is faster and yields more variability than the other techniques tried.
  • I find it strange that neither data augmentation nor shuffling the train samples increases diversity.

I have to keep in mind that the model can write different functions that generate the same output. Thus the relation between this metric and the generated code is complex and requires understanding the effect of the code on the input data. The toy example below illustrates this.
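
For example (a toy illustration using numpy, not code from the notebook), these two functions are textually different but always produce the same output, so they would count as a single unique output:

```python
import numpy as np

def f1(grid):
    return np.rot90(grid, k=2)          # rotate 180 degrees

def f2(grid):
    return np.flipud(np.fliplr(grid))   # flip both axes == rotate 180 degrees

g = np.array([[1, 0], [0, 2]])
assert (f1(g) == f2(g)).all()  # different code, identical behavior
```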

Conclusion

Surprisingly, none of the tried techniques was able to increase the diversity of the outputs. I'm afraid I don't yet have a method to exhaustively search the solution space.

Next steps

  • How can we effectively explore the search space?
    • FunSearch
    • AlphaCode (Google DeepMind's AlphaCode shows that this simple sample-and-cluster pipeline yields >90% unique clusters even with off-the-shelf Transformer samplers; see the clustering sketch below)
  • Another source of variability would be to use multiple LLMs
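
For reference, a minimal sketch of AlphaCode-style behavioral clustering (my reading of the paper, not its actual implementation): programs are grouped by the outputs they produce on a set of probe inputs, and one representative is kept per cluster, largest clusters first. The `run` helper that executes a candidate program on one input is assumed.

```python
from collections import defaultdict

def cluster_by_behavior(programs, probe_inputs, run):
    """Group programs by their outputs on probe inputs; keep one per cluster."""
    clusters = defaultdict(list)
    for prog in programs:
        # The behavioral signature is the tuple of outputs on the probes;
        # repr() makes arbitrary grid outputs hashable
        signature = tuple(repr(run(prog, x)) for x in probe_inputs)
        clusters[signature].append(prog)
    # Rank clusters by size and return one representative from each
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [group[0] for group in ranked]
```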