# Iteration 14. Optimize inference
26-06-2025
## Goal
Is there room for improvement with the current approach if I modify the inference to better explore the solution space?
## Motivation
In Iteration 12, while solving a few ARC tasks, I saw very little exploration of the solution space. I want to try different ideas to see if the problem can be fixed without redesigning the training and the strategy.
## Development
## Results
### Max sequence length
I could generate and train on sequences of up to 32k tokens with a GPU with 24 GB of VRAM.
#### Inference
I have verified that all the Qwen-Coder models listed below can generate 32k tokens on a 4090 GPU. They differ in speed and GPU utilization.
| model | tokens/s |
|---|---|
| 0.5B | 49.8 |
| 1.5B | 39.6 |
| 3B | 27.4 |
| 7B | 18.6 |
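
These numbers come from timing plain long generations. A minimal sketch of such a benchmark with `transformers` is shown below; the checkpoint name, prompt and generation settings are assumptions for illustration, not the exact script behind the table.

```python
# Minimal generation-speed benchmark (illustrative; model name and prompt are assumptions)
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Coder-0.5B-Instruct"  # swap for the 1.5B / 3B / 7B variants
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda")

prompt = "Write a python function that solves the following task:"  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.time()
output = model.generate(**inputs, max_new_tokens=32000, do_sample=True)
elapsed = time.time() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"Generated {new_tokens} tokens in {elapsed:.0f}s -> {new_tokens / elapsed:.1f} tokens/s")
```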
#### Training
If we use Liger kernels and gradient checkpointing we can train the 0.5B model on a GPU with 24 GB of VRAM with a sequence length of 32k tokens.
We can train with up to 32k tokens with the 3B model; for the 7B model we can only reach 16k tokens. Notice how the training speed decreases as the sequence length grows.
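
The pieces that make this fit in memory are the Liger kernel patch and activation checkpointing. A rough sketch of the memory-relevant setup is shown below, assuming the `liger-kernel` patch for Qwen2 models; the checkpoint name and the dummy 32k-token batch are placeholders, not the real training script.

```python
# Sketch of a 32k-token training step with Liger kernels + gradient checkpointing.
# The model name and the random batch are placeholders for illustration.
import torch
from liger_kernel.transformers import apply_liger_kernel_to_qwen2
from transformers import AutoModelForCausalLM

apply_liger_kernel_to_qwen2()  # patch Qwen2 layers with memory-efficient Liger kernels

model_name = "Qwen/Qwen2.5-Coder-0.5B-Instruct"  # hypothetical 0.5B checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda")
model.gradient_checkpointing_enable()  # trade extra compute for activation memory
model.config.use_cache = False         # the KV cache is not needed during training

# Dummy 32k-token batch just to exercise the memory footprint of one training step
input_ids = torch.randint(0, model.config.vocab_size, (1, 32000), device=model.device)
loss = model(input_ids=input_ids, labels=input_ids).loss
loss.backward()
print(f"loss={loss.item():.3f}, "
      f"peak VRAM={torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")
```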
### Better sampling parameters
We can play with the temperature or top_p to induce more variability in the predictions, but it is still not enough. In the best scenario the rate of unique predictions is just 17% (45/256).
#### Temperature
Increasing the temperature decreases the number of valid predictions, but it can also increase the number of unique predictions (there is a sweet spot).
#### Top_p
The same effect can be observed with top_p: there is a sweet spot for unique predictions.
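
The sweet spots come from a simple grid over the sampling parameters: generate many predictions per configuration and count how many are valid and how many are unique. The sketch below illustrates that loop, reusing the `model`, `tokenizer` and `inputs` from the inference sketch above; the grid values and the validity check (a bare syntax check) are assumptions, not the exact evaluation used.

```python
# Sketch of the temperature / top_p sweep; grid values and validity check are assumptions.
import ast


def is_valid_python(code: str) -> bool:
    """Treat a prediction as valid if it at least parses as python code."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


for temperature in [0.4, 0.7, 1.0, 1.3]:
    for top_p in [0.8, 0.9, 0.95, 1.0]:
        # In practice the 256 samples would be generated in smaller batches
        outputs = model.generate(
            **inputs,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            max_new_tokens=2048,
            num_return_sequences=256,
        )
        predictions = tokenizer.batch_decode(
            outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        n_valid = sum(is_valid_python(p) for p in predictions)
        n_unique = len(set(predictions))
        print(f"T={temperature} top_p={top_p}: {n_valid}/256 valid, {n_unique}/256 unique")
```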
## Conclusion
A GPU with 24 GB of VRAM is enough to run inference with a context window of 32k tokens, and we can train with 32k tokens for models up to 3B and with 16k tokens for the 7B model.
Playing with inference parameters was not enough to increase output diversity.
## TODO
- What is the max sequence length for training and inference on a GPU with 24GB of VRAM?
- Better sampling parameters. Could play with temperature, top_k and top_p to create more diverse samples. https://huggingface.co/docs/transformers/v4.52.3/en/main_classes/text_generation#transformers.GenerationConfig.temperature
- What if I give hints of how to solve the problem in the prompt? Is the model capable in that case?
- What if I have a multi-turn conversation with the model so it can improve its own code? (a rough sketch of this idea is shown after the list)
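
Regarding the last item, the multi-turn idea would look roughly like the loop below: generate code, run it on the training examples and feed the errors back to the model. It reuses the `model` and `tokenizer` from the earlier sketches, and the task prompt and the `run_code_on_train_examples` helper are hypothetical placeholders.

```python
# Hypothetical multi-turn refinement loop; the prompt and the execution helper are assumptions.
messages = [{"role": "user", "content": task_prompt}]  # task_prompt: the ARC task description
for turn in range(3):
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=2048, do_sample=True)
    code = tokenizer.decode(
        output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

    feedback = run_code_on_train_examples(code)  # hypothetical helper: runs the code, collects errors
    if feedback.solved:
        break
    messages.append({"role": "assistant", "content": code})
    messages.append(
        {"role": "user", "content": f"The code failed:\n{feedback.report}\nPlease fix it."})
```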



