# Iteration 39. Reduce vLLM RAM usage
11-10-2024
## Goal
If I can reduce the RAM usage of vLLM, ensembling would be easier in the Kaggle submission.
## Motivation
Currently I cannot parallelize everything in the submission pipeline because vLLM uses 50% of the RAM and the 2020 solution sometimes demands more than 50%.
## Development

I have been playing with vLLM parameters, and `swap_space` seems to be the one with the biggest effect on RAM usage. The documentation describes it as:

> CPU swap space size (GiB) per GPU.
These are the commands used to evaluate a checkpoint with different `swap_space` values:

```bash
export checkpoint_folder=/mnt/hdd0/Kaggle/arc24/models/20241007_batch_size/01_bs16_lr5e-5_Qwen2.5-0.5B-Instruct_10000steps_2gpus_8192msl/checkpoint-10000
rm /mnt/hdd0/Kaggle/arc24/evaluations/20241007_batch_size/01_bs16_lr5e-5_Qwen2.5-0.5B-Instruct_10000steps_2gpus_8192msl/checkpoint-10000/inference_evaluation_x009.json
python easy_inference_and_evaluation.py "${checkpoint_folder}" --dataset_path /mnt/hdd0/Kaggle/arc24/data/arc-agi_evaluation_challenges.json --predictions_per_task 9
```
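The same parameter can be set from Python. A minimal sketch, assuming vLLM's offline `LLM` API; the base model name is taken from the checkpoint above and is used here only for illustration:

```python
from vllm import LLM, SamplingParams

# swap_space is expressed in GiB per GPU; the default is 4.
# Setting it to 0 disables the CPU swap buffer, which is what
# produced the large RAM savings measured below.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # illustrative base model
    swap_space=0,
    tensor_parallel_size=2,  # assumption: 2 GPUs, as in the checkpoint name
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["2+2="], sampling_params)
print(outputs[0].outputs[0].text)
```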
## Results

### Local results

| swap_space | RAM usage (GB) | inference time (s) |
|---|---|---|
| 4 | 16 | 530 |
| 2 | 9.7 | 563 |
| 1 | 5.5 | 514 |
| 0 | 1.1 | 508 |
We have achieved an enormous RAM decrease without a significant effect on inference time or accuracy. These results were obtained on my PC, so I should repeat the experiments on Kaggle.
### Kaggle results

| swap_space | RAM usage | inference time (s) |
|---|---|---|
| 4 | 50% | 84 |
| 2 | 32% | 79 |
| 1 | 22% | 79 |
| 0 | 12% | 78 |
We see the same trend: RAM usage can be decreased dramatically while the inference time barely changes.
### Remember icecuber's RAM usage

In this notebook we can see that icecuber's solution sometimes uses up to 80% of the RAM. So if we want to run it in parallel with vLLM inference we will have to use `swap_space=0`.
## Conclusion

We have found that we can decrease the RAM usage of vLLM dramatically, apparently without any effect on accuracy or speed (at least for ARC inference).
## Next steps

- Create a new notebook for Omni-ARC inference that runs the 2020 solution in parallel (see the sketch below). As a first step it tries to solve the tasks using code. Then it does test-time fine-tuning on the remaining tasks.
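A hypothetical sketch of that notebook layout, not the actual implementation; `run_2020_solution.sh` and `run_omni_arc_inference` are placeholder names:

```python
import subprocess

def run_omni_arc_inference():
    """Placeholder for the Omni-ARC steps: first try to solve the tasks
    with code, then test-time fine-tune on the remaining tasks."""

# Launch the 2020 (icecuber) solution as a background process; with
# swap_space=0 both processes should fit in Kaggle's RAM budget.
icecuber = subprocess.Popen(["bash", "run_2020_solution.sh"])  # placeholder script

run_omni_arc_inference()  # vLLM inference runs in the main process

# Wait for the 2020 solution to finish before ensembling the predictions.
icecuber.wait()
```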
## TODO

- [ ]