
Iteration 33. Back to SmolLM

28-09-2024

Goal

Train with SmolLM models to see if we can reach similar accuracy to Qwen but with faster models.

Motivation

I have recently tried the new Llama 3.2 1B and it was better than Qwen, but slower. My intuition is that a small model trained for longer could reach the same accuracy as a bigger model, and the smaller model could then be test-time fine-tuned for more steps or make more predictions.

Development

In the SmolLM blog they say the following:

For all three models we use embedding tying and a context length of 2048 tokens. This context length can be further extended with some long context fine-tuning.

Let's see if we can really train the models with a bigger context length and have them work well at inference.

I'm going to go directly for the smallest model, SmolLM-135M-Instruct: there is also a 360M-parameter model, but that is very close to Qwen's 500M.

Tokenizer analysis

Local experiments

Click to see bash commands
# baseline, 492 seconds, 4.9 seconds/it
python fine-tuning.py \
--model_path /home/gbarbadillo/data/SmolLM-135M-Instruct \
--lora_r 32 \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/train_rs7.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/val_rs7.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20240928_debug_SmolLM/01_baseline \
--max_seq_len 10240 \
--device_map None \
--max_steps 100 \
--logging_steps 10 \
--batch_size 16 \
--verbose \
--learning_rate 1e-4

# Try to increase per_device_train_batch_size but get OOM
python fine-tuning.py \
--model_path /home/gbarbadillo/data/SmolLM-135M-Instruct \
--lora_r 32 \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/train_rs7.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/val_rs7.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20240928_debug_SmolLM/02_bs2 \
--max_seq_len 10240 \
--device_map None \
--max_steps 100 \
--logging_steps 10 \
--batch_size 16 \
--verbose \
--learning_rate 1e-4 \
--per_device_train_batch_size 2

# train on a single gpu, 338s, this uses ~21GB of VRAM, 3.3 seconds per iteration
export CUDA_VISIBLE_DEVICES=0
python fine-tuning.py \
--model_path /home/gbarbadillo/data/SmolLM-135M-Instruct \
--n_gpus 1 \
--lora_r 32 \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/train_rs7.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/val_rs7.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20240928_debug_SmolLM/03_1gpu \
--max_seq_len 10240 \
--device_map None \
--max_steps 100 \
--logging_steps 10 \
--batch_size 16 \
--verbose \
--learning_rate 1e-4

# Reduce the msl to 2048, now it only uses 7GB of VRAM, 294s, 2.9 seconds per iteration
export CUDA_VISIBLE_DEVICES=0
python fine-tuning.py \
--model_path /home/gbarbadillo/data/SmolLM-135M-Instruct \
--n_gpus 1 \
--lora_r 32 \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/train_rs7.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/val_rs7.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20240928_debug_SmolLM/04_1gpu_2048msl \
--max_seq_len 2048 \
--device_map None \
--max_steps 100 \
--logging_steps 10 \
--batch_size 16 \
--verbose \
--learning_rate 1e-4

# 186 seconds, 1.8 seconds per step
export CUDA_VISIBLE_DEVICES=0
python fine-tuning.py \
--model_path /home/gbarbadillo/data/SmolLM-135M-Instruct \
--n_gpus 1 \
--lora_r 32 \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/train_rs7.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/val_rs7.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20240928_debug_SmolLM/05_1gpu_2048msl_pdbs2 \
--max_seq_len 2048 \
--device_map None \
--max_steps 100 \
--logging_steps 10 \
--batch_size 16 \
--verbose \
--learning_rate 1e-4 \
--per_device_train_batch_size 2

It is training at a speed of 1.8 seconds per step on a single GPU with a max_seq_len of 2048. For reference, Qwen trained at 6 seconds per step and Llama at 9 when trained on 2 GPUs. So potentially we are looking at a speedup of 6-7x. If we can train SmolLM to an accuracy similar to Qwen's, this would be a game changer.
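A rough back-of-the-envelope check of that number, normalizing by GPU-seconds per step (the normalization is my way of comparing the 2-GPU and 1-GPU runs):

# numbers from the runs above
qwen_gpu_seconds_per_step = 6.0 * 2    # 6 s/step on 2 GPUs
smollm_gpu_seconds_per_step = 1.8 * 1  # 1.8 s/step on 1 GPU
print(qwen_gpu_seconds_per_step / smollm_gpu_seconds_per_step)  # ~6.7x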

How to increase the context length

It seems that the rope theta parameter determines the original context length. If a longer context is needed, it seems that everybody uses rope_scaling.
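As a reference, this is a minimal sketch of how the context could be extended with transformers, assuming SmolLM follows the standard Llama-style config fields (the factor of 5 is just the ratio 10240/2048; this is not necessarily what fine-tuning.py does internally):

from transformers import AutoConfig, AutoModelForCausalLM

model_path = '/home/gbarbadillo/data/SmolLM-135M-Instruct'
config = AutoConfig.from_pretrained(model_path)
# linear RoPE scaling: stretch the original 2048-token window by a factor of 5 to reach ~10240 tokens
config.rope_scaling = {'type': 'linear', 'factor': 5.0}
config.max_position_embeddings = 10240
model = AutoModelForCausalLM.from_pretrained(model_path, config=config)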

Click to see bash commands
# train on a single gpu, 338s, this uses ~21GB of VRAM, 3.3 seconds per iteration
export CUDA_VISIBLE_DEVICES=0
python fine-tuning.py \
--model_path /home/gbarbadillo/data/SmolLM-135M-Instruct \
--n_gpus 1 \
--no-use_lora \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/train_rs7.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/val_rs7.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20240928_debug_SmolLM_context_window/01_baseline-full-fine-tuning \
--max_seq_len 10240 \
--device_map None \
--max_steps 100 \
--logging_steps 10 \
--batch_size 16 \
--verbose \
--learning_rate 4e-4

export CUDA_VISIBLE_DEVICES=0
python fine-tuning.py \
--model_path /home/gbarbadillo/data/SmolLM-135M-Instruct \
--n_gpus 1 \
--no-use_lora \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/train_rs7.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/val_rs7.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20240928_debug_SmolLM_context_window/02_change-model-config \
--max_seq_len 10240 \
--device_map None \
--max_steps 100 \
--logging_steps 10 \
--batch_size 16 \
--verbose \
--learning_rate 4e-4

export CUDA_VISIBLE_DEVICES=0
python fine-tuning.py \
--model_path /home/gbarbadillo/data/SmolLM-135M-Instruct \
--n_gpus 1 \
--no-use_lora \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/train_rs7.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/val_rs7.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20240928_debug_SmolLM_context_window/03_change-model-config-longer \
--max_seq_len 10240 \
--device_map None \
--max_steps 1000 \
--warmup_ratio 1e-1 \
--logging_steps 10 \
--batch_size 16 \
--verbose \
--random_seed 7 \
--learning_rate 4e-4

export CUDA_VISIBLE_DEVICES=1
python fine-tuning.py \
--model_path /home/gbarbadillo/data/SmolLM-135M-Instruct \
--n_gpus 1 \
--no-use_lora \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/train_rs7.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/val_rs7.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20240928_debug_SmolLM_context_window/04_longer-baseline \
--max_seq_len 10240 \
--device_map None \
--max_steps 1000 \
--warmup_ratio 1e-1 \
--logging_steps 10 \
--batch_size 16 \
--verbose \
--random_seed 7 \
--learning_rate 4e-4

export CUDA_VISIBLE_DEVICES=0
python fine-tuning.py \
--model_path /home/gbarbadillo/data/SmolLM-135M-Instruct \
--n_gpus 1 \
--no-use_lora \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/train_rs7.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/val_rs7.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20240928_debug_SmolLM_context_window/05_rope-scaling-02 \
--max_seq_len 10240 \
--device_map None \
--max_steps 1000 \
--warmup_ratio 1e-1 \
--logging_steps 10 \
--batch_size 16 \
--verbose \
--random_seed 7 \
--learning_rate 4e-4

python fine-tuning.py \
--model_path /home/gbarbadillo/data/SmolLM-135M-Instruct \
--n_gpus 1 \
--no-use_lora \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/train_rs7.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/val_rs7.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20240928_debug_SmolLM_context_window/07_linear-rope-scaling-2-update-tokenizer \
--max_seq_len 10240 \
--device_map None \
--max_steps 1000 \
--warmup_ratio 1e-1 \
--logging_steps 10 \
--batch_size 16 \
--verbose \
--random_seed 7 \
--learning_rate 4e-4

AMD-Llama-135m

This model encodes each number independently, just like SmolLM.
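A quick way to verify the digit encoding (just a sketch, the exact token strings depend on the tokenizer version):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('amd/AMD-Llama-135m')
# if numbers are encoded independently we expect one token per digit
print(tokenizer.tokenize('0123456789'))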

It is not an instruct model and it does not have a chat template.

https://huggingface.co/docs/transformers/main/en/chat_templating#advanced-adding-and-editing-chat-templates
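Following that documentation, a chat template can be added manually to the tokenizer. A minimal sketch with a simplified ChatML-style template (not necessarily the one used in the experiments):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('amd/AMD-Llama-135m')
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>\n' }}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
)
# verify that the template renders as expected
print(tokenizer.apply_chat_template(
    [{'role': 'user', 'content': 'Hello'}], tokenize=False, add_generation_prompt=True))
tokenizer.save_pretrained('AMD-Llama-135m-with-chat-template')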

Current problems

The model_max_length field of the tokenizer does not seem to be saved correctly; I have fixed it manually.

It makes very long predictions, just like non-instruct Qwen models.

The baseline model that was simply fine-tuned does not work with vLLM because of the following error:

ValueError: User-specified max_model_len (10240) is greater than the derived max_model_len (max_position_embeddings=2048 or model_max_length=None in model's config.json). This may lead to incorrect model outputs or CUDA errors. To allow overriding this maximum, set the env var VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

If I modify the model configuration to increase the max_model_len it works, but it seems to be predicting special tokens all the time, because the predictions appear to be empty, and it takes 3599 seconds to make them. If I modify the inference script to just use 2048 it does the same thing with the predictions, but faster. Thus it appears that the model cannot work correctly without modifications.

I have manually set the number of GPUs to 1, because with two GPUs vLLM raises ValueError: Total number of attention heads (9) must be divisible by tensor parallel size (2) (from verify_with_parallel_config in vllm/config.py). A sketch of both workarounds is shown below.
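This is a minimal sketch of the workarounds mentioned above using the vLLM Python API (the model path is a placeholder; the environment variable comes from the error message itself):

import os
os.environ['VLLM_ALLOW_LONG_MAX_MODEL_LEN'] = '1'  # needed if max_model_len exceeds the value derived from config.json
from vllm import LLM

llm = LLM(
    model='/path/to/fine-tuned-AMD-Llama-135m',  # placeholder path
    max_model_len=2048,       # or 10240 together with the env var above
    tensor_parallel_size=1,   # 9 attention heads are not divisible by 2
)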

Results

Training metrics

Wandb metrics

As a reference, when training Qwen or Llama for 10k steps I could reach a train and validation loss of around 0.08. Training for 80k steps could reduce the train loss to 0.03, but the validation loss did not improve.

Training SmolLM reaches a minimum validation loss of 0.10 and a train loss of 0.058.

Validation metrics

| model | training steps | accuracy | pass_32 | vote_2 |
|-------|---------------:|---------:|--------:|-------:|
| Qwen2-0.5B | 10000 | 8.24% | 26.50% | 15.91% |
| Qwen2-0.5B-Instruct | 10000 | 8.25% | 26.75% | 15.91% |
| Qwen2-0.5B-Instruct | 80000 | 13.78% | 33.88% | 23.11% |
| Qwen2.5-0.5B | 10000 | 9.37% | 26.75% | 18.31% |
| Qwen2.5-0.5B-Instruct | 10000 | 8.98% | 26.00% | 17.93% |
| Llama-3.2-1B | 10000 | 10.25% | 29.00% | 19.88% |
| SmolLM-135M-Instruct | 40000 | 8.40% | 23.00% | 16.54% |
| SmolLM-135M-Instruct | 140000 | 8.58% | 24.12% | 16.88% |

Despite training for much longer, I have not been able to reach the accuracy of the Qwen models trained for just 10k steps.

Inference speed

Inference takes 1231 seconds on a single GPU; by comparison, Qwen takes 1809 seconds using two GPUs.

Conclusion

The smaller SmolLM model trains 6-7 times faster than Qwen, and inference is 3 times faster.

However, I have not been able to reach a validation accuracy similar to Qwen's, despite training for much longer.

Next steps

  • After fixing the problem with the tokenizer, I could now train the non-instruct Qwen2.5 model and have fast inference.
  • I could revisit this iteration and try other small models at the end of the challenge, once the data is fixed and I just have to try different models. I might also try different position-encoding variations.

TODO

  • What is the speedup when training?
  • Train a model for 10k steps to find the optimal learning rate -> 8e-4
  • Does the evaluation return comparable metrics to Qwen?
  • What is the speedup at inference?
  • Try to get the same metrics as Qwen by training for much longer, e.g. 160k steps
  • Another small LLM that also has a 2048 context window: https://huggingface.co/amd/AMD-Llama-135m