
Iteration 33. Back to SmolLM

28-09-2024

Goal

Train with SmolLM models to see if we can reach similar accuracy to Qwen but with faster models.

Motivation

I have recently tried the new Llama 3.2 1B and it was better than Qwen, but slower. My intuition is that a small model trained for longer could reach the same accuracy as a bigger model, and the smaller model could then be test-time fine-tuned for more steps or make more predictions.

Development

In the SmolLM blog they say the following:

For all three models we use embedding tying and a context length of 2048 tokens. This context length can be further extended with some long context fine-tuning.

Let's see if we can really train the models with a bigger context length and have them work well at inference.

I'm going to go directly for the smallest model, SmolLM-135M-Instruct: there is also a 360M-parameter model, but that is very close to Qwen's 500M.

Tokenizer analysis

Local experiments

Click to see bash commands
# baseline, 492 seconds, 4.9 seconds/it
python fine-tuning.py \
--model_path /home/gbarbadillo/data/SmolLM-135M-Instruct \
--lora_r 32 \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/train_rs7.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/val_rs7.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20240928_debug_SmolLM/01_baseline \
--max_seq_len 10240 \
--device_map None \
--max_steps 100 \
--logging_steps 10 \
--batch_size 16 \
--verbose \
--learning_rate 1e-4

# Try to increase per_device_train_batch_size but get OOM
python fine-tuning.py \
--model_path /home/gbarbadillo/data/SmolLM-135M-Instruct \
--lora_r 32 \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/train_rs7.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/val_rs7.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20240928_debug_SmolLM/02_bs2 \
--max_seq_len 10240 \
--device_map None \
--max_steps 100 \
--logging_steps 10 \
--batch_size 16 \
--verbose \
--learning_rate 1e-4 \
--per_device_train_batch_size 2

# train on a single gpu, 338s, this uses ~21GB of VRAM, 3.3 seconds per iteration
export CUDA_VISIBLE_DEVICES=0
python fine-tuning.py \
--model_path /home/gbarbadillo/data/SmolLM-135M-Instruct \
--n_gpus 1 \
--lora_r 32 \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/train_rs7.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/val_rs7.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20240928_debug_SmolLM/03_1gpu \
--max_seq_len 10240 \
--device_map None \
--max_steps 100 \
--logging_steps 10 \
--batch_size 16 \
--verbose \
--learning_rate 1e-4

# Reduce the msl to 2048, now it only uses 7GB of VRAM, 294s, 2.9 seconds per iteration
export CUDA_VISIBLE_DEVICES=0
python fine-tuning.py \
--model_path /home/gbarbadillo/data/SmolLM-135M-Instruct \
--n_gpus 1 \
--lora_r 32 \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/train_rs7.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/val_rs7.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20240928_debug_SmolLM/04_1gpu_2048msl \
--max_seq_len 2048 \
--device_map None \
--max_steps 100 \
--logging_steps 10 \
--batch_size 16 \
--verbose \
--learning_rate 1e-4

# 186 seconds, 1.8 seconds per step
export CUDA_VISIBLE_DEVICES=0
python fine-tuning.py \
--model_path /home/gbarbadillo/data/SmolLM-135M-Instruct \
--n_gpus 1 \
--lora_r 32 \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/train_rs7.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/val_rs7.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20240928_debug_SmolLM/05_1gpu_2048msl_pdbs2 \
--max_seq_len 2048 \
--device_map None \
--max_steps 100 \
--logging_steps 10 \
--batch_size 16 \
--verbose \
--learning_rate 1e-4 \
--per_device_train_batch_size 2

It is training at a speed of 1.8 seconds per step on a single GPU with a max_seq_len of 2048. For reference, Qwen trained at 6 seconds per step and Llama at 9 when trained on 2 GPUs. So potentially we are looking at a speedup of 6-7x. If we can train SmolLM to an accuracy similar to Qwen's, this would be a game changer.
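A rough back-of-the-envelope check of that number, normalizing by GPU-seconds per step (the normalization is my way of comparing the 2-GPU and 1-GPU runs):

# numbers from the runs above
qwen_gpu_seconds_per_step = 6.0 * 2    # 6 s/step on 2 GPUs
smollm_gpu_seconds_per_step = 1.8 * 1  # 1.8 s/step on 1 GPU
print(qwen_gpu_seconds_per_step / smollm_gpu_seconds_per_step)  # ~6.7x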

How to increase the context length

It seems that the rope theta parameter determines the original context length. If a longer context is needed, it seems that everybody uses rope_scaling.
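As a reference, this is a minimal sketch of how the context could be extended with transformers, assuming SmolLM follows the standard Llama-style config fields (the factor of 5 is just the ratio 10240/2048; this is not necessarily what fine-tuning.py does internally):

from transformers import AutoConfig, AutoModelForCausalLM

model_path = '/home/gbarbadillo/data/SmolLM-135M-Instruct'
config = AutoConfig.from_pretrained(model_path)
# linear RoPE scaling: stretch the original 2048-token window by a factor of 5 to reach ~10240 tokens
config.rope_scaling = {'type': 'linear', 'factor': 5.0}
config.max_position_embeddings = 10240
model = AutoModelForCausalLM.from_pretrained(model_path, config=config)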

Click to see bash commands
# train on a single gpu, 338s, this uses ~21GB of VRAM, 3.3 seconds per iteration
export CUDA_VISIBLE_DEVICES=0
python fine-tuning.py \
--model_path /home/gbarbadillo/data/SmolLM-135M-Instruct \
--n_gpus 1 \
--no-use_lora \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/train_rs7.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/val_rs7.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20240928_debug_SmolLM_context_window/01_baseline-full-fine-tuning \
--max_seq_len 10240 \
--device_map None \
--max_steps 100 \
--logging_steps 10 \
--batch_size 16 \
--verbose \
--learning_rate 4e-4

export CUDA_VISIBLE_DEVICES=0
python fine-tuning.py \
--model_path /home/gbarbadillo/data/SmolLM-135M-Instruct \
--n_gpus 1 \
--no-use_lora \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/train_rs7.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/val_rs7.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20240928_debug_SmolLM_context_window/02_change-model-config \
--max_seq_len 10240 \
--device_map None \
--max_steps 100 \
--logging_steps 10 \
--batch_size 16 \
--verbose \
--learning_rate 4e-4

export CUDA_VISIBLE_DEVICES=0
python fine-tuning.py \
--model_path /home/gbarbadillo/data/SmolLM-135M-Instruct \
--n_gpus 1 \
--no-use_lora \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/train_rs7.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/val_rs7.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20240928_debug_SmolLM_context_window/03_change-model-config-longer \
--max_seq_len 10240 \
--device_map None \
--max_steps 1000 \
--warmup_ratio 1e-1 \
--logging_steps 10 \
--batch_size 16 \
--verbose \
--random_seed 7 \
--learning_rate 4e-4

export CUDA_VISIBLE_DEVICES=1
python fine-tuning.py \
--model_path /home/gbarbadillo/data/SmolLM-135M-Instruct \
--n_gpus 1 \
--no-use_lora \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/train_rs7.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/val_rs7.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20240928_debug_SmolLM_context_window/04_longer-baseline \
--max_seq_len 10240 \
--device_map None \
--max_steps 1000 \
--warmup_ratio 1e-1 \
--logging_steps 10 \
--batch_size 16 \
--verbose \
--random_seed 7 \
--learning_rate 4e-4

export CUDA_VISIBLE_DEVICES=0
python fine-tuning.py \
--model_path /home/gbarbadillo/data/SmolLM-135M-Instruct \
--n_gpus 1 \
--no-use_lora \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/train_rs7.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/val_rs7.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20240928_debug_SmolLM_context_window/05_rope-scaling-02 \
--max_seq_len 10240 \
--device_map None \
--max_steps 1000 \
--warmup_ratio 1e-1 \
--logging_steps 10 \
--batch_size 16 \
--verbose \
--random_seed 7 \
--learning_rate 4e-4

python fine-tuning.py \
--model_path /home/gbarbadillo/data/SmolLM-135M-Instruct \
--n_gpus 1 \
--no-use_lora \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/train_rs7.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/val_rs7.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20240928_debug_SmolLM_context_window/07_linear-rope-scaling-2-update-tokenizer \
--max_seq_len 10240 \
--device_map None \
--max_steps 1000 \
--warmup_ratio 1e-1 \
--logging_steps 10 \
--batch_size 16 \
--verbose \
--random_seed 7 \
--learning_rate 4e-4

AMD-Llama-135m

This model encodes each number independently, just like SmolLM.
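A quick way to verify the digit encoding (just a sketch, the exact token strings depend on the tokenizer version):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('amd/AMD-Llama-135m')
# if numbers are encoded independently we expect one token per digit
print(tokenizer.tokenize('0123456789'))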

It is not an instruct model and it does not have a chat template.

https://huggingface.co/docs/transformers/main/en/chat_templating#advanced-adding-and-editing-chat-templates
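Following that documentation, a chat template can be added manually to the tokenizer. A minimal sketch with a simplified ChatML-style template (not necessarily the one used in the experiments):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('amd/AMD-Llama-135m')
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>\n' }}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
)
# verify that the template renders as expected
print(tokenizer.apply_chat_template(
    [{'role': 'user', 'content': 'Hello'}], tokenize=False, add_generation_prompt=True))
tokenizer.save_pretrained('AMD-Llama-135m-with-chat-template')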

Current problems

The model_max_length field of the tokenizer does not seem to be saved correctly; I have fixed it manually.

It makes very long predictions, just like non-instruct Qwen models.

The baseline model that was simply fine-tuned does not work with vLLM because of the following error:

ValueError: User-specified max_model_len (10240) is greater than the derived max_model_len (max_position_embeddings=2048 or model_max_length=None in model's config.json). This may lead to incorrect model outputs or CUDA errors. To allow overriding this maximum, set the env var VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

If I modify the model configuration to increase the max_model_len it works, but it seems to be predicting special tokens all the time, because the predictions appear to be empty, and it takes 3599 seconds to make them. If I modify the inference script to just use 2048 it does the same thing with the predictions, but faster. Thus it appears that the model cannot work correctly without modifications.

I have manually set the number of GPUs to 1, because with two GPUs vLLM raises ValueError: Total number of attention heads (9) must be divisible by tensor parallel size (2) (from verify_with_parallel_config in vllm/config.py). A sketch of both workarounds is shown below.
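This is a minimal sketch of the workarounds mentioned above using the vLLM Python API (the model path is a placeholder; the environment variable comes from the error message itself):

import os
os.environ['VLLM_ALLOW_LONG_MAX_MODEL_LEN'] = '1'  # needed if max_model_len exceeds the value derived from config.json
from vllm import LLM

llm = LLM(
    model='/path/to/fine-tuned-AMD-Llama-135m',  # placeholder path
    max_model_len=2048,       # or 10240 together with the env var above
    tensor_parallel_size=1,   # 9 attention heads are not divisible by 2
)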

Results

Training metrics

Wandb metrics

As a reference, when training Qwen or Llama for 10k steps I could reach a train and validation loss of around 0.08. Training for 80k steps could reduce the train loss to 0.03, but the validation loss did not improve.

Training SmolLM reaches a minimum validation loss of 0.10 and a train loss of 0.058.

Validation metrics

| model | training steps | accuracy | pass_32 | vote_2 |
|-------|---------------:|---------:|--------:|-------:|
| Qwen2-0.5B | 10000 | 8.24% | 26.50% | 15.91% |
| Qwen2-0.5B-Instruct | 10000 | 8.25% | 26.75% | 15.91% |
| Qwen2-0.5B-Instruct | 80000 | 13.78% | 33.88% | 23.11% |
| Qwen2.5-0.5B | 10000 | 9.37% | 26.75% | 18.31% |
| Qwen2.5-0.5B-Instruct | 10000 | 8.98% | 26.00% | 17.93% |
| Llama-3.2-1B | 10000 | 10.25% | 29.00% | 19.88% |
| SmolLM-135M-Instruct | 40000 | 8.40% | 23.00% | 16.54% |
| SmolLM-135M-Instruct | 140000 | 8.58% | 24.12% | 16.88% |

Despite training for much longer, I have not been able to reach the accuracy of the Qwen models trained for just 10k steps.

Inference speed

Inference takes 1231 seconds on a single GPU; by comparison, Qwen takes 1809 seconds using two GPUs.

Conclusion

The smaller SmolLM model trains 6-7 times faster than Qwen, and inference is 3 times faster.

However, I have not been able to reach a validation accuracy similar to Qwen's, despite training for much longer.

Next steps

  • After fixing the problem with the tokenizer, I could now train the non-instruct Qwen2.5 model and have fast inference.
  • I could revisit this iteration and try other small models at the end of the challenge, once the data is fixed and I just have to try different models. I might also try different position-encoding variations.

TODO

  • What is the speedup when training?
  • Train a model for 10k steps to find the optimal learning rate -> 8e-4
  • Does the evaluation return comparable metrics to Qwen?
  • What is the speedup at inference?
  • Try to get the same metrics as Qwen by training for much longer, e.g. 160k steps
  • Another small LLM that also has a 2048 context window: https://huggingface.co/amd/AMD-Llama-135m