# Iteration 38. Make non-instruct models great again
11-10-2024
## Goal
Modify the training of non-instruct models so they learn to stop the response at inference.
## Motivation

I have run experiments with non-instruct models in the past, but they do not stop the response at inference, so inference times are higher because they repeat the response over and over.
I have evidence that non-instruct models might give better results, but I have to find a way to train them correctly.
There must be an easy way to fix this.
## Development

### Experiment design

My idea is to fine-tune Qwen2.5-0.5B on a tiny dataset of just 5 samples. I will choose the smallest samples from the ARC tasks to train faster and keep the VRAM requirements low.
Then I will make inference and see if the model stops the responses or not.
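As an illustration, this is roughly how such a tiny dataset could be built (a hypothetical sketch: the source file path and the size heuristic are assumptions; only `smaller_5_tasks.json` is the file actually used by the training commands below).

```python
import json

# Select the 5 ARC training tasks with the fewest grid cells, so training is
# fast and the VRAM requirements stay low. The source path is an assumption.
with open('/mnt/hdd0/Kaggle/arc24/data/arc-agi_training_challenges.json') as f:
    tasks = json.load(f)

def task_size(task):
    # Total number of cells across all grids of the task (train and test).
    grids = [sample[key]
             for split in ('train', 'test')
             for sample in task.get(split, [])
             for key in ('input', 'output') if key in sample]
    return sum(len(grid) * len(grid[0]) for grid in grids)

smallest_ids = sorted(tasks, key=lambda task_id: task_size(tasks[task_id]))[:5]

with open('/mnt/hdd0/Kaggle/arc24/data/new_partitions/smaller_5_tasks.json', 'w') as f:
    json.dump({task_id: tasks[task_id] for task_id in smallest_ids}, f, indent=2)
```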
### Updating transformers and accelerate

The python environment is currently a little bit unstable due to the installations I did for omni-arc. I had to update both transformers and accelerate to make the training script work again on my computer.

```bash
pip install --upgrade transformers accelerate
```
### Trainings

Click to see bash commands

```bash
# baseline
python fine-tuning.py \
--model_path /home/gbarbadillo/data/Qwen2.5-0.5B \
--device_map None \
--lora_r 32 \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20241011_non-instruct_models/01_baseline \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/smaller_5_tasks.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/smaller_5_tasks.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--max_steps 100 \
--logging_steps 10 \
--batch_size 16 \
--learning_rate 1e-4 \
--verbose
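
# baseline launched with accelerate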
accelerate launch fine-tuning.py \
--model_path /home/gbarbadillo/data/Qwen2.5-0.5B \
--device_map None \
--lora_r 32 \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20241011_non-instruct_models/01_baseline_accelerate \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/smaller_5_tasks.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/smaller_5_tasks.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--max_steps 100 \
--logging_steps 10 \
--batch_size 16 \
--learning_rate 1e-4 \
--verbose
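
# baseline with the instruct model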
accelerate launch fine-tuning.py \
--model_path /home/gbarbadillo/data/Qwen2.5-0.5B-Instruct \
--device_map None \
--lora_r 32 \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20241011_non-instruct_models/02_baseline_instruct \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/smaller_5_tasks.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/smaller_5_tasks.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--max_steps 100 \
--logging_steps 10 \
--batch_size 16 \
--learning_rate 1e-4 \
--verbose
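
# full fine-tuning of the non-instruct model (no LoRA)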
accelerate launch fine-tuning.py \
--model_path /home/gbarbadillo/data/Qwen2.5-0.5B \
--device_map None \
--no-use_lora \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20241011_non-instruct_models/03_full-fine-tune \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/smaller_5_tasks.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/smaller_5_tasks.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--max_steps 100 \
--logging_steps 10 \
--batch_size 16 \
--learning_rate 1e-4 \
--verbose
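
# full fine-tuning, changing the pad token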
accelerate launch fine-tuning.py \
--model_path /home/gbarbadillo/data/Qwen2.5-0.5B \
--device_map None \
--no-use_lora \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20241011_non-instruct_models/04_full-fine-tune_change_pad_token \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/smaller_5_tasks.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/smaller_5_tasks.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--max_steps 100 \
--logging_steps 10 \
--batch_size 16 \
--learning_rate 1e-4 \
--verbose
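
# full fine-tuning, changed pad token and tokenizer saving bug fixed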
accelerate launch fine-tuning.py \
--model_path /home/gbarbadillo/data/Qwen2.5-0.5B \
--device_map None \
--no-use_lora \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20241011_non-instruct_models/05_full-fine-tune_change_pad_token_fix_tokenizer_bug \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/smaller_5_tasks.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/smaller_5_tasks.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--max_steps 100 \
--logging_steps 10 \
--batch_size 16 \
--learning_rate 1e-4 \
--verbose
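
# final experiment with LoRA (lora_r=32)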
accelerate launch fine-tuning.py \
--model_path /home/gbarbadillo/data/Qwen2.5-0.5B \
--device_map None \
--lora_r 32 \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20241011_non-instruct_models/06_final_experiment_with_lora \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/smaller_5_tasks.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/smaller_5_tasks.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--max_steps 100 \
--logging_steps 10 \
--batch_size 16 \
--learning_rate 1e-4 \
--verbose
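
# final experiment with LoRA (lora_r=128) and longer training (500 steps)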
accelerate launch fine-tuning.py \
--model_path /home/gbarbadillo/data/Qwen2.5-0.5B \
--device_map None \
--lora_r 128 \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20241011_non-instruct_models/07_final_experiment_with_lora_longer \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/new_partitions/smaller_5_tasks.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/new_partitions/smaller_5_tasks.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--max_steps 500 \
--logging_steps 10 \
--batch_size 16 \
--learning_rate 1e-4 \
--verbose
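
# inference and evaluation for every checkpoint-100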
for checkpoint_folder in /mnt/hdd0/Kaggle/arc24/models/20241011_non-instruct_models/*/checkpoint-100; do
python easy_inference_and_evaluation.py "${checkpoint_folder}" --dataset_path /mnt/hdd0/Kaggle/arc24/data/new_partitions/smaller_5_tasks.json --predictions_per_task 8
done
```

Training for 100 steps takes 7:46 minutes without accelerate; with accelerate it takes a little over 2 minutes.
## Results
I have made two improvements to the existing code:

- The Qwen tokenizer does not need to be resized; I just needed to change the `eos_token` to be the same as the instruct model's (see the sketch after this list).
- I have added the tokenizer to the train function so that it is saved in the checkpoint. The problem was that the original tokenizer was being saved instead of the modified one, so at inference there was a discrepancy between the model and the tokenizer.
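The snippet below is a minimal sketch of the `eos_token` fix, not the actual code in `fine-tuning.py`; the hub model ids and the output path are only for illustration.

```python
from transformers import AutoTokenizer

# Load the base and instruct tokenizers (hub ids used here for illustration).
base = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B')
instruct = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B-Instruct')

# The base model uses <|endoftext|> as eos_token while the instruct model uses
# <|im_end|>. Since <|im_end|> is already part of the base vocabulary, pointing
# the eos_token at it does not require resizing the model embeddings.
base.eos_token = instruct.eos_token
assert base.eos_token_id == instruct.eos_token_id

# Save the *modified* tokenizer with the checkpoint (or pass it to the trainer)
# so that inference uses the same eos_token that was used during training.
base.save_pretrained('/tmp/example_checkpoint')
```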
I have been able to fine-tune the non-instruct version without any problem and make inference correctly.
However, it seems that LoRA is not enough for the non-instruct version; if I want to use it, I have to fully fine-tune the model.
## Conclusion

I can now use non-instruct models, but I have to fully fine-tune them; LoRA alone is not enough. Thus it is unclear whether this will be useful.
## Next steps

## TODO
- Create a small dataset for training and validation
- Train, make inference and verify if it works