Iteration 44. Learn to use Strong Compute
22-10-2024
Goal
Learn to use the Strong Compute cluster.
Motivation
Strong Compute has graciously granted me compute credits to speed up my development during the last weeks of the challenge. That means I will have access to GPUs with 80 GB of memory and should be able to train much faster than with the Veridas cluster.
Development
Quick guide to connect to Strong Compute
- Go to the Strong Compute control panel and start a workstation.
- Start the VPN with `sudo wg-quick up wg0`.
- Connect to the workstation using VSCode.
- Once all the work is done, disconnect the VPN with `sudo wg-quick down wg0`.
- And stop the workstation.
Creating a Python environment for the experiments
I already have the requirements in the `requirements.txt` file, so I just have to clone the repo into the workstation. To do so, I have created an SSH key with `ssh-keygen` on the workstation and added the public key to GitHub.
```bash
cd ~/code/arc24
python3 -m virtualenv ~/envs/arc24
source ~/envs/arc24/bin/activate
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
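Once the environment is ready, a quick sanity check (my own addition, not part of the original setup) can confirm that the GPUs are visible and that flash-attn imports correctly:

```python
# Minimal sanity check for the freshly created environment.
import torch

print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"  GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GB")

try:
    import flash_attn
    print(f"flash-attn {flash_attn.__version__} imported correctly")
except ImportError as error:
    print(f"flash-attn is not available: {error}")
```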
First training
```toml
isc_project_id = "46f4672b-2489-457f-b302-eab855b36b70"
experiment_name = "first_arc24_training"
gpu_type = "24GB VRAM GPU"
gpus = 8
compute_mode = "burst"
output_path = "~/outputs/first_arc24_training"
command = "source ~/envs/arc24/bin/activate && cd ~/code/arc24/scripts && ~/jobs/job.sh"
burst_shape_priority_list = ["oblivus-mon1-h100n"]
```
Better configuration with the whole command:
```toml
isc_project_id = "46f4672b-2489-457f-b302-eab855b36b70"
experiment_name = "first_arc24_training_A100n_v2"
gpu_type = "24GB VRAM GPU"
gpus = 8
compute_mode = "burst"
dataset_id = "0cfd54a3-4096-494e-93d5-a073126e81e2"
output_path = "~/outputs"
burst_shape_priority_list = ["oblivus-mon1-a100n"]
command = '''
source ~/envs/arc24/bin/activate &&
source ~/jobs/secrets.sh &&
accelerate launch --num_processes 8 --num_machines 1 --mixed_precision bf16 --multi_gpu
~/code/arc24/scripts/fine-tuning.py
--max_steps 10000
--model_path=Qwen/Qwen2.5-0.5B-Instruct
--lora_r 64
--output_dir ${OUTPUT_PATH}/models/20241022_no_training/08_A100n_lora064-Qwen2.5-0.5B-Instruct_lr1e-4_bs16_10000steps_2gpus_8192msl
--n_gpus=8
--batch_size=16
--device_map None
--no-verbose
--compose_new_task_probability 0.5
--compose_new_task_weights 1 1 1 1
--max_seq_len 8192
--learning_rate=1e-4
--train_datasets ~/code/arc24/data/original_data/arc-agi_evaluation_challenges.json output-from-examples-v1
--train_datasets ~/code/arc24/data/external_data/kaggle.json output-from-examples-v1
--train_datasets ~/code/arc24/data/external_data/pqa-dataset-1k.json output-from-examples-v1
--train_datasets ~/code/arc24/data/external_data/neoeye_tama.json output-from-examples-v1
--train_datasets ~/code/arc24/data/external_data/MINI-ARC.json output-from-examples-v1
--train_datasets ~/code/arc24/data/original_data/arc-agi_evaluation_challenges.json input-from-inputs-v0
--train_datasets ~/code/arc24/data/external_data/kaggle.json input-from-inputs-v0
--train_datasets ~/code/arc24/data/external_data/pqa-dataset-1k.json input-from-inputs-v0
--train_datasets ~/code/arc24/data/external_data/neoeye_tama.json input-from-inputs-v0
--train_datasets ~/code/arc24/data/external_data/MINI-ARC.json input-from-inputs-v0
--val_dataset ~/code/arc24/data/original_data/arc-agi_training_challenges.json output-from-examples-v1
--remove_train_samples_to_fit_max_seq_len
--eval_steps 200
--warmup_ratio 1e-1'''
```
Initial idea
- Since my datasets are small, I believe I can work in the root folder.
- They have said that it only makes sense to use GPUs in multiples of 8.
- The H100 is newer and faster than the A100.
H100 vs A100
https://oblivus.com/pricing/
The NVLink machines are slightly more expensive than the PCIe ones. Multi-GPU training should be faster with NVLink, so it's probably better to avoid the PCIe machines.
The H100 is more expensive than the A100; we have to see if the speedup is worth it.
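As a quick back-of-the-envelope check, the H100 only pays off if its speedup over the A100 exceeds their price ratio. The hourly prices below are placeholders made up for illustration, not the actual Oblivus rates:

```python
# Break-even check: the H100 is only worth it if speedup > price ratio.
# The prices are hypothetical placeholders, not the real Oblivus rates.
a100_price_per_hour = 2.0   # assumed $/GPU/hour for the A100 NVLink machine
h100_price_per_hour = 3.0   # assumed $/GPU/hour for the H100 NVLink machine
measured_speedup = 1.0      # speedup of H100 over A100 measured on my trainings

price_ratio = h100_price_per_hour / a100_price_per_hour
if measured_speedup > price_ratio:
    print(f"The H100 is cheaper per training: {measured_speedup:.1f}x speedup > {price_ratio:.1f}x price ratio")
else:
    print(f"The A100 is the better deal: {measured_speedup:.1f}x speedup <= {price_ratio:.1f}x price ratio")
```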
Debugging burst errors
We can find a `.tar.zst` file in the exports folder. We should first copy it to a different folder, because the exports folder is a fused folder, and then we can untar it.
```bash
cp exports/183e895a-bbb1-4e3a-b9e8-f3ee02c5e5cb.tar.zst copied_exports
apt-get install -y zstd
tar --use-compress-program=unzstd -xvf 183e895a-bbb1-4e3a-b9e8-f3ee02c5e5cb.tar.zst
```
Copy big datasets
```bash
scp -P 51468 evaluation_v0.json root@192.168.127.70:~/code/arc24/data/verifier
scp -P 51468 training_v1.json root@192.168.127.70:~/code/arc24/data/verifier
scp -P 51468 /mnt/hdd0/Kaggle/arc24/data/rearc/v2/re-arc.json root@192.168.127.70:~/code/arc24/data/external_data
# other machine
scp -P 50022 /mnt/hdd0/Kaggle/arc24/data/verifier/evaluation_v0.json root@94.156.8.239:~/code/arc24/data/verifier
scp -P 50022 /mnt/hdd0/Kaggle/arc24/data/verifier/training_v1.json root@94.156.8.239:~/code/arc24/data/verifier
scp -P 50022 /mnt/hdd0/Kaggle/arc24/data/rearc/v2/re-arc.json root@94.156.8.239:~/code/arc24/data/external_data
```
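After copying such big JSON files, it can be worth checking that the transfer did not truncate them. A minimal check (my own addition, not part of the original workflow) is to load each file and print its size and number of entries:

```python
# Quick integrity check for the copied JSON datasets.
import json
from pathlib import Path

dataset_paths = [
    Path("~/code/arc24/data/verifier/evaluation_v0.json").expanduser(),
    Path("~/code/arc24/data/verifier/training_v1.json").expanduser(),
    Path("~/code/arc24/data/external_data/re-arc.json").expanduser(),
]

for path in dataset_paths:
    size_mb = path.stat().st_size / 1024**2
    with open(path) as f:
        data = json.load(f)  # raises an error if the file was truncated mid-transfer
    print(f"{path.name}: {size_mb:.1f} MB, {len(data)} entries")
```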
Continue from checkpoints
The trainings are unstable and fail without explanation with `strong_fail`. The good thing is that those failures do not cost me money; the bad thing is that they require me to babysit the experiments. However, it trains 6x faster, so it is worth it.
I believe I have to implement two scripts (see the sketch after the list):
- Copy checkpoints from failed experiments. It will copy the last checkpoint of a failed experiment to a common folder in the root directory. I will call this script manually.
- When starting a new training, check if there are checkpoints available and copy them. This way the training will continue from the last checkpoint.
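Below is a minimal sketch of what these two helpers could look like. The folder layout (`~/outputs/<experiment>` with `checkpoint-*` subfolders and a shared `~/shared_checkpoints` folder) is an assumption based on the usual HuggingFace Trainer output, not the actual implementation:

```python
# Sketch of the two checkpoint helpers (assumed folder layout, not the real scripts).
import re
import shutil
from pathlib import Path

SHARED_DIR = Path("~/shared_checkpoints").expanduser()  # assumed common folder in the root directory


def copy_last_checkpoint(experiment_output_dir: str) -> Path:
    """Copy the last checkpoint of a failed experiment to the shared folder."""
    output_dir = Path(experiment_output_dir).expanduser()
    checkpoints = sorted(
        output_dir.glob("**/checkpoint-*"),
        key=lambda p: int(re.search(r"checkpoint-(\d+)", p.name).group(1)),
    )
    if not checkpoints:
        raise FileNotFoundError(f"No checkpoints found in {output_dir}")
    last_checkpoint = checkpoints[-1]
    destination = SHARED_DIR / output_dir.name / last_checkpoint.name
    shutil.copytree(last_checkpoint, destination, dirs_exist_ok=True)
    return destination


def restore_checkpoint_if_available(output_dir: str) -> bool:
    """Before starting a new training, copy back any saved checkpoint so the
    training can resume from it (e.g. with resume_from_checkpoint=True)."""
    output_dir = Path(output_dir).expanduser()
    saved = SHARED_DIR / output_dir.name
    if not saved.exists():
        return False
    shutil.copytree(saved, output_dir, dirs_exist_ok=True)
    return True
```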
Useful commands:
```bash
for checkpoint in af05f40c-f0b0-417c-8285-bc77f2978c61 57f38e65-f2cd-4777-80cd-004dd6da45b6 191aed3c-9409-48b5-81d1-4ed738c9b44d e4c6b8dc-415a-41cc-9034-4b1a70b7fce4 a4891197-4253-4fb2-89e6-d1a1696b01a6 9fafa3c2-5272-46a8-87ea-19ae39f4de4b f4a9c4fe-46a4-490f-9eb9-fde3d9c991e8; do copy-checkpoint ${checkpoint}; done
for submission in *; do isc train ${submission}; done
```
Results
The first training with 8xA100 was 5 times faster than using 2xA6000.
However, in the following days I have been unable to run new trainings, and finally, after 2 days of struggling, I managed to run two jobs (on A100 and H100), but they were 4 times slower than the previous run and finished with `strong_error` after less than 2 hours of running.
Today is Sunday and I have been able to run 3 fast trainings and one slow one. I don't see any speed difference between the A100 and the H100. I have tried using a batch size of 2 per GPU, but it is not faster.
Conclusion
So far it seems that the Strong Compute cluster is very unstable. But we have seen that we can train at least 5 times faster on a machine with 8xA100, so we could go directly to a cloud provider and do a fast training if necessary.
Next steps
TODO
- Create a Python environment for the experiments
- Copy the data and code to the ISC (Instant Super Computer)
- Train a submission model with Strong Compute
- How much faster can I train?
- Differences between A100 and H100
- Low-precision training on H100: https://huggingface.co/docs/accelerate/en/usage_guides/low_precision_training
- Multi-line submit files in TOML: https://toml.io/en/