# Iteration 45. Improve the verifier approach
26-10-2024
## Goal
In the previous iteration we saw signs that the verifier approach might work. Let's try to improve it.
## Motivation
Having a more accurate method than voting to select predictions could improve the LB score.
## Development

### Create bigger datasets for training

#### Training set
By generating more wrong predictions I have increased the size of the training dataset from 48 MB to 130 MB. The mean number of wrong predictions per training sample has increased from 54 to 155, and the total number of wrong predictions has increased from 92k to 267k.
#### Evaluation set
I have created a first dataset with a mean of 163 wrong predictions per sample; the file weighs 260 MB.
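As a quick sanity check of these dataset files, something like the sketch below can report those statistics. The schema (a dict mapping each task id to its list of wrong predictions) is an assumption for illustration, not the actual file format:

```python
import json

def wrong_prediction_stats(filepath):
    # Assumed (hypothetical) schema: {task_id: [wrong_prediction_grid, ...], ...}
    with open(filepath) as f:
        data = json.load(f)
    counts = [len(wrong_predictions) for wrong_predictions in data.values()]
    total = sum(counts)
    print(f'{len(counts)} samples, {total} wrong predictions, '
          f'{total / len(counts):.1f} wrong predictions per sample on average')

wrong_prediction_stats('/mnt/hdd0/Kaggle/arc24/data/verifier/training_v0.json')
```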
### Add task augmentation to verification task
I have to refactor the code to enable task augmentation with the verification task, because currently it is only prepared for the `input` and `output` grids, not for the `wrong_prediction` grid.
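The core of the refactor is that whatever augmentation is applied to the `input` and `output` grids must also be applied to the `wrong_prediction` grid, otherwise the verification label would no longer match the sample. A minimal sketch of the idea (the function and field names are hypothetical, not the actual implementation):

```python
import random
import numpy as np

def augment_verification_sample(sample, rng=random):
    """Apply the same geometric and color augmentation to input, output and wrong_prediction."""
    n_rotations = rng.randint(0, 3)
    flip = rng.random() < 0.5
    color_map = list(range(10))
    rng.shuffle(color_map)  # a real implementation might keep the background color fixed

    def transform(grid):
        grid = np.rot90(np.array(grid), k=n_rotations)
        if flip:
            grid = np.fliplr(grid)
        return [[color_map[cell] for cell in row] for row in grid.tolist()]

    return {key: transform(value) if key in ('input', 'output', 'wrong_prediction') else value
            for key, value in sample.items()}
```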
Click to see bash commands

```bash
python fine-tuning.py \
--model_path /home/gbarbadillo/data/Qwen2.5-0.5B \
--device_map None \
--lora_r 128 \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20241026_debug_task_augmentation/01_baseline_no_task_augmentation \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/arc-agi_training_challenges.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/arc-agi_evaluation_challenges.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--max_steps 10 \
--logging_steps 1 \
--eval_steps 200 \
--batch_size 16 \
--learning_rate 1e-4 \
--max_seq_len 4096 \
--no-resume_from_checkpoint \
--random_seed 7 \
--verbose
python fine-tuning.py \
--model_path /home/gbarbadillo/data/Qwen2.5-0.5B \
--device_map None \
--lora_r 128 \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20241026_debug_task_augmentation/02_task_augmentation_refactor \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/arc-agi_training_challenges.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/arc-agi_evaluation_challenges.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--max_steps 10 \
--logging_steps 1 \
--eval_steps 200 \
--batch_size 16 \
--learning_rate 1e-4 \
--max_seq_len 4096 \
--no-resume_from_checkpoint \
--random_seed 7 \
--compose_new_task_probability 0.5 \
--verbose
python fine-tuning.py \
--model_path /home/gbarbadillo/data/Qwen2.5-0.5B \
--device_map None \
--lora_r 128 \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20241026_debug_task_augmentation/03_revert_refactor \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/arc-agi_training_challenges.json output-from-examples-v1 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/arc-agi_evaluation_challenges.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--max_steps 10 \
--logging_steps 1 \
--eval_steps 200 \
--batch_size 16 \
--learning_rate 1e-4 \
--max_seq_len 4096 \
--no-resume_from_checkpoint \
--random_seed 7 \
--compose_new_task_probability 0.5 \
--verbose
python fine-tuning.py \
--model_path /home/gbarbadillo/data/Qwen2.5-0.5B \
--device_map None \
--lora_r 128 \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20241026_debug_task_augmentation/04_verify_no_task_augmentation \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/verifier/training_v0.json verify-output-from-examples-v0 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/arc-agi_evaluation_challenges.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--max_steps 10 \
--logging_steps 1 \
--eval_steps 200 \
--batch_size 16 \
--learning_rate 1e-4 \
--max_seq_len 4096 \
--no-resume_from_checkpoint \
--random_seed 7 \
--compose_new_task_probability 0.0 \
--verbose
python fine-tuning.py \
--model_path /home/gbarbadillo/data/Qwen2.5-0.5B \
--device_map None \
--lora_r 128 \
--output_dir /mnt/hdd0/Kaggle/arc24/models/20241026_debug_task_augmentation/05_verify_with_task_augmentation \
--train_datasets /mnt/hdd0/Kaggle/arc24/data/verifier/training_v0.json verify-output-from-examples-v0 \
--val_dataset /mnt/hdd0/Kaggle/arc24/data/arc-agi_evaluation_challenges.json output-from-examples-v1 \
--grid_encoder "GridShapeEncoder(RowNumberEncoder(MinimalGridEncoder()))" \
--max_steps 10 \
--logging_steps 1 \
--eval_steps 200 \
--batch_size 16 \
--learning_rate 1e-4 \
--max_seq_len 4096 \
--no-resume_from_checkpoint \
--random_seed 7 \
--compose_new_task_probability 0.5 \
--verbose
```
### More efficient rank estimation using uncertainty
Currently I'm doing n verifications with all the predictions. For example, I have seen that 32 verifications per prediction can be enough to select the best predictions.
This works, but I'm only interested in selecting the best 2 predictions, and I'm using a lot of compute to get the ranking of all of them. If I estimate the uncertainty of the verification ratio of each prediction, I could discard wrong predictions early and focus the compute on verifying the most promising ones.
I also thought of using voting to break ties, but I won't have voting numbers for the 2020 solution, so I should focus on improving the efficiency of estimating the ranking with a verifier model.
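A minimal sketch of how the early discarding could work, assuming each verification returns a binary yes/no: compute a confidence interval for the verification ratio of each prediction and drop the ones whose upper bound falls below the lower bound of the current best. The Wilson score interval and the function names below are my own choice, not the current implementation:

```python
import math

def wilson_interval(successes, trials, z=1.96):
    # Wilson score interval for a binomial proportion (z=1.96 ~ 95% confidence)
    if trials == 0:
        return 0.0, 1.0
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - margin, center + margin

def prune_predictions(verifications):
    # verifications: {prediction_id: (n_yes, n_total)}. Keep only predictions whose
    # upper bound is not below the best lower bound, so the next verification round
    # spends compute only on the promising candidates.
    bounds = {pred: wilson_interval(yes, total) for pred, (yes, total) in verifications.items()}
    best_lower = max(lower for lower, _ in bounds.values())
    return [pred for pred, (_, upper) in bounds.items() if upper >= best_lower]
```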
### Probability of using a wrong prediction for training
The first implementation hardcodes the probability of using a wrong prediction for training the verifier to 50%, i.e. it uses a balanced mix of correct and wrong samples.
The problem with this approach is that we have around 1700 correct samples and around 270k wrong predictions. If we train for 8k steps with a batch size of 32, the model will have seen each correct sample on average 75 times, whereas it will have seen each wrong prediction on average only 0.5 times. It might make sense to decrease the frequency of the correct samples during training.
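A back-of-the-envelope helper (my own sketch, using the rough counts above) makes the trade-off explicit:

```python
def expected_views(steps, batch_size, correct_probability,
                   n_correct=1700, n_wrong=270_000):
    """Expected number of times each individual correct/wrong sample is seen during training."""
    total_samples = steps * batch_size
    return (total_samples * correct_probability / n_correct,
            total_samples * (1 - correct_probability) / n_wrong)

# Balanced 50% setting: each correct sample is seen ~75 times, each wrong prediction ~0.5 times
print(expected_views(steps=8000, batch_size=32, correct_probability=0.5))
# Lowering the probability to 0.2 reduces the repetition of the correct samples
print(expected_views(steps=8000, batch_size=32, correct_probability=0.2))
```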
## Results

### Confidence level and verification time
Without confidence-based early stopping it would take around 2300s (32 verifications max, 11440 verifications).

confidence | 80% | 90% | 95% |
---|---|---|---|
verifications | 2960 | 4168 | 5152 |
verifications | 2448 | 2816 | 3296 |
verifications | 1712 | 2480 | 2864 |
runtime | 938s | 1087s | 1159s |
Increasing the confidence from 80% to 95% takes 23% more time, which is probably worth it. I can reduce the time to around 1100 seconds if I do 4 predictions per round instead of 8.
With this setup I could use up to 128 verifications per prediction in just 2036 seconds.
### Does the verifier work on different models?
model | top 1 accuracy | top 2 accuracy |
---|---|---|
voting baseline | 60.00% | 70.00% |
model 1 | 62.90% | 80.40% |
model 2 | 55.90% | 72.80% |
model 3 | 59.00% | 79.10% |
We can see that the verifier works on different models with a similar level of accuracy. Voting accuracy was almost exactly the same across the 3 models. However, the current method does not seem to be better than voting when selecting the top 1 prediction.
### Does training on a bigger dataset improve the accuracy?
Top 1 accuracy table:
training steps | baseline | more wrong predictions |
---|---|---|
4000 | 55.80% | 54.60% |
8000 | 61.70% | 54.20% |
16000 | 57.10% | 62.50% |
Top 2 accuracy table:
training steps | baseline | more wrong predictions |
---|---|---|
4000 | 72.90% | 72.50% |
8000 | 77.10% | 69.60% |
16000 | 74.60% | 78.80% |
It is unclear if adding more wrong predictions was beneficial.
### Does using task augmentation improve the accuracy?
Top 1 accuracy table:
training steps | baseline | task augmentation |
---|---|---|
4000 | 54.60% | 52.50% |
8000 | 54.20% | 52.10% |
16000 | 62.50% | 54.20% |
Top 2 accuracy table:
training steps | baseline | task augmentation |
---|---|---|
4000 | 72.50% | 72.90% |
8000 | 69.60% | 72.10% |
16000 | 78.80% | 76.70% |
It is unclear if adding task augmentation improves the accuracy. In fact, in other experiments the results were worse.
### Can I achieve perfect accuracy if training on the evaluation set?
training steps | top_1 | top_2 |
---|---|---|
4000 | 63.30% | 81.20% |
8000 | 62.10% | 89.20% |
16000 | 70.40% | 93.80% |
32000 | 75% | 93.30% |
It is surprising that after 32k training steps the model still does not perfectly classify all the tasks from the evaluation set. After reviewing the failed predictions I have seen that in all the cases there were ties with other predictions.
On average it would have seen each task 320 times (16000/4/400*32), so if a task has 4 samples it would have seen each sample around 80 times.
### Should I change the probability of training with a correct prediction?
correct_probability | top_1 | top_2 |
---|---|---|
0.1 | 48.30% | 68.30% |
0.2 | 56.20% | 79.20% |
0.3 | 52.10% | 79.20% |
0.5 | 50.00% | 79.20% |
There is no evidence suggesting that decreasing the probability of using correct predictions gives higher accuracy.
### Does training for multiple tasks improve the accuracy?
Let's train new models from scratch:
- Add the new verify and select tasks, without task augmentation
- Qwen2.5
- Do the same also for submission, including the evaluation set
- Train for 40k steps with batch size 32.
model | lora_r | batch_size | training steps | top_2 accuracy | top_1 accuracy |
---|---|---|---|---|---|
Qwen2.5-0.5B | 64 | 32 | 40k | 78.8% | 57.1% |
Qwen2.5-0.5B | 96 | 32 | 40k | 72.1% | 47.1% |
NanoLM-0.3B-Instruct-v2 | 64 | 32 | 40k | 62.1% | 39.2% |
NanoLM-0.3B-Instruct-v2 | 128 | 32 | 40k | 63.7% | 40.4% |
SmolLM-135M-Instruct-20k | fft | 32 | 40k | 58.3% | 38.0% |
We don't see improvements when training a multi-task model. However, I have the feeling that these models are undertrained.
TODO: update results with the continuation of the trainings.
### Submission results
When using a model to verify the predictions from the LLM and the 2020 solution, I have only achieved a score of 33 when training on the whole ARC dataset, and 30 when training only on the training dataset.
## Conclusion
I have not been able to improve the accuracy of the prediction verifier: it is still around 60% for top_1 selection and 80% for top_2 selection. Remember that voting gets 60% and 70%, so we only see an improvement on top_2 selection.
## Next steps
- Could the verifier benefit from test-time fine-tuning?
- Could I improve the selection of predictions by using selection instead of verification? I might create a select script by tweaking the verify script.
- If I define some correctness metric over the predictions, that could open the door to a much bigger training dataset that wouldn't use the correct prediction over and over. It is unclear if this would work better.
## TODO
- Create bigger dataset for training
- Training set
- Evaluation set
- More data augmentation, allow task augmentation
- Maybe use an ensemble of models instead of a single model
- It's likely that a model trained on all the prediction tasks will perform better
- ~~Use voting to solve ties~~ I won't have voting for the 2020 solution.
- I could make more efficient use of compute by using uncertainty and only making verifications for the predictions that are not significantly different from the top prediction.
- Verify that it works on Kaggle
- Review all new code
- Experiments
- Does training on a bigger dataset improve the accuracy? IN PROGRESS
- Does using task augmentation improve the accuracy? IN PROGRESS
- Should I change the probability of training with a wrong prediction? IN PROGRESS
- Does training for multiple tasks improve the accuracy?
- Train new submission models
- Measure improvements over voting in other model predictions
- Maybe the model is not as accurate in the test set as in the evaluation set?
- Why did cheating not achieve perfect accuracy?
- How many verifications do I have to do until the perfect ranking is reached? 128 verifications do not reach significant differences; there are ties that prevent reaching the stopping point.