Iteration 20. Data augmentation with BARC
21-08-2025
Goal
Does using data augmentation increase the diversity of the predictions and improve the pass@n metric?
Warning
There was a bug in this iteration where only the training samples from each task were used. The conclusions are still valid, but go to Iteration 21 to see the results after fixing the bug.
Motivation
In a previous iteration with base models I found that data augmentation was not helpful. That result was surprising, so I want to repeat the experiments with BARC.
Development
The ARC-AGI-1 evaluation set is chosen
For these experiments I believe the evaluation set from ARC-AGI-1 has the greatest signal. The BARC model was able to solve around 20% of its tasks. The scores on the training set are not trustworthy because the model was trained on those or similar tasks, while the ARC-AGI-2 evaluation set is more difficult and only 2 out of 120 tasks were solved.
Experimental setup
The idea is to reuse all the data augmentation implemented in iteration 17. I will make predictions in batches of 8 or 16 predictions per task, and later aggregate all the predictions to estimate the accuracy of the system. I will have to save the data augmentation configuration alongside each prediction so that it can be undone when executing the code.
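As a rough illustration of this apply/undo bookkeeping (the function and field names below are hypothetical, not the actual iteration 17 implementation), a geometric augmentation can be sampled per prediction, stored next to the generated code, and inverted on the grid produced when that code is executed:

```python
import random
import numpy as np

def sample_augmentation():
    """Pick a random augmentation configuration."""
    return {
        "n_rot90": random.randint(0, 3),  # number of 90-degree rotations
        "flip": random.random() < 0.5,    # horizontal flip yes/no
    }

def apply_augmentation(grid, config):
    """Apply the augmentation to a single grid (numpy array)."""
    grid = np.rot90(grid, k=config["n_rot90"])
    if config["flip"]:
        grid = np.fliplr(grid)
    return grid

def undo_augmentation(grid, config):
    """Invert the augmentation on the grid returned by the predicted code."""
    if config["flip"]:
        grid = np.fliplr(grid)
    return np.rot90(grid, k=-config["n_rot90"])

# Each saved prediction would then look roughly like:
# {"task_id": ..., "code": generated_code, "augmentation": config}
# so undo_augmentation(output_grid, config) can be applied after execution.
```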
Results
Data augmentation improves the accuracy of the model by increasing the diversity of the predictions
The pass@n metric improves when using data augmentation. The difference gets bigger as the number of predictions grows. This could explain why my previous experiments with just 8 predictions per task did not show improvements.
| experiment | n_preds | valid code | valid outputs | unique outputs | pixel similarity | correct grids | pass_rate | pass@n |
|---|---|---|---|---|---|---|---|---|
| baseline | 568 | 100.0% | 75.9% | 40.9% | 57.1% | 3.0% | 1.96% | 21.00% |
| data augmentation | 584 | 100.0% | 76.5% | 44.4% | 56.4% | 2.9% | 1.98% | 24.50% |
This is probably caused by greater diversity in the outputs: the metric that measures unique outputs improves from 40.9% to 44.4%.
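As a minimal sketch of how such a unique-outputs metric can be computed (the data structure below is an assumption, not the actual code: predictions are taken to be stored as lists of output grids per task):

```python
# Fraction of distinct predicted output grids per task, averaged over tasks.
from typing import Dict, List

Grid = List[List[int]]

def unique_outputs_ratio(predictions: Dict[str, List[Grid]]) -> float:
    """predictions maps task_id -> list of predicted output grids."""
    ratios = []
    for grids in predictions.values():
        if not grids:
            continue
        # Grids are hashed as tuples of tuples to deduplicate them.
        distinct = {tuple(tuple(row) for row in grid) for grid in grids}
        ratios.append(len(distinct) / len(grids))
    return sum(ratios) / len(ratios) if ratios else 0.0
```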
Trustworthiness of the metrics
I have seen that we can only trust the metrics up to a number of predictions of around 1/4 of the total number of predictions run (at least for the pass@n metric). There is a bias towards underestimating the pass@n rate when the number of predictions is small.
Thus, when making comparisons between experiments, we should try to use a similar number of predictions.
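For reference, the standard unbiased pass@k estimator from the Codex/HumanEval paper makes this limitation explicit: it needs n sampled predictions with k ≤ n, and its variance grows as k approaches n. Below is a sketch of that estimator; it is not necessarily the exact computation used in these experiments.

```python
# Unbiased pass@k estimator (Chen et al., 2021): given n predictions for a
# task, c of which are correct, estimate the probability that at least one
# of k randomly chosen predictions is correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples with c correct ones."""
    if n - c < k:
        # Every subset of size k must contain a correct prediction.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: a task with 16 predictions, 1 of them correct.
print(pass_at_k(16, 1, 8))  # 0.5
```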
Distribution of output tokens
I was using max_tokens=2048 from the previous iterations and it seems to be a good value.
The median number of output tokens seems to be around 400, and we can see that the datasets are sorted by inference speed as expected. We could probably use 1024 output tokens without much impact on the results.
The important takeaway is that the current configuration is not hurting the accuracy of the model.
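A quick way to sanity-check this, assuming the number of output tokens per prediction is available from the inference logs (the function and field names below are illustrative, not the actual code):

```python
# Summarize the output-length distribution and estimate how many predictions
# would be cut off by a lower max_tokens setting.
import numpy as np

def output_token_stats(output_token_counts, candidate_max_tokens=1024):
    lengths = np.asarray(output_token_counts)
    return {
        "median_tokens": int(np.median(lengths)),
        "p95_tokens": int(np.percentile(lengths, 95)),
        # Fraction of predictions that would be truncated at the lower limit.
        "share_above_candidate_max": float((lengths > candidate_max_tokens).mean()),
    }
```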
Conclusion
Next steps
TODO
- Have a look at some of the solutions to verify they are legitimate implementations
- Document some of the predictions
- Check that I'm using the correct number of training samples. Maybe I should decouple from the Task object. Maybe I'm not giving all the training samples and making the problem harder. Indeed that is the case: I'm using just the training samples, so I have to fix that bug and repeat the experiments.
- Distribution of prediction length


