Iteration 23. More external data
12-09-2024
Goal
Is there more external data that can improve the accuracy of the model?
Motivation
Using RE-ARC allowed us to improve the LB score from 10 to 16. Data is really important.
Recently I noticed that Simon Strandgaard (kaggle) is creating data for the ARC challenge. I want to explore that data and also search for additional datasets.
Development
So far I have used RE-ARC, ConceptARC, 1D-ARC and MINI-ARC. Only RE-ARC and MINI-ARC showed significant improvements on the evaluation metrics.
Sources of information
- Simon ARC lab (neoneye) apparently hosts the code to generate datasets for ARC tasks.
- Neoneye tama dataset
- Simon datasets on huggingface
- Dataset viewer
- Multiple datasets for ARC
- ARC_synthetic_extend. At first glance it seems they have only changed the colors; this is not useful because we can already do that with data augmentation (see the sketch at the end of this section).
- IPARC. Simon says it is very hard and I agree; I have checked some examples and they were hard to understand.
- PQA: Perceptual Question Answering. This dataset looks interesting.
- ARC Community. The tasks are hard to understand.
- ARC dataset diva. The arc-dataset-diva is focused on tiny tasks, where 2 pixels go in, some transformation happens, and 2 pixels come out. Probably too small, like the 1D-ARC dataset.
- dbigham ARC tasks, 21 tasks. Some of them have uncertainty in the test sample.
- synth_riddles. I don't like them; I don't understand some of them.
- Small datasets available on Kaggle
- ARC Public resources Google Sheet
- Language-complete Abstraction and Reasoning Corpus (LARC) I could use this dataset to test if using language definition of the tasks is useful. A following step would be to use code.
- ARC gym: a data generation framework for the Abstraction & Reasoning Corpus
There is one weird thing: why doesn't Simon have his own data on the viewer?
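Color relabeling is something we can already cover with data augmentation. For reference, below is a minimal sketch of that augmentation, assuming the standard ARC JSON task format (a dict with train/test lists of input/output grids); keeping the background color fixed is my own assumption, not something specified by any of these datasets.

```python
import random


def permute_task_colors(task, keep_background=True):
    """Apply one random color permutation to every grid of an ARC task.

    `task` is assumed to follow the ARC JSON format:
    {"train": [{"input": grid, "output": grid}, ...], "test": [...]},
    where a grid is a list of lists of ints in 0-9. Reusing the same
    mapping for all grids keeps the underlying transformation intact.
    """
    colors = list(range(1, 10)) if keep_background else list(range(10))
    permuted = colors[:]
    random.shuffle(permuted)
    mapping = dict(zip(colors, permuted))
    mapping.setdefault(0, 0)  # assumption: color 0 is background and stays fixed

    def remap(grid):
        return [[mapping[cell] for cell in row] for row in grid]

    return {
        split: [{key: remap(grid) for key, grid in sample.items()}
                for sample in samples]
        for split, samples in task.items()
    }
```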
Results
Add more external datasets
experiment | accuracy | correct_pixels | correct_size | pass_32 | vote_2 |
---|---|---|---|---|---|
baseline | 7.62% | 70.89% | 88.64% | 23.25% | 15.91% |
add new datasets | 7.86% | 71.21% | 88.74% | 23.00% | 15.66% |
add neoneye tama | 7.46% | 71.33% | 89.12% | 22.50% | 15.66% |
add MINI-ARC | 7.62% | 71.41% | 89.21% | 26.62% | 17.05% |
remove neoneye tama | 7.38% | 71.22% | 88.86% | 24.75% | 16.92% |
The differences between experiments are small and probably not significant. Here is a brief summary of the added datasets:
- neoneye's tama: 50 tasks with 100 variations each
- PQA: 7 different tasks with lots of variations
- MINI-ARC: 149 tasks with 4.5 samples per task
- Kaggle: 14 tasks with 3.8 samples per task
Pretrain on datasets with more samples per task
Some of the datasets, such as RE-ARC, PQA or neoneye's tama, have a lot of samples for each task. Thus it might make sense to first pre-train the model on those datasets and, in a second stage, use all the available data, as sketched below.
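A minimal sketch of what this two-stage schedule could look like. The dataset names, the `load_dataset`/`fine_tune` callables and the step counts are placeholders to make the idea concrete, not the actual training code of this repository.

```python
# Hypothetical two-stage schedule; load_dataset and fine_tune are placeholders
# for the real data-loading and training code, not functions from this repo.
HIGH_VARIATION_DATASETS = ["re-arc", "pqa", "neoneye-tama"]        # many samples per task
LOW_VARIATION_DATASETS = ["arc-train", "mini-arc", "kaggle-small"]  # few samples per task


def two_stage_training(model, load_dataset, fine_tune,
                       stage1_steps=20_000, stage2_steps=20_000):
    """Stage 1: pretrain only on the datasets with many variations per task.
    Stage 2: continue training on all the available data."""
    stage1_data = [load_dataset(name) for name in HIGH_VARIATION_DATASETS]
    model = fine_tune(model, stage1_data, max_steps=stage1_steps)

    stage2_data = [load_dataset(name)
                   for name in HIGH_VARIATION_DATASETS + LOW_VARIATION_DATASETS]
    model = fine_tune(model, stage2_data, max_steps=stage2_steps)
    return model
```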
experiment | accuracy | correct_pixels | correct_size | pass_32 | vote_2 |
---|---|---|---|---|---|
pretrain on big datasets + normal train | 10.11% | 72.79% | 89.06% | 28.25% | 20.20% |
double length train | 9.93% | 71.73% | 87.80% | 26.88% | 17.30% |
We see a small improvement, although it might not be significant.
Conclusion
We have tried adding new external datasets to the training. The results are not conclusive; it is not clear whether adding these new datasets improves the validation scores.
Next steps
TODO
- PQA: Perceptual Question Answering
- Read the paper
- The dataset is big, how to deal with it? Can I group all the common tasks together?
- Check the colors
- Visualize Simon datasets https://github.com/neoneye/simon-arc-lab. I have been looking at the code, but I don't see how to decode the datasets. I believe he only works with RLE encoded data (a generic RLE-decoding sketch is included after this list).
- https://github.com/neoneye/simon-arc-lab/blob/main/simon_arc_lab/rle/deserialize.py
- Create a small dataset combining the 2 existing small kaggle datasets (see the merging sketch after this list)
- ARC gym
- Could it make sense to pretrain only on the datasets that have a lot of variation, like RE-ARC and PQA?
- Does neoneye tama improve accuracy?
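To make the RLE point in the TODO list concrete, here is a generic run-length decoding sketch. The encoding assumed below (a list of (count, color) pairs per row) is purely illustrative; the actual format used by simon-arc-lab is defined in its rle/deserialize.py and may well differ.

```python
def decode_rle_row(encoded_row):
    """Decode one run-length-encoded row given as (count, color) pairs,
    e.g. [(3, 0), (2, 5)] -> [0, 0, 0, 5, 5].

    Generic illustration of RLE decoding only; not the exact format used
    by simon-arc-lab (see its rle/deserialize.py for the real one).
    """
    row = []
    for count, color in encoded_row:
        row.extend([color] * count)
    return row


def decode_rle_grid(encoded_grid):
    """Decode a grid given as a list of run-length-encoded rows."""
    return [decode_rle_row(row) for row in encoded_grid]
```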
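For the item about combining the two small Kaggle datasets, a minimal sketch assuming both are stored as ARC-style JSON files mapping task ids to tasks; the file names in the usage example are placeholders.

```python
import json


def merge_arc_datasets(paths, output_path):
    """Merge several ARC-style JSON datasets (dicts of task_id -> task) into one file.

    Task ids are prefixed with the source file name to avoid collisions.
    """
    merged = {}
    for path in paths:
        with open(path) as f:
            tasks = json.load(f)
        prefix = path.rsplit("/", 1)[-1].rsplit(".", 1)[0]
        for task_id, task in tasks.items():
            merged[f"{prefix}_{task_id}"] = task
    with open(output_path, "w") as f:
        json.dump(merged, f)


# Hypothetical usage with placeholder file names:
# merge_arc_datasets(["kaggle_dataset_1.json", "kaggle_dataset_2.json"],
#                    "kaggle_combined.json")
```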