Addressing Flaws: Removed tasks that were susceptible to brute-force search from the evaluation and test sets (50% of the test tasks could be solved with an ensemble from 2020), and also removed tasks contaminated by the training tasks.
Compositional Tasks: ARC-v2 features compositional tasks with multiple interacting rules, making them harder for brute-force methods to solve.
Solvability: Every task was solved by at least 2 people out of a maximum of 10, and the average solve rate is 60%.
Human Calibration Study: A formal human calibration study was conducted to assess how humans perform on the tasks, so the evaluation and test sets should all have similar difficulty.
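The solvability criterion above (keep a task only if at least 2 of up to 10 participants solved it, then check the average solve rate) can be sketched as a simple filter. This is a hypothetical illustration, not the authors' actual pipeline; the `human_results` structure and function names are assumptions.

```python
# Hypothetical sketch of the solvability filter described above.
# `human_results` maps a task id to a list of booleans, one per
# participant (up to 10), indicating whether that person solved it.

def filter_solvable(human_results, min_solvers=2):
    """Keep only tasks solved by at least `min_solvers` participants."""
    return {
        task_id: attempts
        for task_id, attempts in human_results.items()
        if sum(attempts) >= min_solvers
    }

def average_solve_rate(human_results):
    """Mean per-task fraction of participants who solved the task."""
    rates = [sum(a) / len(a) for a in human_results.values()]
    return sum(rates) / len(rates)

results = {
    "task_a": [True, True, False, False, False],  # 2/5 solved -> kept
    "task_b": [False] * 5,                        # 0/5 solved -> dropped
    "task_c": [True] * 4 + [False],               # 4/5 solved -> kept
}
kept = filter_solvable(results)
print(sorted(kept))              # ['task_a', 'task_c']
print(average_solve_rate(kept))  # (0.4 + 0.8) / 2 = 0.6
```

With this toy data the two surviving tasks average a 60% solve rate, matching the figure reported for the real dataset.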
The new training dataset has 1,000 tasks and includes almost all of the previous ARC-AGI-1 tasks, as shown in this notebook.
Not Adversarial with ARC24 Models: Although accuracy drops sharply compared to ARC-AGI-1, this is caused by the higher complexity of the tasks, not adversarial design. The authors could have made o3 score 0, but they chose not to.