
Data Understanding

Collect initial data

External data

Describe data

ARC-AGI-2

  • Addressing Flaws: Tasks susceptible to brute-force search were removed from the evaluation and test sets (50% of the test tasks could be solved by an ensemble of solutions from 2020). Tasks contaminated by overlap with the training tasks were also removed.
  • Compositional Tasks: ARC-AGI-2 features compositional tasks with multiple interacting rules, which makes them harder for brute-force methods.
  • Solvability: Every task was solved by at least 2 people out of a maximum of 10, and the average solving rate is 60%.
  • Human Calibration Study: A formal human calibration study was conducted to measure how humans perform on the tasks. The evaluation and test sets are intended to have similar difficulty.
  • The new training dataset has 1000 tasks and includes almost all of the previous ARC-AGI-1 tasks, as shown in this notebook.
  • Not adversarial toward ARC24 models: although accuracy drops sharply compared to ARC-AGI-1, this is caused by the higher complexity of the tasks. They could have made o3 score 0, but they didn't.
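The claim that the ARC-AGI-2 training set contains almost all ARC-AGI-1 tasks can be checked by comparing the two versions task by task. The sketch below is not the method used in the referenced notebook; it assumes each task is stored as a `<task_id>.json` file containing the usual `train`/`test` lists of `input`/`output` grids, and that reused tasks may keep their file name. The helper names (`load_tasks`, `task_fingerprint`, `compare_versions`) are hypothetical. It matches tasks both by id and by an exact content hash, so a task that was slightly edited shows up as "modified" rather than "identical".

```python
import hashlib
import json
from pathlib import Path


def load_tasks(folder: Path) -> dict[str, dict]:
    """Load every <task_id>.json file in a folder into {task_id: task} (assumed layout)."""
    return {p.stem: json.loads(p.read_text()) for p in sorted(folder.glob("*.json"))}


def task_fingerprint(task: dict) -> str:
    """Hash a task's content so identical tasks match even if they were renamed."""
    # Canonical JSON (sorted keys, no whitespace) so formatting differences don't matter.
    canonical = json.dumps(task, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compare_versions(old: dict[str, dict], new: dict[str, dict]) -> dict[str, set[str]]:
    """Classify each old (ARC-AGI-1) task by whether it reappears in the new set."""
    new_prints = {task_fingerprint(t) for t in new.values()}
    # Exact content match, regardless of task id.
    same_content = {tid for tid, t in old.items() if task_fingerprint(t) in new_prints}
    same_id = set(old) & set(new)
    return {
        "identical": same_content,
        # Same file name but different content (e.g. an edited test pair).
        "modified": same_id - same_content,
        # Neither the id nor the exact content appears in the new set.
        "missing": set(old) - same_id - same_content,
    }
```

Usage, with hypothetical local paths: `compare_versions(load_tasks(Path("arc-agi-1/training")), load_tasks(Path("arc-agi-2/training")))`. An exact hash is deliberately strict; near-duplicates that were renamed and edited would land in "missing" and would need a fuzzier comparison, for example on the `train` pairs only.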

Explore data

Verify data quality

Amount of data