Do robots learn better by breaking tasks into sub-skills?

Vision-language-action (VLA) policies struggle to learn new tasks without expensive fine-tuning. Researchers trained two VLA architectures on assembly data using either raw (flat) trajectories or primitive-segmented episodes (trajectories broken into sub-skills), then tested few-shot transfer on held-out tasks using 0–10 demonstrations. Primitive-trained models reached 78% of fine-tuned performance with 3 demos; flat-trained models needed 10 to reach the same level. Ablating the primitive-decodable subspace of hidden states dropped transfer by 32 points, evidence that the primitive representations are causally involved in transfer rather than coincidental.
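
One common way to ablate a "decodable subspace" is to fit a linear probe that predicts primitive labels from hidden states and then project the hidden states onto the orthogonal complement of the probe's directions. The summary does not say how the subspace was identified or removed, so the following is only a minimal sketch under that assumption; the function name `ablate_subspace`, the array shapes, and the random stand-in data are all hypothetical.

```python
import numpy as np

def ablate_subspace(hidden, probe_weights, tol=1e-10):
    """Remove the span of the probe directions from each hidden state.

    hidden:        (n, d) array of hidden states (hypothetical shape).
    probe_weights: (k, d) array, one row per direction of an assumed
                   linear probe that decodes primitive labels.
    """
    # Orthonormal basis for the row space of the probe, via SVD.
    _, s, vt = np.linalg.svd(probe_weights, full_matrices=False)
    basis = vt[s > tol * s.max()]          # (r, d), r = rank of the probe
    # Subtract each hidden state's component inside that subspace.
    return hidden - hidden @ basis.T @ basis

# Random stand-in data, for illustration only (not from the study).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 8))
probe_weights = rng.normal(size=(3, 8))
ablated = ablate_subspace(hidden, probe_weights)
```

After this projection the assumed probe reads out zero from every ablated state, while directions orthogonal to the probe are left untouched. Comparing transfer performance with `hidden` versus `ablated` is the kind of measurement behind the 32-point drop reported above.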