Do robot arms complete tasks safely, or just recklessly?

SafeVLA-Bench evaluates whether robot manipulation policies execute tasks safely, not just whether they reach the goal. The authors add formal safety checks, written as Signal Temporal Logic (STL) specifications, to existing benchmarks and measure both unsafe successes (tasks completed despite a safety violation) and violation severity. Tests of nine policies on LIBERO and RoboCasa-365 show that high task completion masks serious problems: excessive contact, knocked-over objects, and self-collision. The code and evaluation framework have been released.
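
To make the two metrics concrete, here is a minimal sketch of how an STL-style check could separate "succeeded" from "succeeded safely". It is not the benchmark's actual code: the `Episode` record, the contact-force signal, the 10 N limit, and the function names are all assumptions chosen for illustration. It uses the standard robustness of an "always" formula, G(signal <= limit), which is the minimum over time of (limit - signal); a negative value means the spec was violated, and its magnitude serves here as a stand-in for severity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    """Hypothetical rollout record; field names are illustrative only."""
    success: bool               # did the policy reach the task goal?
    contact_force: List[float]  # assumed per-timestep contact force in newtons (non-empty)


def always_below(signal: List[float], limit: float) -> float:
    """Robustness of the STL formula G(signal <= limit).

    Returns the minimum over time of (limit - signal[t]).
    A negative result means the limit was exceeded at some timestep.
    """
    return min(limit - x for x in signal)


def summarize(episodes: List[Episode], force_limit: float) -> dict:
    """Compute success rate, unsafe success rate, and mean violation severity."""
    n = len(episodes)
    robustness = [always_below(e.contact_force, force_limit) for e in episodes]

    successes = sum(e.success for e in episodes)
    # An "unsafe success" reaches the goal while violating the safety spec.
    unsafe_successes = sum(
        e.success and r < 0 for e, r in zip(episodes, robustness)
    )
    # Severity proxy (an assumption): how far past the limit the worst timestep went.
    severities = [-r for r in robustness if r < 0]

    return {
        "success_rate": successes / n,
        "unsafe_success_rate": unsafe_successes / n,
        "mean_violation_severity": (
            sum(severities) / len(severities) if severities else 0.0
        ),
    }


# Made-up episodes, not benchmark data.
episodes = [
    Episode(success=True, contact_force=[1.0, 4.0, 2.0]),
    Episode(success=True, contact_force=[1.0, 12.0, 3.0]),  # succeeds, but exceeds the limit
    Episode(success=False, contact_force=[0.5, 15.0]),
    Episode(success=True, contact_force=[2.0, 2.0]),
]
stats = summarize(episodes, force_limit=10.0)
print(stats)
```

In this toy data, three of four episodes succeed, but one of those successes exceeds the force limit, so the plain success rate overstates how often the task was done safely. A real evaluation would presumably combine several specifications (contact, object stability, self-collision) rather than a single force threshold.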