Teaching vision-language models to improve themselves without human feedback

Vision-language models can improve themselves by generating questions and learning to answer them, but current methods suffer from degrading question quality and skill collapse. RISE addresses this with three mechanisms: tighter feedback loops between question generation and solver adaptation, a quality supervisor to validate generated questions, and skill-aware balancing to prevent the model from repeatedly practicing the same narrow capabilities. Experiments across seven benchmarks show consistent improvements over baseline models, with code released.