Teaching vision-language models to improve themselves without human feedback
Chaoran Xu, Yingmao Miao, Pengfei Zhang, Hao Dou, Lei Sun, Xiangxiang Chu
May 20, 2026
Vision-language models can improve themselves by generating questions and learning to answer them, but current methods suffer from two problems: the quality of generated questions degrades over time, and the model's skills collapse onto a narrow set. RISE addresses this with three mechanisms: tighter feedback loops between question generation and solver adaptation, a quality supervisor that validates generated questions, and skill-aware balancing that keeps the model from repeatedly practicing the same narrow capabilities. Experiments across seven benchmarks show consistent improvements over baseline models. The code has been released.
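The summary does not say how RISE implements skill-aware balancing, so the following is only a minimal sketch of one plausible reading: sample the next skill to practice with probability inversely proportional to how often it has already been practiced. The function names, the `smoothing` parameter, and the skill labels are all hypothetical and are not taken from the paper or its code.

```python
import random
from collections import Counter


def skill_sampling_weights(practice_counts, skills, smoothing=1.0):
    """Return a probability per skill that shrinks as its practice count grows.

    Assumption: inverse-frequency weighting stands in for the paper's
    skill-aware balancing, whose actual form is not given in the summary.
    """
    raw = {s: 1.0 / (practice_counts.get(s, 0) + smoothing) for s in skills}
    total = sum(raw.values())
    return {s: w / total for s, w in raw.items()}


def pick_skill(practice_counts, skills, rng=random):
    """Draw one skill to generate the next question for (hypothetical helper)."""
    weights = skill_sampling_weights(practice_counts, skills)
    return rng.choices(list(weights), weights=list(weights.values()), k=1)[0]


# Illustrative skill labels only; the paper's skill taxonomy is not stated here.
skills = ["counting", "ocr", "spatial"]
counts = Counter({"counting": 9, "ocr": 1})
weights = skill_sampling_weights(counts, skills)
chosen = pick_skill(counts, skills, rng=random.Random(0))
```

Under this scheme a skill that has never been practiced (`"spatial"` above) gets the largest share of new questions, while a heavily practiced one is sampled rarely but never with zero probability.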