cs.CV

Teaching vision-language models to improve themselves without human feedback

Chaoran Xu, Yingmao Miao, Pengfei Zhang, Hao Dou, Lei Sun, Xiangxiang Chu

May 20, 2026

Vision-language models can improve themselves by generating questions and learning to answer them, but current methods suffer from two problems: the quality of generated questions degrades over time, and the model's skills collapse onto a narrow range. RISE addresses both with three mechanisms: tighter feedback loops between question generation and solver adaptation, a quality supervisor that validates generated questions, and skill-aware balancing that keeps the model from repeatedly practicing the same narrow capabilities. Experiments across seven benchmarks show consistent improvements over baseline models, and the code has been released.
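To make the idea of skill-aware balancing concrete, here is a minimal sketch in Python. It is not the RISE implementation: the summary does not say how balancing is done, so this assumes a simple inverse-frequency rule in which skills that have been practiced less are sampled more often. The function names, the `smoothing` parameter, and the skill labels are all hypothetical.

```python
import random
from collections import Counter


def skill_balanced_weights(skill_counts, smoothing=1.0):
    """Return sampling probabilities inversely proportional to practice counts.

    Assumption: an inverse-frequency rule stands in for whatever balancing
    scheme the paper actually uses.
    """
    inverse = {skill: 1.0 / (count + smoothing) for skill, count in skill_counts.items()}
    total = sum(inverse.values())
    return {skill: weight / total for skill, weight in inverse.items()}


def sample_skill(skill_counts, rng, smoothing=1.0):
    """Pick the skill to generate the next question for."""
    weights = skill_balanced_weights(skill_counts, smoothing)
    skills = list(weights)
    return rng.choices(skills, weights=[weights[s] for s in skills], k=1)[0]


# Hypothetical practice history: the model has over-practiced "counting".
counts = Counter({"counting": 90, "ocr": 8, "spatial": 2})
weights = skill_balanced_weights(counts)

rng = random.Random(0)
next_skill = sample_skill(counts, rng)
counts[next_skill] += 1  # record the practice so later draws rebalance
```

Under this rule the rarely practiced skill receives the largest sampling weight, which is one simple way to counteract the collapse onto narrow capabilities described above.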
Published as "RISE: Reliable Improvement in Self-Evolving Vision-Language Models" (arXiv:2605.20914)