Can small models learn to write better prompts for image generation?

Zijian Huang, Jay Zhangjie Wu, Zian Wang, Tianshi Cao, Jiasi Chen, Sanja Fidler, Huan Ling, Xuanchi Ren

Image generators are sensitive to wording: the same request phrased differently can yield different results. Instead of relying on expensive closed-source APIs such as ChatGPT to fix prompts, the authors train small language models as specialized agents that rewrite and decompose prompts before passing them to standard image generators. Their multi-agent version breaks complex requests into routing, rewriting, and composing steps, which helps with tricky compositional details (objects, attributes, spatial relations). On challenging benchmarks, these lightweight enhancers match closed-source prompt optimization while adding minimal latency and cost to the pipeline.
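
To make the routing, rewriting, and composing steps concrete, here is a minimal sketch of how such a pipeline could be wired together. It is not the authors' implementation: the function names (`route`, `decompose`, `rewrite`, `compose`, `enhance`) are hypothetical, each agent is replaced by a hand-written rule standing in for a trained small language model, and treating "routing" as a simple-versus-compositional decision is an assumption made for illustration.

```python
from typing import List


def route(prompt: str) -> str:
    """Toy router (assumption): flag prompts that mention several
    objects or a spatial relation as 'compositional'."""
    markers = (" and ", " next to ", " left of ", " right of ", " on top of ")
    lowered = prompt.lower()
    return "compositional" if any(m in lowered for m in markers) else "simple"


def decompose(prompt: str) -> List[str]:
    """Toy decomposition (assumption): split the request on ' and '."""
    return [part.strip() for part in prompt.split(" and ") if part.strip()]


def rewrite(part: str) -> str:
    """Stand-in for the rewriting agent: append a fixed detail phrase.
    A trained model would generate a richer description here."""
    return f"{part}, clearly visible"


def compose(parts: List[str]) -> str:
    """Stand-in for the composing agent: join the rewritten parts
    into one prompt for the image generator."""
    return "; ".join(parts)


def enhance(prompt: str) -> str:
    """Run the full sketch: route, then rewrite directly or
    decompose, rewrite each part, and compose."""
    if route(prompt) == "simple":
        return rewrite(prompt)
    return compose([rewrite(part) for part in decompose(prompt)])


if __name__ == "__main__":
    print(enhance("a red cube and a blue sphere"))
    print(enhance("a cat"))
```

The enhanced string would then be passed unchanged to a standard image generator. In a real system each of the three functions would presumably be a call to a small fine-tuned model, and only the control flow shown in `enhance` would stay the same.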