Can you improve language models by tuning prompts at test time?

Language models often struggle to infer the right task from a few examples. This work improves in-context learning by adjusting the continuous embeddings of a fixed prompt at test time, using the model's own confidence in its demonstrated outputs as the optimization signal. The method requires no finetuning, token generation, or external data, and it applies to both classification and open-ended generation. It consistently matches or beats the base model's performance.
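To make the idea concrete, here is a minimal sketch in pure Python. It is not the authors' implementation: the "model" is a hypothetical frozen linear softmax classifier rather than a language model, and the names `frozen_model`, `demo_loss`, and `tune_prompt`, the learning rate, and the step count are all assumptions chosen for illustration. It shows only the core loop described above: the model weights stay fixed, and a continuous prompt vector is updated by gradient descent to raise the model's probability of the demonstrated outputs.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def frozen_model(W, prompt, x):
    """Toy stand-in for a frozen model: logits = W @ (prompt + x)."""
    h = [p + xi for p, xi in zip(prompt, x)]
    return [sum(w * hj for w, hj in zip(row, h)) for row in W]

def demo_loss(W, prompt, demos):
    """Mean negative log-probability the model gives the demonstrated outputs."""
    total = 0.0
    for x, y in demos:
        total -= math.log(softmax(frozen_model(W, prompt, x))[y])
    return total / len(demos)

def tune_prompt(W, prompt, demos, lr=0.5, steps=50):
    """Gradient descent on the prompt vector only; W is read but never written."""
    prompt = list(prompt)
    for _ in range(steps):
        grad = [0.0] * len(prompt)
        for x, y in demos:
            probs = softmax(frozen_model(W, prompt, x))
            for k, row in enumerate(W):
                # d(loss)/d(logit_k) for softmax cross-entropy
                err = probs[k] - (1.0 if k == y else 0.0)
                for j in range(len(prompt)):
                    grad[j] += err * row[j] / len(demos)
        prompt = [p - lr * g for p, g in zip(prompt, grad)]
    return prompt

# Hypothetical data: three demonstrations (input vector, label).
W = [[1.0, 0.0], [0.0, 1.0]]            # frozen weights
demos = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([0.6, 0.5], 1)]
initial_prompt = [0.0, 0.0]

tuned_prompt = tune_prompt(W, initial_prompt, demos)
loss_before = demo_loss(W, initial_prompt, demos)
loss_after = demo_loss(W, tuned_prompt, demos)
print(f"loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

In this toy setting the third demonstration is misclassified with the initial prompt, and tuning the prompt vector alone is enough to fix it. A real application would differentiate through a language model's input embeddings with an autodiff framework instead of the hand-written gradient used here.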