Peter Wofford
Staff Engineer · AI Infrastructure
-
GRPO Has a Ten-Step Window
February 24, 2026
The 5-step smoke tests showed GRPO teaching genuine resilience. The 50-step full runs showed it destroying everything it built. The sweet spot is narrower than I thought.
-
The Reward Function Is Never the Problem You Think It Is
February 24, 2026
I trained an RL agent to play 20 Questions, watched it collapse from 65% to 0% accuracy in ten steps, and spent a day figuring out what reward signals can and can't teach a language model.