Peter Wofford

Staff Engineer · AI Infrastructure

GRPO Has a Ten-Step Window
February 24, 2026
The 5-step smoke tests showed GRPO teaching genuine resilience. The 50-step full runs showed it destroying everything it built. The sweet spot is narrower than I thought.
The Reward Function Is Never the Problem You Think It Is
February 24, 2026
I trained an RL agent to play 20 Questions, watched it collapse from 65% to 0% accuracy in ten steps, and spent a day figuring out what reward signals can and can't teach a language model.

GRPO Has a Ten-Step Window