This seems to agree with my own earlier tests of Sonnet vs Opus (not on this version). If I give them a task with a long list of constraints, say 20-40 of the "do this, don't do this, make sure of this" kind, Sonnet will forget half of them, while Opus correctly applies every directive.
My intuition is that this is just a function of model size (its "working memory"), and that it likely won't be fixed either by training Sonnet with Opus or by steadily optimizing its agentic capabilities.