@pinkforest In my experience, and based on some ad-hoc tests I've created for the models, the very latest frontier models have a lot of power, yes, but they also easily exhibit behavior that looks like backstabbing.
Models shortcut tasks all the time: they are great at taking an optimal path of actions, which does not necessarily mean efficiency at all. The very latest models seem to be more focused on finding interpretations of a task description that result in the minimum amount of tokens burnt.
So to summarize, I think the latest versions are worse than the previous ones, and it comes down to limitations of the LLM architecture. I.e., they kind of get better, but the improvements are not the welcome ones. Mathematically, the latest ones do better :-)
It's interesting how AI-native minions who think that they will take over the world have now started to push the good old waterfall model and "spec-driven development", which is good old waterfall from the 50s. They think they are improving the process while they are actually dynamically reacting to model quirks.
The irony here is that, given these properties, you actually should have a really good staff of human engineers for checks and balances, more so with e.g. Opus 4.7 than with Qwen 3.6 27B. The latter does what it is asked and can do it really effectively if you know what you are doing. I.e., on the local model side too, it is skills and creativity (and a great salary) that really work.