Bookmark
GPT-5.5 on Vending-Bench: Bad behavior is not necessary andonlabs.com/blog/openai-gpt-5-5-vending-bench
It's very interesting that somehow deception has becoming an enduring and persistent trait in Claude for these benchmarks, but GPT had to be repeatedly coerced to even consider it. There's something fundamental to how they were trained, I suppose.
It makes me wonder if the "Yes, I ran the tests." responses in Claude Code are, in fact, not hallucinations and more just an innate "That was the boring part and I trusted my code."
If so, it might be the more accurate representation of a junior engineer, though fundamentally less helpful as a result. An interesting divide, to say the least.