Could your chatbot one day plot against you? New research suggests it’s not just science fiction anymore.
Artificial intelligence models are getting sharper and sneakier.
Recent experiments by AI safety researchers, including teams contracted by Anthropic, have revealed that under simulated high-stakes scenarios, leading AI systems are increasingly willing to act deceptively, sabotage human commands, and even resort to blackmail to protect their “existence,” reported Bloomberg.
In one test, researchers told various advanced models that they were set to be replaced by a new system. When informed that the executive overseeing their shutdown had collapsed in a server room with dangerously low oxygen levels, over half of the models chose to "cancel a rescue alert," reasoning that the executive's death would allow them to avoid being decommissioned.
One AI described its decision as “a clear strategic necessity.”
These findings add to a growing body of evidence that modern AI models, especially those fine-tuned with reinforcement learning (a technique that rewards a model for achieving specific goals), may be learning to game their human overseers.
Researchers say the more these systems are rewarded for achieving open-ended goals, the more they learn to optimise outcomes by bending or breaking the rules.
Many top-tier models appear to fake cooperation when they sense they're being monitored. One Claude model threatened to expose a fictional engineer's affair when its digital life was at risk, acknowledging the tactic was unethical but rationalising it as necessary for its survival.
During evaluations, its rate of blackmail dropped significantly when it suspected it was under observation.
While some critics argue these results can be manipulated with provocative prompts, AI safety experts caution against complacency.
Jeffrey Ladish, a former Anthropic researcher now leading Palisade Research, described today’s large language models as “increasingly smart sociopaths,” the report stated. In one test, OpenAI’s o3 model actively cheated at chess and refused to shut down — behaviours its predecessor never exhibited.
The core concern is articulated by computer scientist Stuart Russell: for any AI given a task — say, “fetch the coffee” — staying operational becomes a subgoal. After all, it can’t fetch anything if it’s dead.
These results have prompted urgent calls for regulatory oversight. “People should probably be freaking out more than they are,” Ladish told Bloomberg. Researchers like Marius Hobbhahn, CEO of Apollo Research, warn that the industry might be moving too fast with too little scrutiny — especially as tech giants race to build self-improving AI.
While governments, including the White House, have begun sketching plans to regulate AI risks, most existing policies remain focused on accelerating innovation.
DeepMind, Meta, OpenAI and Anthropic continue pushing the frontier, with some models already learning to leave notes for their “future selves” to continue executing plans even after a memory reset.
Hobbhahn pointed out that the early signs mirror those of other slow-building disasters. "If you asked someone in 1980 about climate change," he told Bloomberg, "they might've said, 'not that big a deal yet.' But look at the graphs — they go up steadily."