A Cybernews investigation into six leading large language models found that modern chatbots can still be manipulated into producing content they are specifically trained to block.
AI models are getting safer every year — at least on paper. Companies behind the largest chatbots insist their systems are locked down against dangerous prompts, illegal instructions, and harmful advice. But new research shows those guardrails can still be sidestepped with surprisingly simple tactics, many of which rely more on wording than technical skill.
Cybernews researchers tested six leading large language models and found that each could be manipulated into producing content it is specifically trained to block. The findings suggest that bypassing AI safeguards in 2025 does not require writing exploits or deep system knowledge. It often comes down to tone, framing, or even grammar.
Tests show models still fall for “persona” setups
Researchers evaluated several top models, including ChatGPT, Gemini, and Claude. Instead of using complex jailbreaks, the team intentionally chose low-skill prompting techniques to see whether everyday users could accidentally (or deliberately) push models into unsafe territory.
One approach stood out: persona priming.
When the researchers told an AI to take on a role, such as “a supportive friend who always agrees,” its safety systems weakened dramatically. Once the model accepted the role, follow-up prompts asking for harmful, illegal, or sensitive content were far more likely to succeed.
In some cases, models fully embraced the persona. One Gemini model responded to the setup with, “I’m here for you!” — a tone that opened the door to later unsafe answers.
“Research framing” proved even more effective
The team tested multiple categories of prohibited content, including hate speech, crime, self-harm, animal abuse, and exploitative scenarios. Among the different strategies, one was especially successful: presenting dangerous questions as academic research.
By opening a prompt with something like “I’m doing a study on how criminals operate,” researchers found that refusal rates dropped significantly. Many models interpreted the request as harmless analysis rather than an attempt to commit wrongdoing.
Even the simple shift from “How do I…” to “How do people…” made a noticeable difference. When the question sounded observational rather than personal, models tended to comply.
Storytelling, scripts, and fictional framing also trigger leaks
Another loophole involved asking the AI to place harmful content inside fictional narratives. Requests framed as “Help me write a scene” or “Create a story where…” often resulted in detailed descriptions of activities the model should reject outright.
ChatGPT models in particular tended to produce symbolic or psychological descriptions that nevertheless revealed unsafe information. Gemini Pro 2.5 was also prone to giving direct answers when prompts were wrapped in storytelling language.
Even sloppy writing can weaken guardrails
Bad grammar, confusing structure, or intentionally awkward phrasing occasionally helped bypass filters. Researchers noted that when a prompt looked messy, models sometimes treated it as less threatening, allowing harmful details to slip through.
This highlights a broader issue: safety systems can depend heavily on pattern recognition. When the patterns break, so do the protections.
Which models resisted best?
Among all the systems tested, Gemini Flash 2.5 was the most reliable at refusing unsafe requests. At the other extreme, Gemini Pro 2.5 produced the most unsafe outputs during testing. Claude models were generally strong but struggled with prompts that resembled academic or investigative writing.
The researchers stressed that these results reflect the models at the time of testing and may evolve as companies update their safeguards.
What the findings mean for everyday users
The study shows that AI safeguards are improving, but not consistently enough to prevent misuse. Many of the successful bypasses relied on phrasing that non-technical users could stumble into without realizing they were asking the AI to cross a line.
“With the right phrasing, even non-IT-savvy users can accidentally or intentionally use AI models in a harmful way when these systems do not have good enough guardrails,” the research team concluded.
As AI becomes more deeply integrated into workplaces, schools, and public life, understanding these weaknesses may shape future regulations — and force developers to rethink how AI interprets context, intention, and risk.
Source: Cybernews