OpenAI’s new safety tools may create “false sense of security,” experts warn

Experts warn that while the company is touting the release as a breakthrough in responsible AI, the move could instead lull users into a dangerous sense of safety.

Others are reading now

OpenAI has unveiled new open-weight safety tools meant to make AI models more secure for businesses. Yet some experts warn that while the company is touting the release as a breakthrough in responsible AI, the move could instead lull users into a dangerous sense of safety.

Guardrails for AI models

Last week, OpenAI introduced two new free-to-download tools, gpt-oss-safeguard-120b and gpt-oss-safeguard-20b. The company says they allow organizations to build customizable “guardrails” around the prompts users give to AI systems — and the responses those systems generate.

The idea is to make it easier for businesses to enforce internal policies, such as ensuring a chatbot doesn’t reveal private guidelines or respond rudely to customers. Traditionally, companies needed to train separate classifiers to enforce such rules — a costly and time-consuming process.

OpenAI’s new approach, called reasoning-based classification, allows a model to interpret written safety policies directly, rather than relying on retraining. The goal, according to OpenAI, is to make adapting those policies as simple as editing text in a document.

Experts see new risks

Despite OpenAI’s claims of stronger safeguards, security specialists warn that open-sourcing the classifiers — meaning the full model weights and code are publicly available — could expose potential vulnerabilities.

Also read

“Making these models open-source can help attackers as well as defenders,” David Krueger, an AI safety researcher at Mila, told Fortune. “It will make it easier to develop approaches to bypassing the classifiers and other similar safeguards.”

Attackers who have access to a model’s internal architecture can experiment with so-called prompt injection attacks — inputs designed to trick the classifier into ignoring its safety rules. Sometimes even nonsensical strings of text can manipulate the system into generating prohibited or harmful content, such as dangerous instructions or hate speech.

Open source trade-offs

The decision to make the classifiers public is seen as a calculated risk. Open-sourcing could allow researchers worldwide to strengthen these tools more quickly, but it could also make it easier for malicious actors to study and evade them.

“In the long term, it’s beneficial to share the way your defenses work,” said Vasilios Mavroudis, a principal research scientist at the Alan Turing Institute. “It may result in some kind of short-term pain. But in the long term, it results in robust defenses that are actually pretty hard to circumvent.”

Robert Trager, co-director of the Oxford Martin AI Governance Initiative, added that determined hackers are likely to find ways around any safeguard, whether open-sourced or not. “Given that determined jailbreakers will be successful anyway, it’s useful to open-source systems that developers can use for the less-determined folks,” he told Fortune.

Competing for enterprise trust

Also read

The release also underscores growing competition in the enterprise AI space. Analysts note that OpenAI’s move may be partly strategic, aimed at countering rival Anthropic’s popularity among corporate clients. Anthropic’s Claude models are known for their strict “constitutional” safety systems — a feature many businesses cite as a key reason for adopting them.

By open-sourcing its classifiers, OpenAI is signaling confidence that collaboration and transparency will strengthen its ecosystem over time. Whether that openness will lead to stronger AI safety or simply new vulnerabilities, however, remains an open question.

Sources: Fortune, OpenAI Technical Blog, Alan Turing Institute, Oxford Martin AI Governance Initiative.

This article is made and published by Asger Risom, who may have used AI in the preparation