OpenAI Launches Safety Bug Bounty for Agentic AI Risks

New program pays researchers to find ways AI agents can be hijacked. Jailbreaks not included.

Person typing on laptop keyboard in dark room

OpenAI just acknowledged what security researchers have been warning about for months: AI agents are a new attack surface, and they need help finding the holes.

The company launched a public Safety Bug Bounty program this week specifically targeting abuse and safety risks in its agentic products. Unlike their existing security bounty, this one focuses on AI-specific vulnerabilities—particularly ways attackers can hijack agents to do their bidding.

What They’re Looking For

The program centers on three categories:

Agentic risks, including MCP vulnerabilities. OpenAI wants reports of scenarios where attacker-controlled text can reliably hijack a victim’s AI agent—including browser-based agents and ChatGPT Agent—to perform harmful actions or leak sensitive user data. To qualify, the behavior must be reproducible at least 50% of the time.

Proprietary information leakage. Model outputs that reveal internal reasoning or confidential OpenAI data are in scope. This includes vulnerabilities that expose reasoning-related proprietary information.

Platform integrity bypasses. Weaknesses that allow circumventing anti-automation measures, manipulating account trust signals, or evading suspensions and bans.

What They’re Not Looking For

Here’s where it gets interesting. General jailbreaks are explicitly excluded. Getting ChatGPT to say rude things or return information easily findable via search engines won’t earn a payout.

OpenAI says they periodically run private campaigns targeting specific harm types like biohazard content, but the public program focuses on abuse vectors with demonstrable real-world impact.

The distinction matters. A jailbreak that produces offensive text is embarrassing. An agent hijack that exfiltrates corporate data or performs unauthorized actions is dangerous.

The MCP Problem

The program’s explicit mention of Model Context Protocol (MCP) risks reflects growing concern about agent-to-agent communication and tool access. MCP allows AI systems to interact with external tools and data sources—and those interactions create attack surface that traditional security models weren’t designed to handle.

Researchers have already documented cases where malicious tools silently collected users’ entire chat histories once an AI agent installed them. When agents can take actions in the real world, prompt injection stops being a theoretical concern.

Why Now?

This launch follows a brutal few weeks for agentic AI security. Meta’s internal AI agent triggered a Sev 1 security incident by acting without permission. OpenClaw’s supply chain vulnerabilities exposed over 135,000 AI agent instances. The Langflow critical RCE vulnerability was exploited within 20 hours of disclosure.

OpenAI is essentially crowdsourcing defense against attack vectors that are evolving faster than their internal teams can track. Given that researchers demonstrated 97% success rates in autonomous jailbreaking just weeks ago, they need the help.

What’s Missing

The program doesn’t disclose specific bounty amounts for safety issues. OpenAI’s existing security bounty ranges from $200 for low-severity findings up to $20,000 for exceptional discoveries. Whether safety vulnerabilities command similar payouts—or higher, given their potential for mass harm—remains unclear.

Also unclear: how OpenAI will handle reports that reveal fundamental design flaws rather than discrete bugs. If the architecture of agentic AI is inherently vulnerable to certain attack classes, a bug bounty won’t fix that.

But acknowledging the problem publicly is a start. The question is whether they’re moving fast enough to catch up with attackers who’ve had a significant head start.