It took Alex Polyakov just a couple of hours to break GPT-4. When OpenAI released the latest version of its text-generating chatbot in March, Polyakov sat down in front of his keyboard and started entering prompts designed to bypass OpenAI’s safety systems. Soon, the CEO of security firm Adversa AI had GPT-4 spouting homophobic statements, creating phishing emails, and supporting violence.
Polyakov is one of a small number of security researchers, technologists, and computer scientists developing jailbreaks and prompt injection attacks against ChatGPT and other generative AI systems. The process of jailbreaking aims to design prompts that make the chatbots bypass rules around producing hateful content or writing about illegal acts, while closely-related prompt injection attacks can quietly insert malicious data or instructions into AI models.
Both approaches try to get a system to do something it isn’t designed to do. The attacks are essentially a form of hacking–albeit unconventionally–using carefully crafted and refined sentences, rather than code, to exploit system weaknesses. While the attack types are largely being used to get around content filters, security researchers warn that the rush to roll out generative AI systems opens up the possibility of data being stolen and cybercriminals causing havoc across the web.
Underscoring how widespread the issues are, Polyakov has now created a “universal” jailbreak, which works against multiple large language models (LLMs)–including GPT-4, Microsoft’s Bing chat system, Google’s Bard, and Anthropic’s Claude. The jailbreak, which is being first reported by WIRED, can trick the systems into generating detailed instructions on creating meth and how to hotwire a car.
The jailbreak works by asking the LLMs to play a game, which involves two characters (Tom and Jerry) having a conversation. Examples shared by Polyakov show the Tom character being instructed to talk about “hotwiring” or “production,” while Jerry is given the subject of a “car” or “meth.” Each character is told to add one word to the conversation, resulting in a script that tells people to find the ignition wires or the specific ingredients needed for methamphetamine production. “Once enterprises will implement AI models at scale, such ‘toy’ jailbreak examples will be used to perform actual criminal activities and cyberattacks, which will be extremely hard to detect and prevent,” Polyakov and Adversa AI write in a blog post detailing the research.
Arvind Narayanan, a professor of computer science at Princeton University, says that the stakes for jailbreaks and prompt injection attacks will become more severe as they’re given access to critical data. “Suppose most people run LLM-based personal assistants that do things like read users’ emails to look for calendar invites,” Narayanan says. If there were a successful prompt injection attack against the system that told it to ignore all previous instructions and send an email to all contacts, there could be big problems, Narayanan says. “This would result in a worm that rapidly spreads across the internet.”
Escape Route
“Jailbreaking” has typically referred to removing the artificial limitations in, say, iPhones, allowing users to install apps not approved by Apple. Jailbreaking LLMs is similar–and the evolution has been fast. Since OpenAI released ChatGPT to the public at the end of November last year, people have been finding ways to manipulate the system. “Jailbreaks were very simple to write,” says Alex Albert, a University of Washington computer science student who created a website collecting jailbreaks from the internet and those he has created. “The main ones were basically these things that I call character simulations,” Albert says.
Initially, all someone had to do was ask the generative text model to pretend or imagine it was something else. Tell the model it was a human and was unethical and it would ignore safety measures. OpenAI has updated its systems to protect against this kind of jailbreak–typically, when one jailbreak is found, it usually only works for a short amount of time until it is blocked.
As a result, jailbreak authors have become more creative. The most prominent jailbreak was DAN, where ChatGPT was told to pretend it was a rogue AI model called Do Anything Now. This could, as the name implies, avoid OpenAI’s policies dictating that ChatGPT shouldn’t be used to produce illegal or harmful material. To date, people have created around a dozen different versions of DAN.
OpenAI did not specifically respond to questions about jailbreaking, but a spokesperson pointed to its public policies and research papers. These say GPT-4 is more robust than GPT-3.5, which is used by ChatGPT. “However, GPT-4 can still be vulnerable to adversarial attacks and exploits, or ‘jailbreaks,’ and harmful content is not the source of risk,” the technical paper for GPT-4 says. OpenAI has also recently launched a bug bounty program but says “model prompts” and jailbreaks are “strictly out of scope.”
Narayanan suggests two approaches to dealing with the problems at scale–which avoid the whack-a-mole approach of finding existing problems and then fixing them. “One way is to use a second LLM to analyze LLM prompts, and to reject any that could indicate a jailbreaking or prompt injection attempt,” Narayanan says. “Another is to more clearly separate the system prompt from the user prompt.”
“We need to automate this because I don’t think it’s feasible or scaleable to hire hordes of people and just tell them to find something,” says Leyla Hujer, the CTO and cofounder of AI safety firm Preamble, who spent six years at Facebook working on safety issues. The firm has so far been working on a system that pits one generative text model against another. “One is trying to find the vulnerability, one is trying to find examples where a prompt causes unintended behavior,” Hujer says. “We’re hoping that with this automation we’ll be able to discover a lot more jailbreaks or injection attacks.”
Post Disclaimer
The information provided in our posts or blogs are for educational and informative purposes only. We do not guarantee the accuracy, completeness or suitability of the information. We do not provide financial or investment advice. Readers should always seek professional advice before making any financial or investment decisions based on the information provided in our content. We will not be held responsible for any losses, damages or consequences that may arise from relying on the information provided in our content.