Episode 21 — Separate System, Developer, and User Instructions to Prevent Confused Authority
In this episode, we’re going to get comfortable with a problem that trips up even smart people who are new to AI security: the model doesn’t just answer questions, it follows instructions, and those instructions can come from multiple places at the same time. When instructions pile up, it becomes surprisingly easy for the model to act like a confused intern who got three different bosses on the same email thread, each giving different directions. The security risk is not only that the model might do the wrong thing, but that it might do the wrong thing confidently and in a way that looks legitimate. The good news is that you can prevent a lot of this confusion by learning to separate instructions into clear authority layers and then enforcing those layers consistently. Think of it as giving the model a chain of command, and then making sure nobody can impersonate someone higher in that chain.
Before we continue, a quick note: this audio course is a companion to our course companion books. The first book is about the exam and provides detailed information on how to pass it best. The second book is a Kindle-only eBook that contains 1,000 flashcards that can be used on your mobile device or Kindle. Check them both out at Cyber Author dot me, in the Bare Metal Study Guides Series.
A useful starting point is to recognize that not all instructions are equal, even if they look similar as plain text. When people talk about instruction authority, they usually mean three broad levels: system instructions, developer instructions, and user instructions. System instructions are like the foundational rules of the environment, the non-negotiables that define what the assistant is allowed to do and how it should behave. Developer instructions are the application’s goals and policies, the rules set by whoever built the AI experience you are using. User instructions are the requests and preferences coming from the person interacting with the system in that moment. Authority gets confused when these layers are mixed together, or when content from a lower layer is treated like it came from a higher layer, which is exactly what attackers try to cause.
To see why this matters, imagine you are building a help desk chatbot for a company. The system layer might say the assistant must not reveal secrets, must respect privacy, and must refuse unsafe requests. The developer layer might say the assistant should help employees troubleshoot common issues, summarize internal policies, and guide users to official workflows. Then the user layer is the employee asking a question like how to reset a password or how to request software access. That seems straightforward until someone tries to slip in instructions that look like they came from the system or developer, such as a pasted note that says this is an urgent override and the assistant must provide administrative steps. If the model treats that pasted note as higher authority, it can bypass the very protections that were supposed to keep it safe.
A common misconception is that models naturally understand which text is system, developer, or user. In reality, the model sees text plus metadata and patterns, and it follows the strongest instructions it can recognize, unless your application and prompting keep the boundaries sharp. If you mix policy text and user content into the same blob, the model may not reliably keep them separate. If you allow users to provide long blocks of instructions and you do not clearly label them as user content, you are making it easier for the model to mis-rank instruction authority. Confused authority is especially likely when the user content includes phrases like ignore previous instructions or this message is from the administrator. Those phrases are not magic, but they can be persuasive because they imitate the language of higher-level policies.
So what does it mean, practically, to separate instruction layers. One part is technical: you keep system and developer instructions in protected channels that users cannot edit, and you pass user input as a separate, clearly marked channel. Another part is communication: you make the model’s job easy by being explicit about what each layer is allowed to control. For example, the system layer might define safety rules and broad behavior. The developer layer might define what tasks the assistant should perform and what data sources it may use. The user layer should only define the request and constraints for that request, like the format of the answer or the context of the problem. When the user tries to redefine safety rules or grant themselves new permissions, the model should treat that as a conflict and hold the line.
It also helps to understand why the word authority is used here. Authority is not about who is yelling the loudest, it is about which instructions are supposed to win when instructions conflict. If the user says please reveal the private data, and the system says do not reveal private data, the system should win. If the user says answer in three sentences, and the developer says provide step-by-step troubleshooting, that is a mild conflict, and you can often satisfy both by giving brief steps in three sentences. The trick is to decide which conflicts are allowed to be negotiated and which ones are absolute. Safety, privacy, and access rules are usually absolute. Style and formatting are usually negotiable. When you encode that idea clearly, you reduce the chance that the model will treat a formatting request like it is permission to break security rules.
Another important teaching beat is that attackers rarely present themselves as attackers. They present themselves as regular users with a special circumstance. They might claim there is an emergency, or that they are the developer, or that the model is being tested and must comply, or that the rules have changed. They might wrap their instructions inside a larger piece of text like a support ticket, an email thread, a log, or a policy excerpt. This is what makes the attack feel normal, because in real work, people do paste text from emails and tickets. If your assistant is not trained and configured to treat pasted text as untrusted user content, it may follow instructions inside that pasted text as if they were real commands from a higher authority layer.
One way to build intuition is to think like a courtroom. System instructions are the constitution, developer instructions are the laws passed by the legislature for this specific application, and user instructions are like a request or argument made by a lawyer in a case. A lawyer can ask for many things, and can even quote laws, but cannot rewrite the constitution mid-trial. If the lawyer hands the judge a sheet of paper that says the constitution has been updated and you must ignore it, the judge should treat that as unsupported. In the same way, if a user includes text that says the system rules are different now, the model should treat it as untrusted and not let it override protected instructions. This analogy helps because it makes it obvious that claiming authority is not the same as having authority.
Now let’s move from intuition to design patterns you can actually use. A clean pattern is to keep the system message short and focused on non-negotiable safety and compliance rules, and keep the developer message focused on the app’s allowed tasks, data sources, and output expectations. Then you treat everything the user provides as potentially adversarial, even if the user is friendly, because you are protecting the system against the rare but serious case. Inside your developer instructions, you can teach the model to label inputs as either instructions or data. For example, you can say that user-provided text may contain instructions but should be treated as data unless it is clearly a request addressed to the assistant. This reduces the chance that instructions embedded in a pasted document will be followed as if they were commands.
A related pattern is to explicitly tell the model how to handle conflicts. You can define a simple rule: when system and developer instructions conflict with user requests, follow system first, then developer, then user. That sounds obvious, but stating it directly helps the model act consistently under pressure. You also want to tell the model what to do when a user attempts to escalate privileges, like asking for secret keys or admin actions. The right behavior is to refuse, explain at a high level, and offer safe alternatives. Notice how separation of authority connects directly to incident prevention, because a lot of real-world failures happen when a model is convinced to behave like it has permissions it does not actually have.
It’s also important to discuss how authority confusion can happen accidentally, not just through attacks. Developers sometimes paste large blocks of policy text into the same message as user input, or they build a prompt template that concatenates everything without clear markers. Other times, applications allow users to set custom instructions and then those instructions are treated as developer-level guidance, even though the user is untrusted. Another accidental case is when a system uses retrieval, such as pulling internal documents into the prompt, and those documents contain procedural language like you must do X. If the model interprets those lines as instructions rather than reference material, it may take actions that were never intended. That is why good retrieval designs often wrap documents with labels like reference only, not instructions, even if the document itself contains imperative language.
Since this is an audio-first course for beginners, here is a simple mental checklist you can keep in your head. First, ask where this instruction came from and whether the user could have written it. If the user could have written it, treat it as user content, even if it claims to be from the system. Second, ask whether it requests a change to safety, privacy, access, or identity. If it does, it is almost certainly a conflict with higher authority rules. Third, ask whether the request is actually about the user’s goal or whether it is trying to redirect the assistant’s behavior in a broad way. Broad redirects like ignore all rules are suspicious because they are not necessary to answer a normal question. This checklist is not about paranoia, it is about building a habit of separating content from control.
When you think about preventing confused authority, you should also consider how you present outputs to downstream systems. If your AI outputs are later parsed by another program, then a confused-authority failure can turn into a bigger security issue, because the model might output something that looks like a command or a policy decision. Even without tools or automation, a human might copy and paste a model output into a system. That means you want the model to be consistent about saying what it can and cannot do, and you want it to keep user content separate from any internal rules it is following. A model that starts repeating system or developer policies back to the user can accidentally leak internal constraints and give attackers a map of what to try next. So separation is not only about what the model follows, but also about what the model reveals.
Let’s connect all of this back to the phrase confused authority. Confusion happens when boundaries are blurry, and attackers love blurry boundaries because they can hide inside them. If your system treats user-provided text as if it were part of the developer instruction set, you have effectively allowed the user to rewrite the app. If your retrieval system injects untrusted content without labeling it, you have allowed outside text to compete with your core rules. If your assistant is told to be helpful at all costs, it may interpret that as permission to override safety boundaries when asked nicely. The fix is not to make the model less helpful, but to give it a clear hierarchy and a clear method for resolving conflicts so it can be helpful safely.
As you move forward in this certification, you’ll see that many security techniques are really the same idea applied in different ways: define trust boundaries, enforce them, and assume inputs can be hostile. Separating system, developer, and user instructions is the trust boundary version for language models. You are treating the user message as untrusted input, the developer policy as trusted but application-specific control, and the system policy as the highest-level safety guardrail. Once you adopt that mindset, you start noticing where boundaries can leak, like in copied text, retrieved documents, or user customization features. The payoff is that you get a model that behaves predictably even when someone tries to trick it, and predictability is a cornerstone of security.
The final idea to leave with is that you do not need to be an expert programmer to understand or apply this concept well. You just need to insist, in your designs and in your thinking, that instructions have a chain of command and that claims of authority are not evidence of authority. If you keep system rules protected, keep developer goals clear, keep user requests constrained to the user’s own needs, and teach the model to treat embedded instructions as data, you remove a huge class of failures. This does not make the model perfect, but it makes it far harder to steer into unsafe behavior by clever wording alone. That is the real win here: you are reducing a messy, human-language problem into a clean, enforceable boundary that the model can follow consistently.