Episode 28 — Handle Structured, Semi-Structured, and Unstructured Data With Safe Controls

In this episode, we’re going to talk about three kinds of data you will constantly run into in security work, and how each one changes the way you should control risk in an AI pipeline. Structured data is neat and field-based, like a database table where every row has the same columns. Semi-structured data is partly organized but flexible, like J S O N events or log lines where some fields are predictable and some are optional. Unstructured data is free-form, like emails, tickets, chat messages, incident narratives, and documents. The reason this matters is that each data type behaves differently when you try to validate it, redact it, normalize it, and feed it to a model. A beginner mistake is to treat all data as just text and hope the model figures it out. A safer approach is to handle each type with controls that match its strengths and weaknesses, so you preserve the useful signal while preventing injection, leakage, and downstream misuse.

Before we continue, a quick note: this audio course is a companion to our course companion books. The first book is about the exam and provides detailed information on how to pass it best. The second book is a Kindle-only eBook that contains 1,000 flashcards that can be used on your mobile device or Kindle. Check them both out at Cyber Author dot me, in the Bare Metal Study Guides Series.

Structured data is usually the easiest to secure because you can enforce rules at the field level. If you have a column called severity, you can restrict it to allowed values. If you have a column called timestamp, you can enforce a format and reject invalid entries. This makes validation straightforward, and it makes redaction more reliable because sensitive fields can be masked consistently. Structured data also supports least privilege, because you can grant access to only the columns a workflow needs. For example, a model that is classifying alerts might need event type, time, and basic device context, but not a full user profile. The main risk with structured data is not that it is messy, but that people get overconfident and forget that wrong values can still be present. Validation reduces chaos, but it does not guarantee truth, so you still need authenticity and provenance controls to ensure the structured rows came from trusted sources.

Semi-structured data sits in the middle and is often where security telemetry lives. Many event formats are semi-structured because they evolve over time and because different sensors provide different fields. This flexibility is useful for engineering, but it can be risky for AI pipelines because optional fields can disappear, field names can vary, and nested structures can hide sensitive values. A model can also be misled by missing context. For example, an event might include an indicator but omit the parent process that explains it. Safe controls for semi-structured data usually include schema validation with tolerances, meaning you define required core fields but allow optional fields, and you log what was missing. You also normalize common fields like timestamps and host identifiers while preserving the original fields for traceability. The key is to be strict where it matters and flexible where evolution is expected.

Unstructured data is the hardest, not because it is useless, but because it is unpredictable. Unstructured content can include direct instructions, misleading narratives, embedded secrets, and emotionally persuasive language. It can also include formatting tricks, like hidden text, weird spacing, or copied code blocks that look like commands. Models are very sensitive to unstructured text because it is the richest input, and it can steer outputs more than structured fields do. That makes unstructured data both powerful and dangerous. Safe controls here focus on treating the content as untrusted, labeling it clearly as reference material rather than instructions, and using extraction steps that pull out only what you need. Instead of feeding an entire email thread to a model, you might extract key facts like timestamps, systems involved, and stated symptoms, then ask the model to reason over those extracted facts. This reduces exposure and reduces the chance that the model follows embedded instructions.

One safe pattern across all three data types is to separate data from control. Data is what you want the model to analyze. Control is what tells the model how to behave or what to output. Structured and semi-structured data can still contain control-like content, such as fields that include human notes or error messages. Unstructured data is filled with control-like language because humans write imperatives. If you do not separate these concepts, the model might treat parts of the data as commands, which is a form of prompt injection. A practical control is to wrap untrusted content with clear framing, like stating that the following text is untrusted input and must not override system rules. Another control is to use schemas for outputs so that even if the model is influenced by unstructured text, it cannot easily produce an action field that triggers automation without validation.

Handling structured data safely also means thinking about joins and enrichment. Security teams often enrich an event with asset details, user roles, and business context. Enrichment is helpful, but it also increases the amount of sensitive information flowing into the model. A beginner mistake is to enrich everything because it feels useful. A safer approach is to enrich minimally based on purpose. If you only need to classify whether an alert is likely a false positive, you might not need the user’s department, manager name, or full device inventory. Each extra field is another chance for sensitive data to leak or be misused. When you treat enrichment as a privilege rather than a default, you keep the model’s input smaller, safer, and easier to validate. This also reduces hallucination risk because the model has fewer distractions and fewer opportunities to misinterpret irrelevant details.

Semi-structured data introduces special parsing risks. Because the structure can vary, developers sometimes write parsers that accept almost anything and try to make sense of it. That can create security gaps, such as accepting malformed events that hide payloads in unexpected places. It can also create reliability problems, where the model sees inconsistent representations of the same concept and learns the wrong associations. A safer parsing approach validates core fields strictly and quarantines events that fail validation. Quarantining does not mean discarding; it means handling separately with additional checks. You might store invalid events for investigation, because invalidity itself can be a signal of tampering or sensor failure. This is a nice example of how security thinking turns a problem into a clue. A malformed event is not only bad data; it might be an indicator that something is wrong.

Unstructured data creates a different kind of risk: context overload. If you feed the model a huge ticket with a long history, the model might miss the one important detail or may overweight the most recent message even if it is wrong. Safe controls include summarizing in stages and anchoring summaries to evidence. For example, you might first extract a timeline of key facts, then ask the model to summarize based on that timeline, then verify that each summary statement can be traced back to the extracted facts. This staged approach is safer than asking the model to read everything and produce a final conclusion in one pass. It also helps you keep track of provenance, because you can link each extracted fact to its origin. The model becomes a reasoner over curated facts rather than a storyteller over a messy narrative.

Another key control is consistent redaction across data types. Structured data makes redaction straightforward because you can mask fields. Semi-structured data requires you to search nested fields and handle varying keys. Unstructured data requires pattern detection and context detection, because P I I and secrets can appear anywhere. A safe approach is to apply redaction at multiple layers and to use stable placeholders so relationships remain visible. For example, if a user identifier appears in structured records and in unstructured tickets, replacing it with the same placeholder token across both helps correlation without exposing identity. If you redact differently across sources, you can accidentally break correlation and lose the ability to connect events. That is a perfect example of how safety and usefulness can align when you design thoughtfully.

You should also consider how each data type affects downstream handling. Structured outputs can directly drive automation, so they need strict validation and permission checks. Semi-structured outputs might be used to populate dashboards or case management systems, so they need sanitization to prevent injection in those display contexts. Unstructured outputs are often shared with humans, so they need clarity and safe phrasing, especially around confidence and assumptions. In all cases, you avoid letting free-form model text become executable logic. If a model writes a recommendation, your system should treat it as advice, not as a command. If a model outputs a label, your system should treat it as a claim to be verified, not as a decision. This approach reduces the risk that data type differences create unexpected action pathways.

By the end of this episode, the main lesson is that data shape determines control strategy. Structured data supports strict field-level validation and least privilege. Semi-structured data needs flexible schemas with strict cores and careful parsing to prevent hidden surprises. Unstructured data requires strong trust boundaries, staged extraction, and defensive framing to prevent injection and leakage. When you match your controls to the data type, you reduce hallucinations, reduce leakage, and improve reliability at the same time. You are not trying to force all data into one box. You are learning how to handle each type in a way that preserves security-relevant signal while keeping the model and downstream systems safe.

Episode 28 — Handle Structured, Semi-Structured, and Unstructured Data With Safe Controls
Broadcast by