Episode 26 — Clean and Normalize Data Without Losing Security-Relevant Signal and Context

In this episode, we’re going to tackle a task that sounds like simple housekeeping but can make or break AI security outcomes: cleaning and normalizing data. Cleaning usually means removing junk, fixing inconsistent formats, and making inputs easier to analyze. Normalizing usually means making the same kind of information look the same everywhere, such as converting timestamps to a single format or standardizing hostnames. In everyday life, cleaning feels obviously good, like tidying up a messy desk. In security, though, the mess often contains clues. If you scrub too aggressively, you can erase the security-relevant signal and the context that tells you what something really means. The practical goal is to learn how to clean data in a way that reduces noise and makes analysis easier, while preserving the details that matter for detection, triage, and investigation. That balance is a core skill because AI systems are very sensitive to what you include, what you remove, and what you collapse into a single standardized form.

Before we continue, a quick note: this audio course is a companion to our two course books. The first book covers the exam and gives detailed guidance on how best to pass it. The second is a Kindle-only eBook with 1,000 flashcards you can use on your mobile device or Kindle. Check them both out at Cyber Author dot me, in the Bare Metal Study Guides Series.

To understand the risk, it helps to picture security data as a story told by many imperfect witnesses. Logs, alerts, tickets, and notes are not clean because they were not created to be pretty; they were created to capture events under constraints. A timestamp with a timezone, a process path, a user name with odd characters, or a slightly inconsistent field name might seem like clutter, but it can also be the difference between benign and malicious. When you normalize, you are deciding which differences matter and which differences are superficial. If you decide incorrectly, you can accidentally hide an attacker’s behavior inside a category that looks normal. This is why security-focused cleaning is not the same as cleaning for marketing data or customer surveys. The goal is not maximum neatness. The goal is maximum usefulness without distortion.

A safe starting point is to separate cosmetic inconsistency from semantic meaning. Cosmetic inconsistency is when something looks different but means the same thing, like different date formats that represent the same moment, or capitalization differences in a hostname that do not change the actual host. Semantic meaning is when the difference itself is important, like a process running from an unusual directory, an account name that contains a suspicious pattern, or a command line argument that indicates data exfiltration. When you clean, fix cosmetic inconsistency aggressively, because that improves analysis and reduces confusion, but be cautious with semantic details, because those details often hold the security signal. If you remember that distinction, you will naturally avoid the most damaging type of over-cleaning, which is cleaning away the meaning.

One of the most practical patterns is to keep an original copy of the raw data and then produce a cleaned version as a separate artifact. Beginners sometimes overwrite the data as they clean it, but in security you want to preserve the raw form so you can go back and verify what was actually observed. This is also helpful when a model produces an odd conclusion and you want to check whether the cleaning step removed something the model would have needed. Think of it like taking a photo of a messy room before you tidy it, so you can prove what was there. If the cleaned data is used for model input, the raw data becomes your ground truth for auditing and for responding to questions later. This separation also reduces the temptation to keep cleaning until the data looks perfect, because you know you can always refer back to the original.
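Here is one way that separation can look in practice, as a minimal Python sketch; the file names and the single cosmetic fix are invented for illustration, not a fixed recipe.

```python
import json
from pathlib import Path

# Hypothetical file names for illustration; real pipelines will differ.
RAW_PATH = Path("auth_events.raw.jsonl")
CLEAN_PATH = Path("auth_events.clean.jsonl")

def clean_record(record: dict) -> dict:
    """Apply cleaning to a copy so the raw record is never mutated."""
    cleaned = dict(record)
    # Example cosmetic fix: standardize hostname capitalization.
    if isinstance(cleaned.get("hostname"), str):
        cleaned["hostname"] = cleaned["hostname"].lower()
    return cleaned

def build_clean_artifact() -> None:
    # Write cleaned data to a separate file; the raw file is untouched
    # and remains the ground truth for later auditing.
    with RAW_PATH.open() as raw, CLEAN_PATH.open("w") as out:
        for line in raw:
            record = json.loads(line)
            out.write(json.dumps(clean_record(record)) + "\n")
```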

Normalization often starts with timestamps, and timestamps are a classic place where well-meaning normalization can create errors. Converting everything to a standard format is useful, but you must preserve timezone context and the difference between event time and ingestion time. Security investigations depend on ordering, and ordering depends on correct time interpretation. If you strip timezone offsets, or if you mix sources that report local time with sources that report U T C without preserving that distinction, you can create a timeline that looks coherent but is wrong. Models are very good at telling a story from a timeline, so if you feed them a distorted timeline, they will produce a confident narrative that is based on your own normalization mistake. A safe approach is to normalize times into a single standard while also keeping the original representation and clearly labeling what the time represents.
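To make this concrete, here is a minimal Python sketch of that approach; the field names are invented, and it assumes inputs that carry an explicit offset rather than guessing one.

```python
from datetime import datetime, timezone

def normalize_timestamp(raw_value: str, time_kind: str) -> dict:
    """Convert a timestamp to UTC while keeping the original string
    and a label for what the time represents ("event" or "ingestion")."""
    parsed = datetime.fromisoformat(raw_value)
    if parsed.tzinfo is None:
        # Refuse to guess: an offset-free time is flagged, not "fixed".
        return {"original": raw_value, "normalized": None,
                "time_kind": time_kind, "note": "no timezone offset; kept raw"}
    return {
        "original": raw_value,  # preserve exactly what was observed
        "normalized": parsed.astimezone(timezone.utc).isoformat(),
        "time_kind": time_kind,  # event time vs ingestion time
    }

# A local-time event becomes 2024-05-01T18:03:22+00:00 in UTC,
# but the original offset stays visible for the investigator.
print(normalize_timestamp("2024-05-01T14:03:22-04:00", "event"))
```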

Another high-value normalization target is identifiers, such as hostnames, user accounts, and asset IDs. Standardizing these improves correlation across systems. But you should not flatten away meaningful differences. For example, two accounts that differ by one character might represent two different identities, or one might be a spoof. Two hostnames that look similar might actually be different environments, like production versus test. Normalization should aim to map multiple representations of the same entity to one canonical representation, but only when you have evidence that they are the same. Otherwise, you risk merging distinct entities and losing the ability to detect lateral movement or impersonation. In security terms, premature merging is like assuming two eyewitnesses are talking about the same person because they used similar descriptions, when in fact they are describing different people.
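A cautious version of that mapping might look like the following sketch, where a verified alias table stands in for the evidence that two representations really are the same entity; the table contents and function name are invented for illustration.

```python
# A verified alias map is the "evidence" that two representations refer
# to the same entity; anything not in the map is left alone, not merged.
VERIFIED_ALIASES = {
    "web01.corp.example.com": "web01",
    "WEB01": "web01",
}

def canonicalize_host(observed: str) -> dict:
    canonical = VERIFIED_ALIASES.get(observed)
    if canonical is None:
        # No evidence of identity: keep the observed value distinct so
        # near-matches (possible spoofs) are not silently merged.
        return {"observed": observed, "canonical": observed, "merged": False}
    return {"observed": observed, "canonical": canonical, "merged": True}

print(canonicalize_host("WEB01"))   # merged via a verified alias
print(canonicalize_host("web0l"))   # lookalike stays separate
```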

Cleaning also often involves removing duplicates and reducing repetitive noise, which can be helpful because security telemetry can be extremely chatty. If you feed a model thousands of identical events, it may focus on volume rather than the unique signal, and it may miss the one event that is different. Deduplication can reduce that risk, but it can also remove useful frequency information. Sometimes the fact that an event repeated a hundred times is the signal, because it indicates scanning, brute force, or automated behavior. A safer pattern is to deduplicate while keeping summary statistics, such as how many times the event occurred and over what time window. That way, you reduce the input size while preserving the behavioral context. You are compressing the data without erasing what matters about it.
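Here is a small sketch of that pattern in Python, with hypothetical field names: duplicates collapse into one record that keeps the count and the first-seen and last-seen times, so repetition itself remains visible.

```python
from collections import defaultdict

def deduplicate_with_counts(events: list[dict]) -> list[dict]:
    """Collapse identical events but keep count and time window,
    so repetition stays available as a behavioral signal."""
    groups = defaultdict(list)
    for event in events:
        key = (event["source"], event["action"], event["target"])
        groups[key].append(event["timestamp"])
    summary = []
    for (source, action, target), times in groups.items():
        summary.append({
            "source": source, "action": action, "target": target,
            "count": len(times),        # how many times it happened
            "first_seen": min(times),   # over what time window
            "last_seen": max(times),
        })
    return summary
```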

Redaction is another cleaning activity that is essential for privacy and safety, but it can also damage security context if done poorly. Removing sensitive fields like full names, email addresses, and internal IP addresses may be required, but investigators often need relationships, like which user did what on which device. A safe redaction approach preserves structure by replacing sensitive values with consistent placeholders rather than deleting them. For example, you can replace a user identifier with a stable token that is the same across the dataset, so you can still track that one user’s actions without exposing their identity. The model can then reason about patterns and correlations without seeing the sensitive raw value. This is a good example of cleaning that improves safety without losing signal, because you are removing exposure while preserving analytical usefulness.
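One common way to build stable placeholders is a keyed hash, as in this sketch; the key value and token prefix are placeholders, and a real deployment would manage the key as a secret.

```python
import hmac
import hashlib

# The key must stay secret and stable for the dataset; otherwise tokens
# either become guessable or stop correlating across records.
REDACTION_KEY = b"replace-with-a-secret-key"  # placeholder for illustration

def pseudonymize(value: str) -> str:
    """Replace a sensitive value with a stable placeholder token.
    The same input always yields the same token, so one user's actions
    can still be tracked across the dataset without exposing identity."""
    digest = hmac.new(REDACTION_KEY, value.encode(), hashlib.sha256)
    return "user-" + digest.hexdigest()[:12]

# Both calls yield the same token, preserving the relationship.
print(pseudonymize("alice@example.com"))
print(pseudonymize("alice@example.com"))
```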

Another subtle cleaning issue is how you handle text fields like error messages, command lines, file paths, and alert descriptions. It can be tempting to strip punctuation, remove special characters, or truncate long strings to fit into limits. But in security, punctuation and structure often carry meaning. A file path indicates where a file came from. A command line argument can reveal intent. An error message might contain a specific code that identifies a technique. If you must truncate, you should do it in a way that preserves the most informative parts, such as keeping both the beginning and end of a long command line rather than only the beginning. If you must sanitize, you should escape dangerous characters for display rather than deleting them outright. The goal is to keep the semantic shape of the text intact so the model and the human can still interpret it.
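A head-and-tail truncation can be as simple as the following sketch; the length limit and the elision marker are arbitrary choices for illustration.

```python
def truncate_preserving_ends(text: str, limit: int = 200) -> str:
    """Truncate a long string while keeping both the beginning and the
    end, since command lines often hide the telling argument at the tail."""
    if len(text) <= limit:
        return text
    half = (limit - 5) // 2                      # leave room for the marker
    return text[:half] + "[...]" + text[-half:]  # marker shows an elision

long_cmd = "powershell.exe -nop -w hidden " + "A" * 500 + " -enc PAYLOAD"
print(truncate_preserving_ends(long_cmd))
```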

Normalization also connects to bias and blind spots. If your normalization process was built around what normal looks like in your environment, it may unintentionally treat unusual but important events as formatting errors and “correct” them into normal-looking values. For example, you might normalize domain names or process names to a standard pattern and accidentally mask slight variations that indicate typosquatting or masquerading. Attackers rely on small differences that humans may not notice and automated systems may collapse. A safer approach is to preserve the original value alongside the normalized value and to flag normalization changes that were not exact matches. That way, a model can still see that something was altered by normalization and may need attention. You are making the cleaning process transparent instead of silently rewriting the evidence.
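Here is a sketch of that idea for domain names: normalization still happens, but the original value travels with the result and a flag records whether anything actually changed, so nothing is rewritten silently.

```python
import unicodedata

def normalize_domain(observed: str) -> dict:
    """Lowercase and trim a domain, but record whether normalization
    changed anything, so lookalikes are never silently rewritten."""
    normalized = unicodedata.normalize("NFKC", observed).strip().rstrip(".").lower()
    return {
        "original": observed,               # the evidence stays visible
        "normalized": normalized,
        "changed": normalized != observed,  # flag for downstream review
    }

print(normalize_domain("Example.COM."))  # cosmetic change, flagged
print(normalize_domain("examp1e.com"))   # typosquat survives untouched
```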

A very practical verification pattern for cleaning is to test whether security-relevant indicators survive the transformation. If you know common indicators you care about, like unusual parent-child process relationships, rare destinations, or repeated authentication failures, you can check whether the cleaned data still contains the fields and relationships that would reveal those indicators. If your cleaning step removes the field that contains the destination, you have just blinded the analysis. If your normalization step merges two distinct accounts, you have just blurred identity. Even beginners can apply this by thinking through simple questions like whether you can still answer who, what, where, and when after cleaning. If you cannot answer one of those, you probably removed too much context.
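Even a very simple check like the following sketch can catch over-cleaning; the required field names are hypothetical and would come from your own schema.

```python
# Hypothetical fields answering who, what, where, and when.
REQUIRED_FIELDS = {
    "who": "user",
    "what": "action",
    "where": "destination",
    "when": "timestamp",
}

def verify_cleaning(cleaned_records: list[dict]) -> list[str]:
    """Check that the fields needed to answer the basic investigative
    questions survived the cleaning step; return any gaps found."""
    problems = []
    for i, record in enumerate(cleaned_records):
        for question, field in REQUIRED_FIELDS.items():
            if not record.get(field):
                problems.append(
                    f"record {i}: cannot answer '{question}' "
                    f"(field '{field}' missing or empty after cleaning)")
    return problems
```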

It also helps to design your cleaning pipeline so that it is reversible or at least traceable. Reversible does not always mean you can fully reconstruct the raw data, especially after redaction, but it means you know exactly what you changed. Traceability means you record transformations, such as which fields were standardized, which values were replaced, and which records were dropped. This connects directly to provenance tracking, because you want to be able to explain why the model did or did not see a certain detail. If a model output is questioned, you can point to the cleaning steps and show that a particular field was intentionally removed for privacy or that a timestamp was converted to a standard. Without this, cleaning becomes a black box and black boxes are dangerous in security because they hide both mistakes and intentional tampering.
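A lightweight way to get that traceability is to route every change through a step that logs it, as in this sketch; the class and method names are invented for illustration.

```python
import json
from datetime import datetime, timezone

class TraceableCleaner:
    """Wraps cleaning steps so every change is recorded as a
    transformation log entry alongside the cleaned output."""

    def __init__(self):
        self.log = []

    def apply(self, record: dict, field: str, step_name: str, func) -> dict:
        before = record.get(field)
        after = func(before) if before is not None else None
        if after != before:
            self.log.append({
                "step": step_name, "field": field,
                "before": before, "after": after,
                "at": datetime.now(timezone.utc).isoformat(),
            })
        record[field] = after
        return record

cleaner = TraceableCleaner()
event = {"hostname": "WEB01.Corp.Example.COM"}
cleaner.apply(event, "hostname", "lowercase_hostname", str.lower)
print(json.dumps(cleaner.log, indent=2))  # the audit trail of what changed
```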

By the time you finish this episode, the main lesson should feel like a mature version of a simple idea: clean to clarify, not to sanitize away reality. Normalize to support correlation, not to flatten away differences that might be the signal. Keep raw data separate from cleaned data, preserve original values when you normalize, and summarize duplicates rather than deleting the behavioral evidence they represent. When you redact, replace with stable placeholders so relationships remain visible. Above all, treat cleaning as a security-sensitive transformation, not a purely technical convenience. If you do that, you will feed models inputs that are smaller, clearer, and safer, while still preserving the clues that make security analysis possible in the first place.
