Generative AI systems often feel ready long before they actually are. They work internally and look impressive during demos. But once exposed to real users, the risk profile changes. And, sometimes, dramatically.

A controlled environment becomes unpredictable, because usage patterns start shifting, public inputs vary, and the system is pushed in ways it wasn’t designed for.

This is where generative AI security stops being about the model itself and becomes a system-level concern.

And this matters especially for companies deploying AI assistants, RAG systems, customer-facing chatbots, AI agents, or LLM-powered workflow features where the model can access data, call tools, or influence user decisions.

In this article, I break down where real risks come from, what teams tend to overlook, and what needs to be in place before going to production.

What actually changes when a GenAI product goes public

Although it may sound surprising, the biggest change we’re looking at here is contextual. Pure technicalities don’t offer the full picture. 

That’s because there are two different transitions happening at the same time – quickly scaling from, say, 20 to 200 or more users, and going from internal users to the public. They often come together, but introduce very different risks.

Transition 1: From controlled internal usage to unpredictable real-world inputs

Internal tools operate in a controlled setting. The user base is small, known, and accountable. People have at least some understanding of how the system works, and how it should be used.

Public products don’t have that buffer. They’re tested through a variety of use cases and can be exposed by certain situations. These include:

  • a wide range of inputs, including unexpected ones at the design and development stage
  • automated probing and repeated attempts to find weaknesses
  • people sharing what works and what breaks in a product (even unintentionally, to vent on a user forum).

It’s a completely different environment than a closed system, one much closer to an adversarial one.

Transition 2: From known users to broader exposure and misuse risk

When building internal tools, teams naturally test the “happy path”. These are the scenarios where the system behaves as expected. The less pleasant cases tend to come later.

There’s also a testing bias here. Systems are usually validated internally or with colleagues, i.e., people who already understand the context. It creates a false sense of confidence, because there’s no truly objective testing from an outside perspective.

That’s partly because most projects start with a quick demo or proof of concept. Speed matters, budgets matter, and safety often ends up being postponed.

The problem is that once the system becomes public, those postponed scenarios become the default.

In many cases, things break because users don’t understand the system. So:

  • they don’t know how to prompt it
  • they don’t know its limits
  • they don’t know what kind of input it expects.

This sort of gap leads to incidents that create security problems, even if no one intended harm.

The three layers of generative AI security risk

In my experience, the conversation on generative AI security risks often gravitates toward the model itself – particularly, jailbreaks and unsafe outputs. That’s a useful starting point, but it doesn’t reflect where most real-world issues originate.

Risks appear across three areas: the model, the integrations around it, and the way the system is operated.

Model-level risks: hallucinations, jailbreaks, and unsafe outputs

Model providers invest heavily in this layer. Techniques like alignment, red teaming, and model hardening are designed to make models more resistant to misuse. For most product teams, this isn’t where direct control lies. 

Out of this category of risks, there’s one area that requires special attention, i.e., AI hallucinations. I discuss them in detail in the next section.

Integration-level risks: tools, APIs, permissions, and data access

The moment a model is connected to tools, data, or external systems, the risk profile also changes. At this point, the system can query databases, call APIs, write data, or trigger workflows. This is also where system design starts to matter more than model behavior.

If a team misconfigures permissions, grants too broad of an access, or connects components poorly, it can lead to outcomes that are difficult to predict and even harder to control. 

Adding capabilities increases exposure (and, by extension, what can go wrong). Once the model is wrapped in an interface and exposed through APIs, it becomes subject to the same kinds of attacks as any other software system. Standard application-level vulnerabilities start to apply alongside AI-specific ones.

Suffice it to say, these types of risks were ranked as the number one security risk in OWASP’s 2025 Top 10 list.

Operational risks: logging, monitoring, access control, and incident response

The operational layer is less visible, but just as important. Logging, monitoring, access control, and incident response determine how well a system can handle issues once they occur. In many projects, these elements are addressed late (if at all) because the focus is on getting the system to work.

Without the right operational measures, even relatively simple issues become difficult to diagnose. There’s no clear visibility into what the model did, what it accessed, or how a particular outcome came to be.

Basic controls, such as user authentication, can also play a meaningful role. Beyond access management, they introduce accountability, i.e., they reduce anonymous misuse and make behavior easier to trace.

Risk layer What it covers Why it matters in user-facing products
Model-level risk Hallucinations, jailbreaks, unsafe or misleading outputs The model can produce incorrect, unsafe, or unsupported responses during normal user interactions.
Integration-level risk APIs, tools, databases, permissions, external systems Poorly scoped access can turn model mistakes or malicious inputs into real actions or data exposure.
Operational risk Logging, monitoring, authentication, access control, incident response Without visibility and accountability, teams may not know what happened, what the model accessed, or how to respond.

GenAI attack vectors your team might not be thinking about

Adversarial suffix injections

These are algorithmically generated character sequences that can push a model into unsafe behavior.

They may not look like normal abuse, so prompt-based protections can miss them.

How to reduce the risk: Add input constraints, anomaly detection, model-level safeguards, and layered monitoring.

There’s a class of techniques that, unlike prompt injections, are algorithmic rather than conversational. One example is adversarial suffix injection.

Instead of trying to “convince” the model, these attacks generate specific sequences of characters that don’t make sense to a human. They’re not words in any language and they look like noise, but they’re not random.

They’re optimized in a way that, when appended to a prompt, changes how the model behaves. The model can suddenly start producing outputs it would normally refuse.

A useful way to think about it is that when processing a prompt, the model's internal reasoning moves through a massive space of possible activation states. Depending on where the input steers it within that landscape, the model behaves differently, becoming more or less resistant to certain instructions. 

While safety training normally anchors the model inside a designated 'safe' region, these adversarial suffixes act like precise mathematical detours, pushing the model's internal processing path completely out of its safe zone.

Once that happens, the usual safeguards don’t apply in the same way. The model hasn’t been convinced to break the rules because it’s been nudged into a state where those rules are no longer enforced as expected.

What makes it dangerous in production:

These attacks are difficult to detect because they don’t resemble typical misuse. Traditional guardrails and prompt-based protections don’t reliably catch them. In practice, it’s not just about detectability. It’s the fact that very few systems even include dedicated components to detect this class of attacks, as they introduce additional time and cost overhead. 

On top of that, awareness is still low, and executing such attacks isn’t trivial: they often require knowledge of the specific model being targeted, or running multiple tailored attempts across different models. As a result, many teams either don’t know about the risk or assume the probability of a successful attack is too low to justify investing in protection. In a production system, that means defenses built only around “what the user is asking” can be bypassed entirely.

AI hallucinations with security consequences

Model-generated information that sounds plausible but is false, unsupported, or misleading.

In user-facing products, hallucinations can lead to wrong decisions, incorrect workflow triggers, misleading user guidance, or actions based on information that does not exist.

How to reduce the risk: Measure hallucination patterns, validate outputs before they affect other systems, add human review for high-impact actions, and use grounding, retrieval checks, or second-model verification where appropriate.

Hallucinations are often treated as a quality issue, but as a Machine Learning expert, I must say they behave much closer to a security risk. They don’t require an attacker, nor do they depend on edge cases. They happen during normal usage because of how these systems work.

Large language models are designed to predict the next token based on previous ones. They are not built to verify facts or guarantee correctness. In fact, OpenAI researchers admitted that AI systems hallucinate, because the methods they’re trained with reward taking guesses over acknowledging lack of information. When we use them as decision-making tools, assistants, or sources of truth, we’re stretching them beyond their original purpose.

To date, there are no 100% hallucination-free systems. The goal for genAI security experts should be to understand where these issues matter and how they propagate through the system.

In some cases, the impact is small. In others, it directly translates into cost, incorrect actions, or broken processes. A system suggesting something that doesn’t exist, triggering the wrong workflow, or returning misleading information can be just as damaging as a traditional vulnerability.

Now, the reason why I believe they’re the most important model-related risk is that hallucinations occur regularly. Compared to jailbreaks, which require intent and a formidable level of expertise, they’re something every user encounters.

Attachments and retrieved documents as an attack surface

These are uploaded files, retrieved documents, or external content that become part of the context the model uses to generate a response.

Attachments can contain hidden or indirect instructions that influence model behavior without the user explicitly entering a malicious prompt. This makes attacks harder to detect, trace, and control.

How to reduce the risk: Treat all uploaded and retrieved content as untrusted input, separate trusted instructions from document content, scan files before processing, limit what document-based context can trigger, and monitor unusual model behavior.

The ability to upload a direct attachment introduces threats of its own, because the model is no longer relying only on user prompts, it’s also consuming external documents. And those documents are not inherently trustworthy.

They can come from internal knowledge bases, uploaded files, or third-party sources, but from a security perspective, they should all be treated as untrusted input.

Every retrieved document becomes part of the prompt the model sees. That means any instructions, patterns, or manipulations embedded in those documents can influence the model’s behavior, often without visibility at the application level. In that sense, retrieval acts as an indirect injection channel. The user doesn’t have to introduce anything explicitly, because the system can do it on their behalf.

What makes it dangerous in production:

Malicious attachments can silently influence model behavior, making attacks harder to detect and trace. In production, this creates a risk where the system effectively injects itself.

Fine-tuning as a security risk

The process of adapting a model with additional domain-specific data, which can change not only what the model knows but also how it behaves.

Fine-tuning can shift the model away from its original safety balance, making it less predictable in edge cases and potentially weakening safeguards that were present in the base model.

How to reduce the risk: Evaluate the model after every fine-tuning round, test for regressions in safety behavior, use carefully curated training data, avoid unnecessary over-tuning, and monitor how the model behaves once deployed.

Instead of giving the model access to new information like in the case of RAG, you fine tune it so it actually contains that knowledge. Sometimes that’s necessary, because there are cases where retrieval isn’t enough, and you want the model to operate directly within a specific domain.

But there’s a tradeoff that’s easy to miss. When open-weight and open-source providers launch a model, they’re releasing something that sits in a very particular “point”, which is balance between usefulness and safety. A lot of effort goes into making sure the model is reasonably resistant to hallucinations, misuse, or leaking information. When you fine-tune, you move that point.

You’re adding knowledge, but at the same time you’re shifting the model away from the place where it was originally the most stable. And the more you fine-tune, the further you move it.

At some point, the model can start to “forget” some of those constraints. Not in an obvious way, but in how it behaves in edge cases, how it reacts to unusual inputs, or how strict it is about things it previously refused to do.

What makes it dangerous in production:

Fine-tuning alters how the model behaves by adding knowledge. Over time, it can weaken the safeguards that were built into the original version. In production, that means a model that looks more tailored and capable, but is also less predictable and harder to control.

Generative AI security checklist: 9 safeguards before production

Security in GenAI systems doesn’t come from a single mechanism. It’s the result of multiple layers working together, some familiar from traditional systems, others specific to how models behave.

Here are some of the main patterns that can help make the biggest difference:

  • Least-privilege access. Models should only be able to do exactly what the task requires – nothing more. The same applies to users. It’s over-scoped permissions, whether for humans or models, that turn small issues into real incidents.
  • Input validation and output handling. We should treat model outputs as untrusted data before they interact with other systems. What the model generates should be validated just like any external input.
  • Prompt architecture discipline. As I’ve already explained, not all context is equal. System instructions, user input, and retrieved content should be clearly separated. Mixing trusted and untrusted inputs in a single prompt increases the risk of unintended behavior.
  • Working with hallucinations. This starts with proper evaluation, i.e., measuring how often hallucinations occur and under what conditions. From there, teams can introduce safeguards like hallucination detectors or feedback loops. In more advanced setups, a second model can be used to verify outputs before they’re accepted. In practice, though, much of this is still handled heuristically rather than through robust, standardized methods.
    Note: Approaches like post-hoc validation (checking outputs after generation) or using an “LLM as a judge” setup, where a second model double-checks the first, are a popular method, though they’re more of a workaround rather than a fundamental solution to the problem. 
  • Human-in-the-loop gates. For high-impact actions (especially in highly-regulated industries), the model should suggest actions but not execute autonomously. Keeping a human in the decision loop significantly reduces the risk of irreversible errors.
  • Agentic decomposition. Instead of giving the model one large, complex instruction, break the task into smaller steps. Models handle simpler, well-scoped tasks more reliably. This reduces both hallucinations and the chance of unexpected behavior.
  • Adversarial suffix filtering. Filtering for suffix patterns, or constraining inputs at the token level, can help mitigate this class of attacks. It’s an evolving area worth keeping an eye on.
  • Logging and observability. Beyond standard request logs, it’s important to track model inputs, outputs, tool usage, and (where possible) intermediate reasoning steps. This is essential for debugging, auditing, and responding to incidents.
  • Traditional application and endpoint security. They require the same level of attention as in non-AI systems. This calls for implementing standard protections such as network firewalls and safeguards against attacks like DDoS. Tools like AWS Shield and AWS WAF are good examples, and they help protect your infrastructure regardless of whether AI is involved.

This is not an exhaustive checklist, but a set of patterns that tend to have the highest impact. In practice, the right combination depends on the specific system, its architecture, and its risk profile.

When working with teams at STX Next, this is exactly where we focus: helping identify which safeguards matter most before a system reaches production.

Before launch: questions to ask your team

Before exposing a GenAI feature to real users, ask:

  • What data, tools, APIs, or workflows can the model access?
  • Which actions can the system trigger without human approval?
  • Are prompts, attachments, retrieved documents, and model outputs treated as untrusted input?
  • What happens if the model hallucinates, follows hidden instructions, or returns misleading information?
  • Can we trace what the model saw, generated, accessed, and triggered during an incident?
  • Which safeguards are already in place, and which ones are still assumed rather than tested?

GenAI security starts before launch

Teams often postpone monitoring and deeper security checks, assuming they can be added after launch. In practice, that rarely happens. New features take priority, and gaps remain.

If I were to name just one main takeaway, it’s to build a zero-trust approach to security. It’s always better to over-secure than under-secure, and to automate double-checking wherever possible. We should never assume things will go according to plan, particularly in an area as quickly developing as AI.

At STX Next, we help teams move from AI proof of concept to secure, production-ready AI systems. That usually means reviewing the architecture around the model: data access, permissions, integrations, observability, deployment setup, and the operational controls needed before exposing AI features to real users.