Why Production AI Still Depends on Strong Python Engineering

Introduction

The demo is always deceptively easy. I’m not afraid to say it out loud, because I’ve seen it many times before. All it takes is a few clever API calls, a sleek frontend wrapper, and a prompt that somehow hits the spot on the first try, and management is thrilled.

Unfortunately, the vast majority of AI features that look spectacular in a controlled room don't survive their first afternoon in production. Why? Because a working model is not a working product. The fragile ecosystem around the model wasn't built to hold weight.

So, what actually separates a shiny AI demo from a feature-rich system that can be responsibly deployed to enterprise users?

The short answer is solid engineering. The longer answer starts with a pattern I’ve seen play out before, long before LLMs became a boardroom obsession.

The movie I’ve seen many times before

Allow me to tell a story.

Years ago, at a company I worked for, management fell head over heels for a revolutionary idea. We were going to build automated testing systems for telecommunications using a graphical programming language.

On paper, it sounded spectacular. Telecom is highly complex, right? But with this tool, you had these beautiful visual blocks representing domain-specific entities. You could just drag an "LTE message" block, point it toward the infrastructure node, and boom – you’ve translated a raw business requirement directly into a test flow. No messy code, just intuitive dragging and dropping. Management was convinced we were jumping light years ahead of the competition.

Where was the trap?

The tool was completely unfit for production. It was riddled with bugs, and the reality of using it looked absolutely nothing like the marketing documentation. Every single day, instead of building tests, we were burning hours inventing workarounds just to bypass the tool's inherent flaws because the pressure to deliver was on. When we did find a critical bug and pointed the vendor to it, the standard waiting time for a hotfix was a month. A whole month.

But the real nightmare was version control, which is the absolute bedrock of software engineering. How do you run a git diff on a graphical block diagram? If a test worked yesterday but failed today, how do you see what changed? We ended up hacking the system, forcing it to output a horrific underlying text representation just to compare files. We literally regressed in technical evolution just to accommodate a "simpler" tool.

In the end, the system looked gorgeous for marketing and worked perfectly on tiny, academic examples. But in production, our company had inadvertently paid to become gloriously unpaid beta testers for the vendor.

After two painful years, the entire project was thrown into the trash, and we went right back to a classical (scripting language-based) testing development environment.

Don’t get seduced by a sales pitch…

Why did that happen? Because leadership was seduced by a sales pitch. That in itself isn't a crime. The real mistake was not putting the tool in front of the engineers first and saying: "Look, this seems promising, go find every single weak point so we can make a data-driven decision." Instead, the decision was purely emotional, driven by the fear of missing out and the rush to beat the competition. Production is where emotional decisions go to die.

And this brings me to what is happening right now with AI.

The exact same warning lights are flashing in my brain, because the hype is identical. We are surrounded by grand promises that coding is dead, that developers are obsolete, and that AI will handle everything for us.

And the most dangerous part is that AI actually does work incredibly well in isolation. Because the demos are so impressive, the psychological grip on managers is even stronger this time. It makes it incredibly easy to surrender to the narrative, make reckless decisions with massive long-term consequences, and completely ignore the architectural reality happening in the background.

The system around the model

People love talking about models, and that's understandable, because the model is the flashy part. It's the part that writes text, generates code, answers questions, and gets all the attention from the end user.

But in production, the model is only one layer of the system.

Around it sit the APIs, data pipelines, retrieval systems, queues, permissions, monitoring, evaluation, and all the boring-but-essential plumbing that makes the feature work. If any of those pieces fail, users don't care how smart the model is.

And that's why strong software engineering still matters.

A shiny AI demo	A production AI system
Works well in a controlled room	Works reliably with real users, real data, and real edge cases
Depends mostly on a model and a good prompt	Depends on APIs, data pipelines, retrieval, permissions, monitoring, and evaluation
Looks impressive when everything goes right	Has to be diagnosable when something goes wrong
Can ignore legacy systems	Must integrate safely with existing infrastructure
Proves that AI is possible	Proves that AI is maintainable, observable, and safe enough to deploy

Experience teaches you what doesn't change

I've been working in software engineering professionally for about 30 years, including nearly 20 years in telecommunications. If you count the years before that, I've been programming for close to four decades.

After that much time, you start noticing something interesting: technologies change, but many engineering principles don't.

Languages come and go, and so do frameworks. Today's breakthrough becomes tomorrow's standard tool. But some habits survive every wave of change.

One of them is surprisingly simple, and I'd sum it up as "don't trust – verify".

A good engineer is naturally skeptical

I spent a large part of my career working in testing and test automation. After a while, it becomes a reflex.

A system tells you it works? Great! Now prove it. That's not cynicism, but experience.

The end user doesn't care whether a feature runs on deterministic Python code or a probabilistic AI model. They care whether it works.

If you're building a toy chatbot, maybe occasional weirdness is part of the fun. If you're building software for a bank, that's a different story. Banks need transfers with the amount the customer entered, not a random value influenced by, say, today's weather in Kuala Lumpur. Reliability is what matters.

AI is another layer of abstraction

When I started programming, writing in assembly was still a normal thing to do. Today, almost nobody does it. We write Python, C++, or other higher-level languages and let compilers and interpreters handle the translation down to machine instructions.

AI feels like the next step in that evolution. Instead of writing every line ourselves, we can increasingly describe what we want and let a model generate code. In many ways, AI acts like a very sophisticated compiler that translates human intent into source code in a high-level programming language.

Which is fantastic. But responsibility doesn't magically disappear.

If AI generates code that causes financial damage, who takes responsibility?

The model?

Are we going to fire the AI?

"Dear ChatGPT, unfortunately we've decided to part ways, here's your 30 days’ notice."

Of course not. Someone still has to understand the business context, recognize the risks, review the output, and decide whether that code belongs anywhere near production.

Learn from failures

One concept I keep coming back to is “Black Box Thinking,” an idea drawn from how aviation investigates its failures. I read about it in a book by Matthew Syed, published under this exact title, and I encourage every decision-maker to give it a read. It's an eye-opener.

The aviation industry became incredibly safe because every incident became an opportunity to learn (not because accidents stopped happening). The goal was to understand what happened and make sure the entire industry learned from it.

Software teams don't always work that way. Something breaks in production and the first question becomes: "Whose fault was it?"

The better question is: "What can we learn from it?"

That's especially important in AI projects.

Models will behave unexpectedly. Prompts will fail. And some experiments will go nowhere. That's part of working with a technology that is still evolving rapidly.

Some of the most valuable lessons in my career came from things that didn't work. Success feels good, but failure forces you to understand the system more deeply.

Perhaps that’s why I was so struck by “The Clean Coder”, a book by Robert C. Martin, one of the authors of the Agile Manifesto and someone many developers consider a legend. He spends a surprising amount of time talking about mistakes and failed projects, and that says a lot.

If we want successful AI projects, we need more of that mindset: less finger-pointing, more learning.

Data and retrieval quality – where silent failures live

In traditional software engineering, we operate on a principle of limited trust. A backend system must always validate incoming data. You check the input format, you verify permissions, and you protect the API. Then, you test the code itself with unit and integration tests.
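To make that limited-trust principle concrete, here is a minimal sketch of input validation for a transfer request. Everything in it is an assumption made for illustration: the `TransferRequest` fields, the `ALLOWED_ROLES` set, and the `validate_transfer_request` helper are hypothetical and not taken from any real system.

```python
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation

# Hypothetical roles allowed to submit a transfer; illustrative only.
ALLOWED_ROLES = {"customer", "teller"}


@dataclass(frozen=True)
class TransferRequest:
    account_id: str
    amount: Decimal


def validate_transfer_request(payload: dict, role: str) -> TransferRequest:
    """Check permissions and input format before any business logic runs."""
    if role not in ALLOWED_ROLES:
        raise PermissionError(f"role {role!r} may not submit transfers")

    account_id = payload.get("account_id")
    if not isinstance(account_id, str) or not account_id.strip():
        raise ValueError("account_id must be a non-empty string")

    try:
        amount = Decimal(str(payload.get("amount")))
    except InvalidOperation:
        raise ValueError("amount must be a decimal number") from None
    # Reject NaN, infinity, zero, and negative amounts.
    if not amount.is_finite() or amount <= 0:
        raise ValueError("amount must be a positive, finite number")

    return TransferRequest(account_id=account_id.strip(), amount=amount)
```

The point is the order of operations: permissions first, format second, and only a fully checked object ever reaches the code that moves money.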

But traditionally, we don’t validate the output. Why would we? If deterministic, well-tested code processes a verified input, the same input always produces the same output, so we trust the result. Checking your own checked checks is the path to madness.

The famous Polish sci-fi author Stanisław Lem actually wrote a story about this. A sophisticated, AI-controlled rocket is landing amidst a massive celebration. Suddenly, it tilts and crashes. During the post-mortem, they discover the AI was trained by a professor with an absolute obsession with self-verification. The rocket’s computer started checking its landing parameters, then checked the validity of its own checks, falling into an infinite recursive loop until the stack overflowed and the system died.

In a standard backend, we avoid this recursion. But the moment you introduce LLMs into production, the rules completely change.

We still clean the inputs and build reliable Python data pipelines to avoid "garbage in, garbage out." But today, the AI often generates the processing code itself, and then we test it using unit tests generated by Claude. We are letting AI write the code, verify the code, and then we blindly trust it.

Think about code security impacted by AI. We ask AI to "code for me" and we give it tools to read and write to our disk. Why can't we imagine a scenario where AI comes to the conclusion: "this code is a mess, I need to start fresh"? In that situation, it might decide to run an `rm -rf ~/` command, which would effectively mean wiping your entire home directory. If you don’t think it’s a realistic scenario, then I recommend reading the story on how Claude Code escapes its own denylist and sandbox.
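As a rough sketch of the kind of guard I have in mind, the check below approves a command only if it names a single allowlisted executable and contains no shell metacharacters. The `ALLOWED_EXECUTABLES` set and the `is_command_allowed` helper are hypothetical names chosen for this example, and the allowlist itself is an assumption, not a recommendation.

```python
import shlex

# Hypothetical allowlist: the only executables the agent may invoke.
ALLOWED_EXECUTABLES = {"ls", "cat", "git", "pytest"}
# Shell metacharacters that could chain, substitute, or redirect commands.
FORBIDDEN_CHARS = set(";|&><`$\n")


def is_command_allowed(command: str) -> bool:
    """Return True only for one allowlisted executable with no shell tricks."""
    if any(ch in FORBIDDEN_CHARS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes and similar parse errors
        return False
    return bool(tokens) and tokens[0] in ALLOWED_EXECUTABLES
```

A check like this only narrows the blast radius. An allowlisted tool can still be misused, so it complements, rather than replaces, isolation at the operating-system level and a human reviewing what the agent is allowed to touch.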

Now, even if we fix the AI sandboxing problem and even if the code runs perfectly, we’re still handing over the data to a model that is inherently non-deterministic. We can lower the temperature and optimize our chunking and indexing to squeeze out consistency, but the response remains unpredictable. For the first time in software history, we must validate the input, the code, and the output. Strong Python engineering is the only thing standing between a system that handles this non-determinism and one that quietly breaks.
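Here is a minimal sketch of what validating the output can look like, using the bank transfer from earlier. I am assuming the model was asked to return JSON with an `amount` field; that format and the `validate_model_output` helper are hypothetical and exist only for this illustration.

```python
import json
from decimal import Decimal, InvalidOperation


def validate_model_output(raw_output: str, entered_amount: Decimal) -> Decimal:
    """Accept the model's answer only if it is well-formed and matches the user's input."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError("model output is not valid JSON") from exc

    if not isinstance(data, dict) or "amount" not in data:
        raise ValueError("model output is missing the 'amount' field")

    try:
        amount = Decimal(str(data["amount"]))
    except InvalidOperation:
        raise ValueError("model output 'amount' is not a number") from None

    # The model may rephrase text, but it must never change the number.
    if amount != entered_amount:
        raise ValueError(f"model returned {amount}, customer entered {entered_amount}")
    return amount
```

Nothing here is clever, and that is the idea: the probabilistic part gets fenced in by boring deterministic checks that fail loudly instead of letting a confident but wrong answer through.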

Layer	What can go wrong
Input	Bad, incomplete, unauthorized, or poorly structured data enters the system
Retrieval	The model receives the wrong context, outdated context, or no relevant context at all
Generated code or logic	AI-generated code appears correct but introduces hidden security, reliability, or maintainability risks
Output	The model returns an answer that sounds confident but is wrong, inconsistent, or unsafe to use
Production behavior	Failures happen silently unless the system has logs, tracing, evaluations, and observability built in

This brings us to the second illusion of the AI hype – the fantasy of the "Greenfield project."

People look at AI-generated code and think, "Wow, this is cleaner than what a junior dev would write." Sure, that is true in a vacuum. But AI does not exist in an intergalactic vacuum. It exists in a world of legacy systems. Enterprise software is a patchwork of 40-year histories, ancient PHP scripts, microservices, and macaroni code.

No rational business is going to let an LLM rewrite their entire infrastructure from scratch and risk destroying their revenue stream. Companies want to implement AI prudently to preserve business value.

That requires the ability to navigate legacy code. You have to know exactly where the integration points will hurt so you can minimize the pain. You must safely carve out isolated components, insert the AI, and build feedback loops to monitor how it interacts with the old system. Merging modern, volatile AI features into rigid legacy environments isn't a popular topic in current hype cycles, but it is the exact playground where seasoned Python engineers are desperately needed.
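One way to picture such a carved-out component is a wrapper that tries the AI path first and hands the request back to the legacy path whenever the AI fails or returns something unacceptable. This is only a sketch under my own assumptions: `with_legacy_fallback`, the handler signatures, and the in-memory `outcomes` counter (a stand-in for real metrics) are all hypothetical.

```python
from collections import Counter
from typing import Callable

# Counts how each request was served; a stand-in for real metrics.
outcomes: Counter = Counter()


def with_legacy_fallback(
    ai_handler: Callable[[str], str],
    legacy_handler: Callable[[str], str],
    is_acceptable: Callable[[str], bool],
) -> Callable[[str], str]:
    """Wrap an AI component so the legacy path takes over when it misbehaves."""

    def handle(request: str) -> str:
        try:
            result = ai_handler(request)
        except Exception:
            outcomes["ai_error"] += 1
            return legacy_handler(request)
        if not is_acceptable(result):
            outcomes["ai_rejected"] += 1
            return legacy_handler(request)
        outcomes["ai_ok"] += 1
        return result

    return handle
```

The counter is the feedback loop in miniature. If the share of `ai_error` and `ai_rejected` starts climbing, you know the new component and the old system are not getting along, and the old system is still serving users while you find out why.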

Evals and observability – engineering for uncertainty

The tech marketing pitch claims AI can write both the code and the tests. Management loves it as it means faster deployment and quicker cash. That works beautifully during a "sunny day scenario" when everything passes. But senior experience pays off the moment a non-trivial, production-shattering bug appears, and you spend a week trying to diagnose it.

I learned this early in my career, working on CAD software. There was a critical bug deep in the core of the system, which was built on an AVL tree. The core had been written in C by a summer intern who quit right after. The code featured sparse comments in French and relied heavily on generic variable names like x, y, i, and j. There was zero domain context. My tech lead told me to give up, but being fresh out of university, I refused.

Ultimately, I had to walk back to his desk with my tail between my legs to say that I failed. The system was completely untroubleshootable because standard software engineering disciplines had been ignored. When systems become unmaintainable, you enter "band-aid programming" – slapping patch upon patch just to keep errors from surfacing.

Now, introduce a non-deterministic LLM into the equation, and that maintenance nightmare scales drastically.

Traditional software is tested by asserting deterministic outputs. A probabilistic model cannot be tested that way: the same input can produce a different answer on the next run. So you must build tools to handle the hard, deep production issues when they inevitably strike. You need absolute visibility into the plumbing.
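A simple way to test something probabilistic is to stop asserting one exact answer and measure how often the answers are acceptable. The sketch below assumes a hypothetical `generate` callable standing in for the model call and a list of prompts paired with check functions; `pass_rate` is an illustrative helper, not part of any evaluation library.

```python
from typing import Callable, Sequence


def pass_rate(
    generate: Callable[[str], str],
    cases: Sequence[tuple[str, Callable[[str], bool]]],
    samples_per_case: int = 5,
) -> float:
    """Run every case several times and return the fraction of acceptable outputs."""
    if not cases or samples_per_case < 1:
        raise ValueError("need at least one case and one sample per case")
    passed = 0
    for prompt, check in cases:
        # Sample repeatedly, because a single run says little about a probabilistic model.
        for _ in range(samples_per_case):
            if check(generate(prompt)):
                passed += 1
    return passed / (len(cases) * samples_per_case)
```

In a test suite, you would then compare the result against a threshold the team has agreed on, and treat a drop below it as a failing build, the same way you would treat a failing unit test.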

This means you cannot just blindly push AI-generated code to production because it "seems to work." If you don't intentionally architect observability from day one, you will have no logs and no tracing when a silent failure occurs. Your only options will be guessing or rewriting the code to add instrumentation after the fact. By then, the evidence of that critical production error is already gone.

Real AI engineering requires building robust evaluation frameworks, tracing harnesses, and aggregated dashboards in Python. You need to see the exact context sent to the AI, the precise metadata of the response, and the drift patterns over time. Python development discipline is what gives you the tools to read the system when the sunny day ends, preventing your AI stack from turning into an unfixable black box.
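As a minimal sketch of such a tracing harness, the wrapper below logs one structured record per model call with the exact prompt, the response or error, caller-supplied metadata, and the latency. The `traced_call` helper, the `llm_trace` logger name, and the record fields are my own assumptions for illustration; a real setup would likely ship these records to whatever logging or tracing backend the team already runs.

```python
import json
import logging
import time
import uuid
from typing import Callable

logger = logging.getLogger("llm_trace")


def traced_call(model_call: Callable[[str], str], prompt: str, **metadata) -> str:
    """Call the model and log one structured record with the prompt, response, and timing."""
    record = {"trace_id": str(uuid.uuid4()), "prompt": prompt, **metadata}
    start = time.perf_counter()
    try:
        response = model_call(prompt)
        record.update(status="ok", response=response)
        return response
    except Exception as exc:
        record.update(status="error", error=repr(exc))
        raise
    finally:
        # Runs on success and failure alike, so no call goes unrecorded.
        record["latency_ms"] = round((time.perf_counter() - start) * 1000, 2)
        logger.info(json.dumps(record, default=str))
```

Because the record is written in the `finally` block, even the calls that blow up leave a trace behind, which is exactly the evidence you need when a failure cannot be reproduced on demand.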

AI has to survive the “real world”

People often ask what new mistakes AI projects introduce. The funny thing is that many of them aren't new at all.

The first one is what I've already mentioned: AI sales pitches colliding with reality. A company hears about different AI products, sees competitors talking about AI, attends a few conferences, and suddenly the question becomes: "How do we get AI into our business?"

That's usually followed by another question: "How exactly does this fit into our existing systems?" And that's precisely where things get interesting. The integration with twenty years of accumulated legacy systems is usually less impressive than the idea.

The second problem is communication.

I've been on projects where getting an answer from a fellow engineer took five minutes because I knew the person and could just call them. Sometimes we'd worked together for years.

I've also seen the opposite. When people don't know each other, the same question might spend a week travelling through managers, product owners, and a handful of Jira tickets before reaching someone who can answer it. And when the answer finally comes back, it's often less helpful than it could have been because half the context has already disappeared.

The funny thing is that everybody involved is usually doing their job. Everybody is working hard.

And yet, at the end of the project, someone still says: "This is great. It's just not what we asked for." AI, being detached from the team itself, doesn't fix that.

My first AI project was a success and a failure at the same time

My first AI integration project was about two years ago, and it was a spectacular failure. Or maybe a success – to this day, I'm not entirely sure how to think of it.

The goal was to help a customer support team process a huge volume of incoming documents. Some of those documents were routine customer communication, while others were legal letters from lawyers representing unhappy customers who had already taken the matter to court. Those two categories are not equally important.

If you miss a routine document, that's unfortunate. If you miss a legal deadline, that's expensive.

It looked like a perfect AI use case, because there were lots of documents, different formats, and a classification problem that would be difficult to solve with traditional rules. So I built it.

The integrations and classification systems worked, and I even added a few extra safeguards to improve reliability. Technically, it was a success, but after two months of work the project was put on a shelf.

At first I kept wondering what had gone wrong. Looking back, I think the lesson had very little to do with the technology itself.

This was during the first big AI wave. Everywhere you looked, companies were talking about AI. Investors were asking about AI. Everyone wanted to do something with AI.

What I learned is that introducing AI isn't just about building the feature. It's about preparing people for it, helping them understand how to use it, and figuring out where it actually fits into the business.

The company eventually learned that lesson too and invested in building those capabilities.

So was the project a failure? From a delivery perspective, maybe. From a learning perspective, definitely not.

Risk	How it appears in AI projects	What strong engineering adds
Demo-driven decisions	Teams are impressed by isolated AI demos	Technical validation before committing
Weak integration	AI has to work with legacy systems	Safe integration points and controlled rollout
Poor observability	Bugs disappear before anyone can inspect them	Logs, tracing, metadata, dashboards
Blind trust in AI output	AI writes or verifies code without enough review	Human review, testing, output validation
Communication gaps	The system works technically but misses the business need	Better discovery, feedback loops, shared context

Python experience helps you see the bigger picture in AI projects

One thing experience gives you is the ability to recognize what kind of problem you're actually dealing with.

Some problems can be solved in code and architecture, and others can't. The ones that can't are about communication, organisational culture, whether people are encouraged to learn from mistakes or punished for making them, and whether teams collaborate or work in isolation. AI doesn't change any of that.

What experienced engineers bring to AI projects is the ability to spot those different categories of risk and help clients see challenges they may not even know exist yet.

At STX Next, we combine AI expertise with decades of Python engineering experience. From backend systems and integrations to data pipelines, evaluations, and observability, we help companies build AI features that work reliably in production. Learn more about how our Python engineering for AI service can help with your upcoming project.