One Year of LLMs in Production

It’s been about a year since I first put an LLM into a production system. Not as a small feature, but as a replacement for a lot of business logic. Early on it seemed pretty reckless, and I wasn’t convinced it could be reliable. But now it feels fine, I’ve grown used to it, and to be honest it just works?
This post is a bit of a ramble about where LLMs have actually settled in our stack, what they’ve replaced, and what that means for how I think about building software.
At a high level, our system looks pretty simple:
- A user sends a message
- We decorate that message with relevant context (who they are, what they’ve done, what they’re allowed to do)
- An LLM processes the input
- The LLM decides whether to call tools
- Tools execute actions against our backend systems (lookups, bookings, soon payments)
- The result is turned back into a response to a user
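The loop above can be sketched in a few lines of Python. To keep it self-contained, the "model" here is a stub that always asks for one lookup; the tool name, context fields, and replies are all invented for illustration, not our actual code.

```python
# Minimal sketch of the request loop, with a stubbed model.
# fake_llm, lookup_booking, and the context fields are hypothetical.

def fake_llm(message, context, tools):
    """Stand-in for an LLM call: decides whether to invoke a tool."""
    if "booking" in message:
        return {"tool": "lookup_booking", "args": {"user_id": context["user_id"]}}
    return {"text": "How can I help?"}

def lookup_booking(user_id):
    # Stand-in for a backend lookup against our internal systems.
    return {"user_id": user_id, "booking": "table for two, 7pm"}

TOOLS = {"lookup_booking": lookup_booking}

def handle_message(user, message):
    # 1. Decorate the message with context about who the user is.
    context = {"user_id": user["id"], "permissions": user["permissions"]}
    # 2. The model processes the input and may decide to call a tool.
    decision = fake_llm(message, context, TOOLS)
    # 3. Tools execute actions against backend systems.
    if "tool" in decision:
        result = TOOLS[decision["tool"]](**decision["args"])
        # 4. The result is turned back into a response to the user.
        return f"Found it: {result['booking']}"
    return decision["text"]
```

In the real system the stub is an actual LLM client and the tool table is much bigger, but the shape of the loop is the same.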
Yep, you guessed it: it’s a glorified chatbot that can do things for you. A chat interface on top of some data and APIs. Nothing particularly new or exciting, but the stack that powers it looks radically different to how I would have built something like this a few years ago.
The Death (or Migration) of the Business Logic Layer
In a traditional web app, this kind of system would be built around a fairly heavy business logic layer. I would expect to see things like:
- Controllers parsing input
- Services orchestrating workflows
- Validation layers enforcing rules
- Decision trees scattered across the codebase
Every user intent would be explicitly mapped and captured in some sort of behaviour-driven design. A suite of statements like:
“If the user says X, and they are Y, and condition Z holds, then do A, B, C.”
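That “if X and Y and Z, then A, B, C” style translates directly into code. A hedged sketch of what one of those explicit handlers used to look like; the intents, permissions, and replies are invented for illustration:

```python
# The old business logic layer: every intent explicitly mapped to a
# branch. The intent names and conditions here are hypothetical.

def handle_intent(intent, user, booking_still_open):
    if intent == "cancel_booking":
        # "If the user says X, and they are Y, and condition Z holds..."
        if "cancel" not in user["permissions"]:
            return "not allowed"
        if not booking_still_open:
            return "too late to cancel"
        return "cancelled"
    if intent == "view_booking":
        return "here is your booking"
    # Anything we didn't anticipate falls through to a generic failure.
    return "unrecognised request"
```

Every path had to be anticipated; anything else fell through to the catch-all.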
That logic had to be explicit, deterministic, and fully encoded in the codebase. Today, a surprising amount of that has… disappeared. Or, more accurately, it has moved. Instead of writing a lot of branching logic in code, or objects that describe a particular set of business logic functions, we have:
- Capabilities described in prompts
- Structured tool schemas that describe how an agent can perform actions
- A model that decides what to ask the user, and what actions to perform
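For the curious, a tool schema is mostly just a JSON-Schema-style description of the arguments an action takes. This is an invented example in the shape most tool-calling APIs expect, not one of our real tools:

```python
# A hypothetical tool schema: the name, description, and fields are
# invented, but the structure follows the common JSON-Schema convention.

CREATE_BOOKING_TOOL = {
    "name": "create_booking",
    "description": "Book a table for the authenticated user.",
    "parameters": {
        "type": "object",
        "properties": {
            "date": {"type": "string", "description": "ISO 8601 date"},
            "party_size": {"type": "integer", "minimum": 1},
            "notes": {"type": "string"},
        },
        "required": ["date", "party_size"],
    },
}
```

The model reads this, decides when the user wants a booking, and fills in the arguments itself. That replaces a controller, a validator, and a chunk of orchestration code.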
To me this is still quite unsettling. I’m getting used to the idea, but I still can’t help feeling it’s a little too vague, even after countless tests and proof that it is reliable.
Do I just need to embrace the vibes?
The prompt has quietly become the most important part of the system. It does things that used to be spread across multiple layers:
- Interpreting user intent
- Deciding which backend operations to call
- Sequencing actions
- Handling partial information
- Recovering from ambiguity
- Handling conversations in multiple languages
In the past, you’d write a state machine or workflow engine for this. Now, you describe the workflow in natural language and let the model handle it. This works shockingly well, even though I find it hard to admit. It just works. I still can’t shake the fact that we have swapped deterministic behaviour for probabilistic behaviour. While the prompt guides the model through a user interaction, the connected tools allow the model to interact with backend systems.
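To make “describe the workflow in natural language” concrete, here is a sketch of what that kind of prompt looks like. The wording is invented, but the shape, a list of behavioural rules where a state machine used to be, is representative:

```python
# A hypothetical system prompt: the workflow lives in prose, not code.
# Tool names like create_booking are placeholders.

SYSTEM_PROMPT = """\
You are a booking assistant.
- Work out what the user wants; ask a clarifying question if unsure.
- Look up the user's existing bookings before creating a new one.
- Only call create_booking once you have a date and a party size.
- If a tool call fails, apologise and offer to try again.
- Reply in the language the user wrote in.
"""
```

Each bullet would once have been a branch, a validator, or a whole service. Now it’s a sentence.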
As a result we have a very thin layer on top of our internal APIs that is callable by the model (it’s MCP today, but who knows what it is tomorrow). And that’s, well, that’s it. I’m still not comfortable with LLMs calling critical APIs directly. But I’m sure that’s not far away.
The New Failure Modes
All of this has created some interesting new modes of success and failure as we have built out this capability. Some new classes of problem include:
- Prompt drift: small wording changes causing large behavioural shifts
- Tool misuse: the model calling the right tool with slightly wrong arguments
- Debugging difficulty: no stack trace, just… vibes
All of these are solvable. For debugging, your logs had better be excellent; I rely on them entirely when testing new features. When a model calls a tool with the wrong parameters, it should probably try again and regenerate the arguments instead of erroring out. And any change to a prompt means performing a system-wide end-to-end test of all behaviours. It feels closer to tuning a system than programming one. We’ve gone from,
“Write code that handles every case”
to,
“Describe a system that can handle most cases, and shape its behaviour over time”
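The “regenerate the arguments instead of erroring out” idea is simple enough to sketch. `regenerate` here stands in for another round-trip to the model carrying the error message; the whole thing is a hypothetical outline, not our implementation:

```python
# Sketch of tool-call retry: on bad arguments, feed the error back to
# the model (via `regenerate`, a stand-in) and try again, logging every
# attempt, since the log is the only debugging artefact you get.

def call_tool_with_retry(tool, args, regenerate, max_attempts=3):
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return tool(**args)
        except (TypeError, ValueError) as err:
            # No stack trace will save you later; log everything now.
            print(f"attempt {attempt} failed: {err}")
            last_error = err
            args = regenerate(args, str(err))
    raise RuntimeError(f"tool call failed after retries: {last_error}")
```

A usage example: if the model passes `"21"` where an integer is required, the error goes back, the arguments come back corrected, and the second attempt succeeds. It’s error handling as a conversation, which is exactly the tuning-not-programming feeling.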
It feels less like programming and more like training, guiding, vibing. And yet it works. I suppose I am downplaying how much code is still needed: defining connections, network requests, authentication, using libraries to call LLMs, handling incoming and outgoing messages, logging… All that stuff is still there. But the core business logic is largely gone.
A year in, LLMs haven’t replaced software engineering. They’ve just given you new ways to solve the same problems that exist in software. They might be more suited to your use case, or less suited. You won’t know until you try. And honestly, I’m still not sure if that’s a simplification or just a different kind of hard.
I’m still not sure if I like it, or if this is a strange experiment that we will look back on and go, “what were we thinking?”.
Curious to hear how others are structuring this—especially where you’ve drawn the line between “LLM decides” and “code decides”.