One Year of LLMs in Production

It’s been about a year since I first put an LLM into a production system. Not as a small feature, but as a replacement for a lot of business logic. Early on it seemed pretty reckless, and I wasn’t convinced it could be reliable. But now it feels fine, I’ve grown used to it, and to be honest it just works?
This post is a bit of a ramble about where LLMs have actually settled in our stack, what they’ve replaced, and what that means for how I think about building software.
At a high level, our system looks pretty simple:
- A user sends a message
- We decorate that message with relevant context (who they are, what they’ve done, what they’re allowed to do)
- An LLM processes the input
- The LLM decides whether to call tools
- Tools execute actions against our backend systems (lookups, bookings, soon payments)
- The result is turned back into a response to a user
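The loop above can be sketched in a few lines of Python. To keep it self-contained, the "model" here is a stub that always asks for one lookup; the tool name, context fields, and replies are all invented for illustration, not our actual code.

```python
# Minimal sketch of the request loop, with a stubbed model.
# fake_llm, lookup_booking, and the context fields are hypothetical.

def fake_llm(message, context, tools):
    """Stand-in for an LLM call: decides whether to invoke a tool."""
    if "booking" in message:
        return {"tool": "lookup_booking", "args": {"user_id": context["user_id"]}}
    return {"text": "How can I help?"}

def lookup_booking(user_id):
    # Stand-in for a backend lookup against our internal systems.
    return {"user_id": user_id, "booking": "table for two, 7pm"}

TOOLS = {"lookup_booking": lookup_booking}

def handle_message(user, message):
    # 1. Decorate the message with context about who the user is.
    context = {"user_id": user["id"], "permissions": user["permissions"]}
    # 2. The model processes the input and may decide to call a tool.
    decision = fake_llm(message, context, TOOLS)
    # 3. Tools execute actions against backend systems.
    if "tool" in decision:
        result = TOOLS[decision["tool"]](**decision["args"])
        # 4. The result is turned back into a response to the user.
        return f"Found it: {result['booking']}"
    return decision["text"]
```

In the real system the stub is an actual LLM client and the tool table is much bigger, but the shape of the loop is the same.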
Yep, you guessed it: it’s a glorified chatbot that can do things for you. A chat interface on top of some data and APIs. Nothing particularly new or exciting, but the stack that powers it looks radically different to how I would have built something like this a few years ago.
The Death (or Migration) of the Business Logic Layer
In a traditional web app, this kind of system would be built around a fairly heavy business logic layer. I would expect to see things like:
- Controllers parsing input
- Services orchestrating workflows
- Validation layers enforcing rules
- Decision trees scattered across the codebase
Every user intent would be explicitly mapped and captured in some sort of behaviour-driven design. A suite of statements like:
“If the user says X, and they are Y, and condition Z holds, then do A, B, C.”
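That “if X and Y and Z, then A, B, C” style translates directly into code. A hedged sketch of what one of those explicit handlers used to look like; the intents, permissions, and replies are invented for illustration:

```python
# The old business logic layer: every intent explicitly mapped to a
# branch. The intent names and conditions here are hypothetical.

def handle_intent(intent, user, booking_still_open):
    if intent == "cancel_booking":
        # "If the user says X, and they are Y, and condition Z holds..."
        if "cancel" not in user["permissions"]:
            return "not allowed"
        if not booking_still_open:
            return "too late to cancel"
        return "cancelled"
    if intent == "view_booking":
        return "here is your booking"
    # Anything we didn't anticipate falls through to a generic failure.
    return "unrecognised request"
```

Every path had to be anticipated; anything else fell through to the catch-all.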
That logic had to be explicit, deterministic, and fully encoded in the codebase. Today, a surprising amount of that has… disappeared. Or, more accurately, it has moved. Instead of writing a lot of branching logic in code, or objects that describe a particular set of business logic functions, we have:
- Capabilities described in prompts
- Structured tool schemas that describe how an agent can perform actions
- A model that decides what to ask the user, and what actions to perform
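For the curious, a tool schema is mostly just a JSON-Schema-style description of the arguments an action takes. This is an invented example in the shape most tool-calling APIs expect, not one of our real tools:

```python
# A hypothetical tool schema: the name, description, and fields are
# invented, but the structure follows the common JSON-Schema convention.

CREATE_BOOKING_TOOL = {
    "name": "create_booking",
    "description": "Book a table for the authenticated user.",
    "parameters": {
        "type": "object",
        "properties": {
            "date": {"type": "string", "description": "ISO 8601 date"},
            "party_size": {"type": "integer", "minimum": 1},
            "notes": {"type": "string"},
        },
        "required": ["date", "party_size"],
    },
}
```

The model reads this, decides when the user wants a booking, and fills in the arguments itself. That replaces a controller, a validator, and a chunk of orchestration code.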
To me this is still quite unsettling. I’m getting used to the idea, but I still can’t help feeling it’s a little too vague, even after countless tests and proof that it is reliable.
Do I just need to embrace the vibes?
The prompt has quietly become the most important part of the system. It does things that used to be spread across multiple layers:
- Interpreting user intent
- Deciding which backend operations to call
- Sequencing actions
- Handling partial information
- Recovering from ambiguity
- Handling conversations in multiple languages
In the past, you’d write a state machine or workflow engine for this. Now, you describe the workflow in natural language and let the model handle it. This works shockingly well, even though I find it hard to admit. It just works. I still can’t shake the fact that we have swapped deterministic behaviour for probabilistic behaviour. While the prompt guides the model through a user interaction, the connected tools allow the model to interact with backend systems.
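To make “describe the workflow in natural language” concrete, here is a sketch of what that kind of prompt looks like. The wording is invented, but the shape, a list of behavioural rules where a state machine used to be, is representative:

```python
# A hypothetical system prompt: the workflow lives in prose, not code.
# Tool names like create_booking are placeholders.

SYSTEM_PROMPT = """\
You are a booking assistant.
- Work out what the user wants; ask a clarifying question if unsure.
- Look up the user's existing bookings before creating a new one.
- Only call create_booking once you have a date and a party size.
- If a tool call fails, apologise and offer to try again.
- Reply in the language the user wrote in.
"""
```

Each bullet would once have been a branch, a validator, or a whole service. Now it’s a sentence.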
As a result we have a very thin layer on top of our internal APIs that is callable by the model (it’s MCP today, but who knows what it is tomorrow). And that’s, well, that’s it. I’m still not comfortable with LLMs calling critical APIs directly. But I’m sure that’s not far away.
The New Failure Modes
All of this has created some interesting new modes of success and failure as we have built out this capability. Some new classes of problem include:
- Prompt drift: small wording changes causing large behavioural shifts
- Tool misuse: the model calling the right tool with slightly wrong arguments
- Debugging difficulty: no stack trace, just… vibes
All of these are solvable. For debugging, your logs had better be excellent; I rely on them entirely when testing new features. When a model calls a tool with the wrong parameters, it should probably try again and regenerate the arguments instead of erroring out. And any change to a prompt means performing a system-wide end-to-end test of all behaviours. It feels closer to tuning a system than programming one. We’ve gone from,
“Write code that handles every case”
to,
“Describe a system that can handle most cases, and shape its behaviour over time”
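The “regenerate the arguments instead of erroring out” idea is simple enough to sketch. `regenerate` here stands in for another round-trip to the model carrying the error message; the whole thing is a hypothetical outline, not our implementation:

```python
# Sketch of tool-call retry: on bad arguments, feed the error back to
# the model (via `regenerate`, a stand-in) and try again, logging every
# attempt, since the log is the only debugging artefact you get.

def call_tool_with_retry(tool, args, regenerate, max_attempts=3):
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return tool(**args)
        except (TypeError, ValueError) as err:
            # No stack trace will save you later; log everything now.
            print(f"attempt {attempt} failed: {err}")
            last_error = err
            args = regenerate(args, str(err))
    raise RuntimeError(f"tool call failed after retries: {last_error}")
```

A usage example: if the model passes `"21"` where an integer is required, the error goes back, the arguments come back corrected, and the second attempt succeeds. It’s error handling as a conversation, which is exactly the tuning-not-programming feeling.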
It feels less like programming and more like training, guiding, vibing. And yet it works. I suppose I am downplaying how much code is still needed: defining connections, network requests, authentication, using libraries to call LLMs, handling incoming and outgoing messages, logging… All that stuff is still there. But the core business logic is largely gone.
A year in, LLMs haven’t replaced software engineering. They’ve just given you new ways to solve the same problems that exist in software. They might be more suited to your use case, or less suited. You won’t know until you try. And honestly, I’m still not sure if that’s a simplification or just a different kind of hard.
I’m still not sure if I like it, or if this is a strange experiment that we will look back on and go, “what were we thinking?”.
Curious to hear how others are structuring this—especially where you’ve drawn the line between “LLM decides” and “code decides”.