Don't put the LLM in your backend

If you're building an AI product right now, you're probably about to make a decision that looks obvious and is actually a trap: putting the LLM inference call inside your own backend.

The reasoning seems airtight. Customers want a "just works" product. They don't want to integrate an LLM themselves, choose a model, write a prompt, or stand up an inference endpoint. So you wrap it for them. Their integration becomes "give us the input, we'll give you the output." The marketing line writes itself: "no LLM integration required."

We shipped MyAgentMail that way for the cadence engine — the multi-step outreach sequences customers use to send LinkedIn messages and emails. Step needs a message body? Our backend ran the LLM call with the customer's prompt + the lead's context and produced the content inline.

Six weeks later, we ripped it out. The shape that replaced it looks like this:

Cadence step is about to fire →
We POST to the customer's webhook (HMAC-signed):
  {
    "type": "cadence.step.draft.requested",
    "enrollment": {
      "leadName": "Jane Doe",
      "leadEmail": "[email protected]",
      "leadLinkedinUrl": "https://www.linkedin.com/in/jane-doe/",
      "currentStep": 2,
      "context": { /* whatever the customer passed at enrollment */ }
    },
    "step": {
      "kind": "linkedin_message",
      "promptHint": "Pitch around the painpoint we inferred."
    }
  }

Customer responds within 30 seconds:
  { "body": "Hey Jane — saw your post about cold outbound..." }
  OR { "skip": true, "reason": "lead unsubscribed in our CRM" }
  OR { "defer": { "untilSeconds": 3600 } }

That's the whole contract. The customer's LLM, the customer's prompt, the customer's CRM lookup, the customer's iteration loop. We never see the prompt. We never run the inference. We don't even know which model they used. We move the message — threaded correctly with the right Message-ID headers — to a real inbox the recipient can reply to, and we route the reply back through the same webhook chain.
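For a sense of how little sits on the customer's side of that contract, here is a minimal Python sketch of a receiving handler. It is an illustration under assumptions, not the documented implementation: it assumes the signature is a hex-encoded HMAC-SHA256 over the raw request body (the real scheme and header name are whatever the spec defines), and `draft_fn` and `is_unsubscribed` are hypothetical stand-ins for the customer's own LLM call and CRM lookup.

```python
import hashlib
import hmac
import json


def verify_signature(raw_body: bytes, signature_hex: str, secret: bytes) -> bool:
    # Assumption: hex HMAC-SHA256 over the raw body. Check the real scheme in the spec.
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # Constant-time comparison so the check does not leak timing information.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


def handle_draft_request(raw_body, signature_hex, secret, draft_fn, is_unsubscribed):
    """Return one of the three response shapes: body, skip, or defer."""
    if not verify_signature(raw_body, signature_hex, secret):
        raise PermissionError("bad webhook signature")

    event = json.loads(raw_body)
    if event.get("type") != "cadence.step.draft.requested":
        raise ValueError(f"unexpected event type: {event.get('type')!r}")

    enrollment, step = event["enrollment"], event["step"]

    # The CRM lookup lives in the customer's stack (hypothetical callback).
    if is_unsubscribed(enrollment["leadEmail"]):
        return {"skip": True, "reason": "lead unsubscribed in our CRM"}

    try:
        # The customer's model and prompt; the sender never sees either.
        body = draft_fn(enrollment, step)
    except TimeoutError:
        # Model too slow for this attempt: ask to be called again in an hour.
        return {"defer": {"untilSeconds": 3600}}
    return {"body": body}
```

Wire `handle_draft_request` into whatever HTTP framework you already run; the only hard requirement is that you verify the signature against the raw bytes before parsing them.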

Here are the three reasons that shape works better than the bundled version, and why most AI infrastructure plays should default to it.

1. Your customers can't debug what they can't see

When the LLM, the prompt, AND the business context all live on your servers, "why did the AI write that?" is a support ticket every single time.

There's no log the customer can read. No prompt they can tweak. No fix they can ship at 2am when their biggest account's outbound batch is going out and the messages sound wrong. They wait for you. You read the message, find the prompt template, find the customer's variables, hypothesize about why the LLM produced what it produced, deploy a tweak, ask them to retest. Half a day, minimum. Multiple times a week, at scale.

You have become the bottleneck on the part of your customer's product they care about most — their voice, their messaging, their customer relationship.

In the webhook shape, "why did the AI write that?" is a question they answer from their own logs. They see the exact prompt they sent to the model. They see the model's raw response. They tweak in 10 minutes and ship a fix without telling you. The ownership of who-debugs-what matches who-knows-what.

2. Your deploy cadence becomes their model choice

The day a meaningfully better model drops — Claude 4 → 4.5, GPT-5, whatever next month's release is — your customers want it. Within hours.

Your bundled drafter is on whatever model and prompt scaffold you shipped six months ago. You haven't migrated yet. You have other priorities. Maybe you need to validate the new model's behavior on edge cases first. Maybe the new model is more expensive and your unit economics need a rethink before you can pass it through.

So you don't migrate this week. Or next week. And your customers can't either. You're holding their content quality hostage to your release schedule. The honest answer is "we'll get to it" and the honest answer makes you look slow.

In the webhook shape, customers swap models on their side without telling you. Some run their outreach on Opus 4.7 with 1M context for deep personalization. Some run subject-line generation on Haiku for the volume play. Some are doing weird routing — first-touch on a creative model, follow-up nudges on a cheaper one. You don't care. You're not in the loop.
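That kind of routing is a few lines in the customer's own handler. A hedged sketch in Python: the model identifiers are placeholders rather than real model names, and it assumes `currentStep` counts from 1, which the contract above does not actually state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelChoice:
    model: str       # placeholder identifier, not a real model name
    max_tokens: int


# Hypothetical routing table; every customer would pick their own.
CREATIVE = ModelChoice(model="creative-model", max_tokens=800)
CHEAP = ModelChoice(model="cheap-model", max_tokens=200)


def choose_model(current_step: int) -> ModelChoice:
    """First touch goes to the creative model, follow-up nudges to the cheaper one."""
    # Assumption: steps are numbered from 1, so step 1 is the first touch.
    if current_step <= 1:
        return CREATIVE
    return CHEAP
```

Swapping a model is then a one-line change to the table, deployed on the customer's schedule.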

Your roadmap stops being about "did we migrate to the new model yet" and goes back to being about your actual product.

3. Bad output is your brand, not theirs

When a recipient gets a weird AI-written message, blame routes to whoever sent it. If your servers generated it from the customer's prompt, the customer can credibly point at you: "Their AI wrote it." That's your reputation absorbing damage from content you didn't author and can't defend, because you don't even have visibility into what their prompt asked for.

In the webhook shape, the blast radius of a bad message lives in the customer's stack. Their model, their prompt, their logs, their fix. Sounds harsh on first hearing. In practice it makes customers take quality seriously and iterate fast — the skin-in-the-game pattern that real ownership creates.

It also means when something goes wrong, you can help debug instead of being accused. "Send me the prompt and the response you got back — I can look at the network round-trip on our side and tell you if the delivery succeeded." You're operating one layer below the content problem, where you can actually be useful.

The cost is real — don't pretend otherwise

We lost the "no LLM integration required" sales line. A couple of prospects bounced because they wanted the simpler version — the white-glove "we handle the AI" shape. We don't sell to them anymore.

Onboarding got harder. Customers have to host a webhook endpoint, pay for their own model calls, and own the prompt engineering. There's real work that used to be free for them and isn't anymore.

But the customers who stayed shipped better. They had real control, real visibility, real iteration speed. Support load on our side dropped meaningfully — the "why did this message look weird" thread went from common to rare. Pricing got cleaner because we charge for the messaging infrastructure, not the inference, so we're not in the awkward "pass through OpenAI bills with a markup" position so many AI products are stuck in. Margins make sense. Quotas make sense. We don't have to model token-cost risk into every deal.

And the roadmap freed up. The hours we used to spend on prompt scaffolding, model evals, and "why did the LLM say X" tickets went into the parts of the product only we can build: delivery infrastructure, deliverability, the LinkedIn intent layer, multi-tenant isolation, the webhook delivery log so customers can see exactly what we POSTed and what they returned.
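To make the delivery log concrete, here is a rough Python sketch of what one entry could record. The field names and the `send` callable are illustrative assumptions, not the actual schema: the idea is only that every attempt stores exactly what was POSTed and exactly what came back, including failures.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class DeliveryLogEntry:
    event_type: str
    request_body: str               # exactly what was POSTed
    response_status: Optional[int]  # None when no response arrived
    response_body: Optional[str]    # exactly what the customer returned
    error: Optional[str]            # transport failure, if any
    duration_ms: int


def deliver_and_log(
    event_type: str,
    request_body: str,
    send: Callable[[str, float], Tuple[int, str]],  # hypothetical HTTP sender
    log: List[DeliveryLogEntry],
    timeout_s: float = 30.0,  # the 30-second response window from the contract
) -> DeliveryLogEntry:
    started = time.monotonic()
    status, body, error = None, None, None
    try:
        status, body = send(request_body, timeout_s)
    except Exception as exc:
        # Record the failure instead of hiding it; the log is the debugging surface.
        error = repr(exc)
    entry = DeliveryLogEntry(
        event_type=event_type,
        request_body=request_body,
        response_status=status,
        response_body=body,
        error=error,
        duration_ms=int((time.monotonic() - started) * 1000),
    )
    log.append(entry)
    return entry
```

Storing the raw request and response, rather than a summary, is what lets a customer line an entry up against their own logs.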

The general pattern

It's Stripe's playbook applied to AI: be the rails, not the train.

Stripe doesn't run your business logic. They don't choose your products, set your prices, write your invoice copy, or decide whether to refund a customer. They move money according to rules you define. The intelligence about what to charge and why lives on your side; the durable infrastructure about how to charge lives on theirs.

Most AI infrastructure should take the same shape. Move the information. Trigger the events. Handle the delivery. Do the boring durable stuff customers can't or shouldn't build themselves: auth, billing, retry semantics, observability, multi-tenant isolation, the obscure protocol edge cases that take years to get right.

The intelligence — the prompts, the model choice, the business context, the iteration loop — is the customer's, because that's the part they actually care about owning. It's their voice with their customers. They will iterate on it more than you ever will.

If you're tempted to bundle the inference

At least sit with the unbundled version first. Draw the contract. What does the webhook payload look like? What does the response shape need to support — just {body}, or also {skip}, {defer}, retries? What's your timeout budget? What's your HMAC scheme? What does the delivery log look like for customers who want to debug?
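One way to sit with it is to write the response parser before anything else, because it forces you to decide what counts as a valid answer. A minimal Python sketch, with the strictness rules as assumptions of this example rather than anything the contract above specifies: exactly one of `body`, `skip`, or `defer`, a non-empty body, and a positive integer `untilSeconds`.

```python
import json
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Body:
    text: str


@dataclass(frozen=True)
class Skip:
    reason: str


@dataclass(frozen=True)
class Defer:
    until_seconds: int


Decision = Union[Body, Skip, Defer]


def parse_decision(raw: str) -> Decision:
    """Turn the customer's JSON response into exactly one typed decision."""
    data = json.loads(raw)  # malformed JSON raises JSONDecodeError, a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")

    present = {"body", "skip", "defer"} & data.keys()
    if len(present) != 1:
        raise ValueError("response must contain exactly one of body, skip, defer")

    if "body" in data:
        if not isinstance(data["body"], str) or not data["body"].strip():
            raise ValueError("body must be a non-empty string")
        return Body(data["body"])

    if "skip" in data:
        if data["skip"] is not True:
            raise ValueError("skip must be true")
        return Skip(str(data.get("reason", "")))

    defer = data["defer"]
    until = defer.get("untilSeconds") if isinstance(defer, dict) else None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(until, int) or isinstance(until, bool) or until <= 0:
        raise ValueError("defer.untilSeconds must be a positive integer")
    return Defer(until)
```

Anything that raises `ValueError` here is a response you have to decide how to handle: retry, skip the step, or surface it in the delivery log.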

The shape of your API is your customer's hint about which layer you plan to own. Bundle one too many things "for their convenience" and you've quietly become the bottleneck on the part of their product they care about most.

You can always add the bundled shape later as a convenience layer on top. The reverse — pulling the inference out of your backend after customers have built around the bundled shape — is a breaking change you'll never want to make. The unbundled version is the safer default.

MyAgentMail is email + LinkedIn infrastructure for AI agents. The webhook-drafter pattern above is what every cadence step using draftStrategy: "webhook" does today. If you want to see the exact contract, it's in the OpenAPI spec under cadence.step.draft.requested, and the SKILL.md drops into any Claude Code / Cursor / Windsurf config without signup.
