Shipping GPT in Production Without Burning Your Margins

Published: March 19, 2026
Reading time: 10 mins
Category: Software
Written by: Mark P.

Why Naive Integrations Bankrupt You

A founder we worked with last year shipped an MVP with a single OpenAI call per user message and no caching. The product caught a small wave on Product Hunt. Two days later, their OpenAI bill for the month was higher than their AWS bill, their Vercel bill, and their entire salary budget combined. They emailed us in a panic.

The fix took a week and dropped their cost per session by 92%. Nothing exotic. Just the boring, well-understood patterns most teams skip because the first iteration "worked fine in testing."

Cache Aggressively, Cache Wide

Most LLM calls are not unique. Users ask the same questions, request the same summaries, and trigger the same workflows. Every AI product we ship has at least three caching layers:

1. Exact-match cache on the prompt hash: a Redis lookup before the API call. This alone catches 30–50% of traffic in mature products.

2. Semantic cache using embeddings of past prompts. When a new prompt is within a small cosine distance of a cached one, we serve the cached completion. The threshold is tunable per use case.

3. Result-fragment cache for multi-step chains. If three of four steps are deterministic, only the variable step needs a live call.
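A minimal sketch of the exact-match layer, assuming the cache key is a SHA-256 hash over the model name and a whitespace-normalized prompt. The names `prompt_key`, `ExactMatchCache`, and `call_llm` are hypothetical, and a plain dict stands in for Redis so the example runs on its own:

```python
import hashlib
import json
from typing import Callable, Dict


def prompt_key(model: str, prompt: str) -> str:
    """Build a stable cache key from the model name and the normalized prompt."""
    normalized = " ".join(prompt.split())  # collapse runs of whitespace
    payload = json.dumps({"model": model, "prompt": normalized}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ExactMatchCache:
    """Dict-backed stand-in for Redis (assumption: production would use GET/SET with a TTL)."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get_or_call(
        self, model: str, prompt: str, call_llm: Callable[[str, str], str]
    ) -> str:
        key = prompt_key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        completion = call_llm(model, prompt)  # the only line that spends tokens
        self._store[key] = completion
        return completion
```

Tracking `hits` and `misses` from day one makes it easy to check whether your traffic actually lands in the hit-rate range you expect.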

The semantic cache alone usually doubles your cache hit rate. The infrastructure cost is trivial compared to the API savings.
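As a rough illustration of the semantic layer, the sketch below does a linear scan over stored embeddings and serves a cached completion when cosine similarity (1 minus cosine distance) clears a threshold. `SemanticCache`, the `embed` callable, and the 0.95 default are assumptions for the example; a real deployment would plug in an embedding model and most likely a vector index instead of a list:

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Serve a cached completion when a new prompt is close enough to an old one."""

    def __init__(self, embed: Callable[[str], Vector], min_similarity: float = 0.95) -> None:
        self._embed = embed                    # hypothetical embedding function
        self._min_similarity = min_similarity  # tune per use case
        self._entries: List[Tuple[Vector, str]] = []

    def lookup(self, prompt: str) -> Optional[str]:
        query = self._embed(prompt)
        best_score, best_completion = 0.0, None
        for vector, completion in self._entries:
            score = cosine_similarity(query, vector)
            if score > best_score:
                best_score, best_completion = score, completion
        # Only serve the nearest neighbor if it clears the threshold.
        return best_completion if best_score >= self._min_similarity else None

    def store(self, prompt: str, completion: str) -> None:
        self._entries.append((self._embed(prompt), completion))
```

The threshold is the knob that matters: set it too low and users get answers to questions they did not ask, so it is worth validating against labeled prompt pairs before loosening it.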

Stream by Default

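Streaming is free perceived latency: users see the first tokens while the rest of the completion is still generating. It also lets you bail out of bad completions early, so you stop paying for output you were never going to use (how cancelled requests are billed varies by provider and pricing tier). Every user-facing LLM call we ship streams unless there's a specific reason it can't.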

The secondary benefit nobody talks about: streaming makes timeouts cheaper. A non-streaming call that hangs costs you the full retry. A streaming call that hangs lets you cut bait at the first sign of trouble and only pay for what you received.
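The early bail-out can be sketched independently of any SDK. Here `consume_stream` and `should_abort` are hypothetical names, and the stream is any iterable of text chunks; with a real client you would also close the underlying response when aborting so the provider can stop generating:

```python
from typing import Callable, Iterable, List, Tuple


def consume_stream(
    chunks: Iterable[str],
    should_abort: Callable[[str], bool],
) -> Tuple[str, bool]:
    """Accumulate streamed text and stop reading once should_abort(text) is true.

    Returns (text_received_so_far, aborted).
    """
    received: List[str] = []
    for chunk in chunks:
        received.append(chunk)
        text = "".join(received)
        if should_abort(text):
            # Stop pulling from the stream; later chunks are never requested.
            return text, True
    return "".join(received), False
```

A typical `should_abort` checks for a refusal prefix, malformed JSON, or an output that has already blown past a length budget.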

The Cheap-Model Fallback Pattern

GPT-4-class models are not always necessary. For most production AI features the right pattern looks like:

1. Run a fast, cheap classifier (a small model or even a regex) to bucket the request.

2. Route 70–80% of traffic to a cheap model (`gpt-4o-mini`, Haiku, etc.).

3. Use the expensive model only for the cases the classifier flagged as hard.

This routinely cuts costs by 60% with no measurable quality drop. The work is in building the classifier and the eval set to keep it honest, not in the prompting.
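The three steps above might look like the sketch below, using a regex and a length check as the cheap classifier. The keyword list, the 600-character cutoff, and `gpt-4o` as the expensive model are illustrative assumptions, not recommendations; only `gpt-4o-mini` comes from the list above:

```python
import re
from dataclasses import dataclass

CHEAP_MODEL = "gpt-4o-mini"  # named in the article
EXPENSIVE_MODEL = "gpt-4o"   # assumption: stand-in for "the expensive model"

# Hypothetical signals that a request is hard; derive real ones from your eval set.
HARD_PATTERNS = re.compile(
    r"\b(contract|legal|diagnos\w*|refactor|prove)\b", re.IGNORECASE
)
MAX_EASY_CHARS = 600  # assumption: long inputs tend to need the stronger model


@dataclass(frozen=True)
class Route:
    model: str
    reason: str


def route_request(message: str) -> Route:
    """Bucket a request with a cheap check, then pick the model for that bucket."""
    if HARD_PATTERNS.search(message):
        return Route(EXPENSIVE_MODEL, "matched a hard-topic keyword")
    if len(message) > MAX_EASY_CHARS:
        return Route(EXPENSIVE_MODEL, "long input")
    return Route(CHEAP_MODEL, "default bucket")
```

Returning a `reason` alongside the model makes routing decisions easy to log, which is what lets you audit the classifier against the eval set later.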

Cost-Aware Retry

Default retry logic punishes you. A flaky API call triggers three exponential-backoff retries, each at full token cost, and in the worst case you've spent 4x the tokens (one original call plus three retries) for the same answer.

We retry once on transient errors and never on rate limits. For rate limits we surface a graceful fallback to the user ("we're a little slow right now") and queue the request for the next window. At scale, the difference between two attempts and four shows up directly on the bill.
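A minimal sketch of that policy, assuming two locally defined exception types (`TransientError`, `RateLimitError`) that you would map from your client library's real errors, and a plain list standing in for the deferred-work queue. Queueing after a second transient failure is an assumption of this sketch, not something the policy above specifies:

```python
import time
from typing import Callable, List


class TransientError(Exception):
    """Hypothetical: timeouts, 5xx responses, dropped connections."""


class RateLimitError(Exception):
    """Hypothetical: the provider signalled a rate limit (for example HTTP 429)."""


FALLBACK_MESSAGE = "We're a little slow right now. Your request is queued."


def call_with_budget(
    call_llm: Callable[[str], str],
    prompt: str,
    deferred_queue: List[str],
    retry_delay_s: float = 0.5,
) -> str:
    """Try once, retry once on transient errors, never retry rate limits."""
    for attempt in range(2):  # first call plus at most one retry
        try:
            return call_llm(prompt)
        except RateLimitError:
            break  # retrying now would only burn more quota
        except TransientError:
            if attempt == 0:
                time.sleep(retry_delay_s)
    # Out of attempts: defer the work and tell the user something honest.
    deferred_queue.append(prompt)
    return FALLBACK_MESSAGE
```

The hard cap of two attempts is the point: the worst-case token spend per request is bounded at 2x instead of 4x.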

Eval, Not Vibes

Every AI feature we ship has a written eval set—a hundred or so labeled examples specific to the client's domain. When we change the prompt, the model, or the routing, we run the eval and see the actual numbers. Without this you're just changing things until they feel better, which is how teams ship regressions.

The eval set is also how we justify going to a cheaper model. "Eval score dropped from 91 to 89 but cost dropped from $0.43 to $0.07 per session" is a conversation you can have with a CEO. "Vibes seem fine" is not.
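To make the comparison concrete, here is a small sketch of an eval runner and the cost arithmetic. `run_eval`, `EvalResult`, and exact-match scoring are assumptions for illustration; real eval sets often need fuzzier graders than string equality:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class EvalResult:
    score_pct: float
    passed: int
    total: int


def run_eval(
    predict: Callable[[str], str],
    examples: List[Tuple[str, str]],
) -> EvalResult:
    """Score a prompt/model/routing variant against labeled (input, expected) pairs."""
    if not examples:
        raise ValueError("eval set is empty")
    passed = sum(
        1
        for prompt, expected in examples
        if predict(prompt).strip().lower() == expected.strip().lower()
    )
    return EvalResult(100.0 * passed / len(examples), passed, len(examples))


def cost_reduction_pct(old_cost: float, new_cost: float) -> float:
    """Percentage saved when cost per session moves from old_cost to new_cost."""
    return 100.0 * (old_cost - new_cost) / old_cost
```

With the figures above, `cost_reduction_pct(0.43, 0.07)` comes out to roughly 84%, bought with a 2-point drop in eval score.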

What We Actually Charge Our Clients

For an AI feature shipped on top of an existing app, our typical engagement is 2–3 weeks of build and a one-week monitoring tail after launch. The deliverable is not just code—it's a cost model so the founder knows what their unit economics look like at 1k, 10k, and 100k MAU.

The single most important question we ask before writing a line of LLM code is: "What's the maximum we can spend per user per month and still have a viable business?" Every architectural choice flows from that number.
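A cost model in this spirit can start as a few lines of arithmetic. The sketch below assumes cost scales linearly with sessions; `sessions_per_user` and the budget figure in the usage note are made-up inputs you would replace with your own numbers:

```python
from typing import Dict, Iterable


def monthly_llm_cost(mau: int, sessions_per_user: float, cost_per_session: float) -> float:
    """Total monthly LLM spend under a simple linear model."""
    return mau * sessions_per_user * cost_per_session


def cost_by_tier(
    mau_tiers: Iterable[int],
    sessions_per_user: float,
    cost_per_session: float,
) -> Dict[int, float]:
    """Monthly spend at each MAU tier, for example 1k, 10k, and 100k."""
    return {
        mau: monthly_llm_cost(mau, sessions_per_user, cost_per_session)
        for mau in mau_tiers
    }


def within_budget(
    sessions_per_user: float,
    cost_per_session: float,
    max_spend_per_user: float,
) -> bool:
    """Answer the key question: does per-user spend fit under the ceiling?"""
    return sessions_per_user * cost_per_session <= max_spend_per_user
```

For instance, at a hypothetical 8 sessions per user per month and a $1.00 per-user ceiling, $0.07 per session fits the budget while $0.43 per session does not.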

Written by

Mark P.

💻 Writes about software at LevelByte. Built things at startups and agencies for the last decade.
