The ₹40,000 Lesson — What Nobody Tells You About Putting AI in Production

Published 2026-06-11 · 10 min read

An AI feature working perfectly in production. Users happy. All dashboards green. Nine days later — a ₹40,000 cloud bill. Here's the postmortem nobody writes about.

The Slack message came in on a Wednesday morning. Finance had pinged the engineering channel. Something about cloud bills. Could someone take a look.

The OpenAI line item had quietly grown to ₹40,000 a week. The feature it was powering, a document analysis endpoint we'd shipped a month earlier, was being hit about 300 times a day, or roughly 2,100 times a week. Do the math on that. We were paying roughly ₹19 per request to a service that should have cost us ₹3 at most.

I spent the next eleven hours figuring out what happened. By the end of it, I had a list of things I wished someone had told me before I shipped my first AI feature to production. This article is that list.

Most GenAI tutorials stop at the point where you successfully call the API. That's chapter one of a much longer story. The real work — the part that decides whether your AI feature ships, scales, and stays profitable — is everything that happens around the API call.

The function calling problem nobody warns you about

The first thing that bit us was function calling. The pattern looks clean on paper. You define functions. The model picks which one to call. Your code runs it. Magic happens.

What the tutorials skip is what happens when the model makes a mistake. And it will. Maybe not on your test cases. Definitely on the input some user hands it at 2 AM on a Saturday.

In our case, the model decided to call our database query function with an unbounded LIMIT. The function executed. We didn't validate the arguments because we trusted the schema. The query pulled half a million rows into memory. The pod OOM'd. The alert that fired was about the pod, not the AI. We restarted it. The next request did the same thing. We restarted it again.

Three hours of "transient failures" before someone thought to check what the model was actually asking for.

Here's what I learned about function calling that doesn't show up in the official docs. Function arguments are user input. The model is just the messenger. Treat them with the same paranoia you'd treat a request body coming from the public internet. Validate types. Enforce bounds. Check authorization. Add timeouts. Log every call.

The model isn't writing your business logic. The model is choosing which of your business logic functions to invoke and with what parameters. That distinction matters. Your function is still your function. It still needs to be defensive. It still needs to fail gracefully. The fact that an AI decided to call it changes nothing about your obligation to validate the inputs.

A function calling design that I now use as a default pattern: every tool function has a wrapper. The wrapper validates the arguments with Zod, checks the calling user's permissions against the action, enforces a timeout, logs the call with the user ID and the arguments, and only then invokes the actual business logic. If validation fails, the wrapper returns a structured error that the model can interpret and explain to the user. The model doesn't get to call my database functions directly. It calls my wrappers. The wrappers call my functions.
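
A minimal sketch of that wrapper pattern, under a few assumptions: the names (wrapTool, ToolSpec, queryRows) are hypothetical, a hand-written validator stands in for the Zod schema so the example has no dependencies, and console.log stands in for a real audit log.

```typescript
// Hypothetical names throughout. A hand-written validator stands in for a Zod schema.
type ToolResult<R> =
  | { ok: true; data: R }
  | { ok: false; error: { code: string; message: string } };

interface ToolContext {
  userId: string;
  permissions: Set<string>;
}

interface ToolSpec<A, R> {
  name: string;
  requiredPermission: string;
  timeoutMs: number;
  validate: (raw: unknown) => A; // returns typed args, or throws with a reason
  run: (args: A, ctx: ToolContext) => Promise<R>; // the actual business logic
}

function wrapTool<A, R>(spec: ToolSpec<A, R>) {
  return async (raw: unknown, ctx: ToolContext): Promise<ToolResult<R>> => {
    // Log every call with the user ID and the arguments the model produced.
    console.log(JSON.stringify({ tool: spec.name, userId: ctx.userId, args: raw }));

    let args: A;
    try {
      args = spec.validate(raw);
    } catch (e) {
      // Structured error the model can read and explain to the user.
      return { ok: false, error: { code: "INVALID_ARGUMENTS", message: (e as Error).message } };
    }
    if (!ctx.permissions.has(spec.requiredPermission)) {
      return { ok: false, error: { code: "FORBIDDEN", message: "missing " + spec.requiredPermission } };
    }

    // Promise.race stops waiting after timeoutMs; it does not cancel the work itself.
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_resolve, reject) => {
      timer = setTimeout(() => reject(new Error("timeout")), spec.timeoutMs);
    });
    try {
      const data = await Promise.race([spec.run(args, ctx), timeout]);
      return { ok: true, data };
    } catch (e) {
      const message = (e as Error).message;
      return { ok: false, error: { code: message === "timeout" ? "TIMEOUT" : "TOOL_FAILED", message } };
    } finally {
      clearTimeout(timer);
    }
  };
}

// Example tool: the LIMIT is bounded before anything touches the database.
const queryRows = wrapTool<{ table: string; limit: number }, string>({
  name: "queryRows",
  requiredPermission: "rows:read",
  timeoutMs: 2000,
  validate: (raw) => {
    const r = raw as { table?: unknown; limit?: unknown } | null;
    if (!r || (r.table !== "documents" && r.table !== "invoices")) {
      throw new Error("table must be 'documents' or 'invoices'");
    }
    if (typeof r.limit !== "number" || !Number.isInteger(r.limit) || r.limit < 1 || r.limit > 100) {
      throw new Error("limit must be an integer between 1 and 100");
    }
    return { table: r.table, limit: r.limit };
  },
  run: async (args) => "would fetch up to " + args.limit + " rows from " + args.table,
});
```

With this in place, a model that asks for half a million rows gets back an INVALID_ARGUMENTS error it can explain to the user, and the query never runs.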

That single architectural decision has prevented at least four production incidents in the last six months.

Streaming is not what you think it is

We added streaming because users complained the AI felt slow. Three seconds of dead air before the response appeared is brutal UX, even if the total response time is fine. Streaming fixes that — tokens start appearing within 200ms, and the perceived latency drops to almost nothing.

The implementation was straightforward. Server-Sent Events from the backend, EventSource on the frontend, tokens piped through as they arrived. We shipped it on a Friday afternoon, the standard mistake.

By Monday we had three new bugs in the tracker. One was about partial responses being saved to the database when the user closed the tab mid-stream. One was about the analytics dashboard showing completion rates dropping because we were measuring "request finished" but never "response actually consumed by client." One was about error states — when the OpenAI stream errored halfway through, the user saw half a sentence and a spinning wheel forever.

Streaming changes more than just the response delivery. It changes your error model. Your retry strategy. Your billing model (you're charged for what was generated, not what was consumed). Your testing strategy. Your monitoring.

The design pattern that actually works for production streaming: the backend buffers everything as it streams to the client, and only commits to the database after the stream successfully closes. Every chunk carries a sequence number. The client tracks which sequences it received, so on reconnect it can ask for what it missed. Errors mid-stream send a structured error event that the frontend can render properly, not a TCP reset. The backend tracks "first token latency" and "stream completion rate" as separate metrics — the second one is what actually matters for user experience, and it's the one most teams forget to measure.
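
The buffering and sequence-number part of that pattern can be sketched roughly like this. Everything here is hypothetical naming: `send` stands in for the Server-Sent Events write, `commit` stands in for the database write, and the event shapes are an assumption rather than any standard.

```typescript
// Hypothetical names; `send` stands in for the SSE write, `commit` for the database write.
type StreamEvent =
  | { type: "chunk"; seq: number; text: string }
  | { type: "error"; seq: number; message: string }
  | { type: "done"; seq: number };

class BufferedStream {
  private events: StreamEvent[] = [];
  private closed = false;
  private send: (event: StreamEvent) => void;
  private commit: (fullText: string) => void;

  constructor(send: (event: StreamEvent) => void, commit: (fullText: string) => void) {
    this.send = send;
    this.commit = commit;
  }

  // Every event is kept in the buffer for replay and forwarded to the client.
  private emit(event: StreamEvent): void {
    this.events.push(event);
    this.send(event);
  }

  // A token arrived from the model: number it and pass it on.
  push(text: string): void {
    if (this.closed) return;
    this.emit({ type: "chunk", seq: this.events.length, text });
  }

  // The upstream stream closed cleanly: only now is the response persisted.
  finish(): void {
    if (this.closed) return;
    this.closed = true;
    let fullText = "";
    for (const e of this.events) {
      if (e.type === "chunk") fullText += e.text;
    }
    this.emit({ type: "done", seq: this.events.length });
    this.commit(fullText);
  }

  // The upstream stream failed: send a structured error event and persist nothing.
  fail(message: string): void {
    if (this.closed) return;
    this.closed = true;
    this.emit({ type: "error", seq: this.events.length, message });
  }

  // On reconnect the client reports the last sequence number it received.
  replayAfter(lastSeq: number): StreamEvent[] {
    return this.events.filter((e) => e.seq > lastSeq);
  }
}
```

Because `commit` only runs inside `finish()`, a stream that errors halfway never leaves half a sentence in the database, and the frontend gets an explicit error event to render instead of a spinner.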

Streaming is one of those features that looks like a small UI improvement and turns out to be a complete rewrite of your response handling pipeline. Plan for that.

Why your cache is wasting money (and how to fix it)

Back to the ₹40,000 bill. The reason it was so high was that we had no caching layer. Every request hit OpenAI fresh. Even when ten users asked variations of the same question within an hour, we paid for ten separate completions.

The first instinct is exact-match caching. Hash the prompt, store the response in Redis, return the cached value if we see the same prompt again. This is fine. It catches maybe 5% of real-world cases. Users don't ask identical questions. They ask similar questions with slight variations.
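
For reference, exact-match caching is only a few lines. This is a sketch under assumptions: a Map stands in for Redis, a small FNV-1a style hash stands in for a stronger hash such as SHA-256, and the function names are hypothetical.

```typescript
// A Map stands in for Redis, and a small FNV-1a style hash stands in for
// a stronger hash such as SHA-256. All names here are hypothetical.
function hashPrompt(prompt: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < prompt.length; i++) {
    hash ^= prompt.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16);
}

class ExactMatchCache {
  private store = new Map<string, { prompt: string; response: string }>();

  get(prompt: string): string | undefined {
    const entry = this.store.get(hashPrompt(prompt));
    // Compare the stored prompt too, so a hash collision cannot serve a wrong answer.
    return entry !== undefined && entry.prompt === prompt ? entry.response : undefined;
  }

  set(prompt: string, response: string): void {
    this.store.set(hashPrompt(prompt), { prompt, response });
  }
}

// Return the cached response when the exact prompt was seen before, else call the model.
async function completeWithCache(
  prompt: string,
  cache: ExactMatchCache,
  callModel: (prompt: string) => Promise<string>,
): Promise<{ response: string; cacheHit: boolean }> {
  const cached = cache.get(prompt);
  if (cached !== undefined) return { response: cached, cacheHit: true };
  const response = await callModel(prompt);
  cache.set(prompt, response);
  return { response, cacheHit: false };
}
```

The weakness is visible in the lookup itself: "Summarise this contract" and "Summarize this contract" hash differently, so the second one pays for a fresh completion.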

What you actually want is semantic caching. The idea: when a new prompt comes in, embed it and check if any semantically similar prompt in the cache has a response that would also be valid for this one. If yes, serve from cache. If no, hit the model.

The implementation has a few subtleties that matter. The similarity threshold needs to be tuned for your specific use case — we ended up at 0.93 cosine similarity. Anything lower and we served wrong answers. Anything higher and we barely caught anything. The threshold isn't a constant you copy from a blog post. It's a design parameter you tune against your actual traffic.
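
A minimal sketch of the lookup, with assumptions stated up front: the class and method names are hypothetical, a linear scan stands in for a real vector index, and the embeddings are assumed to come from whatever embedding model you already use. The 0.93 default is the value from our traffic, not a recommendation.

```typescript
// Hypothetical names. A linear scan stands in for a vector index; embeddings
// are supplied by the caller from whatever embedding model is in use.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vectors match nothing
  return dot / Math.sqrt(normA * normB);
}

interface SemanticEntry {
  embedding: number[];
  response: string;
}

class SemanticCache {
  private entries: SemanticEntry[] = [];
  private threshold: number;

  // The threshold is a tuning parameter; 0.93 is only the value quoted in this article.
  constructor(threshold: number = 0.93) {
    this.threshold = threshold;
  }

  // Return the most similar cached response at or above the threshold, if any.
  lookup(embedding: number[]): { response: string; similarity: number } | undefined {
    let best: { response: string; similarity: number } | undefined;
    for (const entry of this.entries) {
      const similarity = cosineSimilarity(embedding, entry.embedding);
      if (similarity >= this.threshold && (best === undefined || similarity > best.similarity)) {
        best = { response: entry.response, similarity };
      }
    }
    return best;
  }

  store(embedding: number[], response: string): void {
    this.entries.push({ embedding: embedding.slice(), response });
  }
}
```

Logging the `similarity` of every hit alongside whether the served answer was later judged correct is one way to collect the data needed to tune the threshold against real traffic.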

Cache invalidation is harder than usual because there's no clean trigger. The underlying knowledge might change. The model might be updated. The prompt template might evolve. We added TTLs of varying lengths based on query type — factual queries about static data get longer TTLs, queries about user-specific or time-sensitive data get shorter ones or skip the cache entirely.
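
One way to express a TTL policy like that is a small lookup table keyed by query type. The three categories and the durations below are illustrative assumptions, not the values we run, and the class name is hypothetical.

```typescript
// The categories and durations below are illustrative assumptions, not measured values.
type QueryKind = "static_fact" | "time_sensitive" | "user_specific";

// null means "do not cache this kind of query at all".
const TTL_SECONDS: Record<QueryKind, number | null> = {
  static_fact: 24 * 60 * 60,
  time_sensitive: 5 * 60,
  user_specific: null,
};

class TtlCache {
  private store = new Map<string, { value: string; expiresAt: number }>();
  private now: () => number;

  // The clock is injectable so expiry can be tested without waiting.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Returns false when the query kind bypasses the cache entirely.
  set(key: string, value: string, kind: QueryKind): boolean {
    const ttl = TTL_SECONDS[kind];
    if (ttl === null) return false;
    this.store.set(key, { value, expiresAt: this.now() + ttl * 1000 });
    return true;
  }

  get(key: string): string | undefined {
    const entry = this.store.get(key);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.store.delete(key); // lazily evict expired entries on read
      return undefined;
    }
    return entry.value;
  }
}
```

Putting the prompt template version into the cache key is a cheap complement to TTLs: when the template changes, old entries simply stop matching.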

The most important design decision: never cache something with PII in the prompt. Ever. Even if it's "anonymised." The cache is a shared resource and the principle of least privilege means user A should never get a response that was generated for user B, no matter how similar the prompt looked.

After implementing semantic caching properly, our cache hit rate stabilised at around 47%. The OpenAI bill dropped from ₹40,000 a week to ₹6,200. Same feature. Same users. Same quality of answers. Just less stupid about asking the model the same thing twice.

Observability is the part where most teams quit

This is the section that matters most and gets the least attention in tutorials.

Standard APM tools don't understand AI workloads. Datadog can tell you your endpoint took 2.3 seconds. It can't tell you whether that 2.3 seconds was a slow first token or a slow stream. It can tell you your error rate. It can't tell you that 30% of your "successful" responses were schema validation failures that you silently retried.

The metrics that matter for AI features are different. Cost per request. Tokens in, tokens out, broken down by model. Cache hit rate. Schema compliance rate. First token latency separate from total latency. Retry counts. Fallback invocations. Content filter triggers. Prompt injection attempts. Budget burn rate per user, per tenant, per feature.

None of these come out of the box. You instrument them yourself. And the instrumentation has to happen at the AI service layer, not the HTTP layer, because the same HTTP request might involve five model calls and you need to see them individually.

What I run now: every AI call goes through a single function that wraps the actual provider call. That wrapper records the model, the prompt version, the input token count, the output token count, the cost, the latency, the cache status (hit, miss, bypassed), the validation result, and the user context. It writes a row to a time-series store and emits OpenTelemetry traces. The dashboard built on top of this is the only reason I can sleep at night.
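
A stripped-down sketch of what such a wrapper might look like. The names, the record shape, and the per-token prices are all hypothetical; `sink` stands in for the time-series write and the trace export, and `provider` stands in for the real SDK call, so nothing here depends on a specific vendor API.

```typescript
// Hypothetical names and prices. `sink` stands in for the time-series write and
// trace export; `provider` stands in for the real SDK call.
type CacheStatus = "hit" | "miss" | "bypassed";

interface AiCallRecord {
  model: string;
  promptVersion: string;
  userId: string;
  inputTokens: number;
  outputTokens: number;
  costInr: number;
  latencyMs: number;
  cacheStatus: CacheStatus;
  validationOk: boolean;
}

interface AiCallRequest {
  model: string;
  promptVersion: string;
  prompt: string;
  userId: string;
  useCache: boolean;
}

interface AiCallDeps {
  provider: (model: string, prompt: string) => Promise<{ text: string; inputTokens: number; outputTokens: number }>;
  cache: Map<string, string>;
  validate: (text: string) => boolean;
  inrPer1kTokens: { input: number; output: number };
  sink: (record: AiCallRecord) => void;
  now: () => number;
}

async function instrumentedCall(req: AiCallRequest, deps: AiCallDeps): Promise<string> {
  const startedAt = deps.now();
  const cacheKey = req.model + "|" + req.promptVersion + "|" + req.prompt;
  let cacheStatus: CacheStatus = req.useCache ? "miss" : "bypassed";
  let text = req.useCache ? deps.cache.get(cacheKey) : undefined;
  let inputTokens = 0;
  let outputTokens = 0;

  if (text !== undefined) {
    cacheStatus = "hit"; // served from cache: no tokens billed
  } else {
    const result = await deps.provider(req.model, req.prompt);
    text = result.text;
    inputTokens = result.inputTokens;
    outputTokens = result.outputTokens;
  }

  const validationOk = deps.validate(text);
  // Only valid responses from a real model call are worth caching.
  if (cacheStatus === "miss" && validationOk) deps.cache.set(cacheKey, text);

  deps.sink({
    model: req.model,
    promptVersion: req.promptVersion,
    userId: req.userId,
    inputTokens,
    outputTokens,
    costInr:
      (inputTokens / 1000) * deps.inrPer1kTokens.input +
      (outputTokens / 1000) * deps.inrPer1kTokens.output,
    latencyMs: deps.now() - startedAt,
    cacheStatus,
    validationOk,
  });
  return text;
}
```

This sketch records only calls that return; a fuller version would also write a record when the provider throws, since failed calls are exactly the ones you want to see on a dashboard.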

If you take one thing from this article, take this: instrument the cost per request from day one. Set a budget alert at the AI service layer that fires when any single user, tenant, or endpoint exceeds a daily threshold. The ₹40,000 problem isn't a one-time thing. It's the default outcome of shipping AI without observability. The only reason most teams haven't hit it yet is that their traffic is still low.
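
The budget alert itself does not need much machinery. A sketch, assuming in-memory counters (a real deployment would keep them in a shared store), a hypothetical class name, and an `onExceeded` callback standing in for whatever pages you:

```typescript
// Hypothetical names. In-memory counters stand in for a shared store, and
// `onExceeded` stands in for the real alerting hook.
class DailyBudgetGuard {
  private spent = new Map<string, number>();
  private alerted = new Set<string>();
  private limitInr: number;
  private onExceeded: (scope: string, day: string, totalInr: number) => void;

  constructor(limitInr: number, onExceeded: (scope: string, day: string, totalInr: number) => void) {
    this.limitInr = limitInr;
    this.onExceeded = onExceeded;
  }

  // scope is e.g. "user:42", "tenant:acme" or "endpoint:/analyze"; day is "YYYY-MM-DD".
  // Returns the running total for that scope and day.
  record(scope: string, day: string, costInr: number): number {
    const key = day + "|" + scope;
    const total = (this.spent.get(key) ?? 0) + costInr;
    this.spent.set(key, total);
    // Fire once per scope per day, at the moment the threshold is crossed.
    if (total > this.limitInr && !this.alerted.has(key)) {
      this.alerted.add(key);
      this.onExceeded(scope, day, total);
    }
    return total;
  }

  // Lets the AI service layer refuse or downgrade further calls for this scope.
  isOverBudget(scope: string, day: string): boolean {
    return (this.spent.get(day + "|" + scope) ?? 0) > this.limitInr;
  }
}
```

Calling `record()` once per scope for every AI call (user, tenant, and endpoint) gives all three views from the same cost figure.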

The architecture pattern that ties it together

After getting burned enough times, my mental model for AI features in production looks like this. There's a clean separation between the business logic of the product and the AI service layer that powers the intelligent parts. The AI service layer has its own internals — a router that picks the right provider based on cost and capability, a cache that handles both exact and semantic matching, a validator that enforces schema compliance, a budget enforcer that prevents runaway costs, and an observability layer that records everything.

The business logic doesn't know which model is being called. It doesn't know whether the response came from cache or the live API. It doesn't know how many retries happened. It calls a function called analyzeDocument() and gets back a typed result. All the AI complexity is contained behind that boundary.
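
To make that boundary concrete, here is a rough sketch. Only the name analyzeDocument() comes from this article; the result fields, the Provider interface, and the "cheapest first, fall through on failure" routing rule are assumptions for illustration, and the cache and budget layers are left out to keep it short.

```typescript
// Hypothetical shapes. Only analyzeDocument() is named in the article; the
// result fields, provider interface, and routing rule are assumptions.
interface DocumentAnalysis {
  summary: string;
  riskLevel: "low" | "medium" | "high";
}

interface Provider {
  name: string;
  inrPer1kTokens: number;
  complete: (prompt: string) => Promise<string>;
}

// Validator: turn raw model output into a typed result, or reject it.
function parseAnalysis(raw: string): DocumentAnalysis | undefined {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return undefined;
  }
  if (typeof parsed !== "object" || parsed === null) return undefined;
  const candidate = parsed as { summary?: unknown; riskLevel?: unknown };
  if (typeof candidate.summary !== "string") return undefined;
  const level = candidate.riskLevel;
  if (level === "low" || level === "medium" || level === "high") {
    return { summary: candidate.summary, riskLevel: level };
  }
  return undefined;
}

// The only function business logic calls. Which provider answered, and how
// many attempts it took, never leaks past this boundary.
async function analyzeDocument(text: string, providers: Provider[]): Promise<DocumentAnalysis> {
  // Router: cheapest provider first, fall through on errors or invalid output.
  const ordered = providers.slice().sort((a, b) => a.inrPer1kTokens - b.inrPer1kTokens);
  const prompt = "Analyse this document and reply as JSON with summary and riskLevel:\n" + text;
  for (const provider of ordered) {
    try {
      const analysis = parseAnalysis(await provider.complete(prompt));
      if (analysis !== undefined) return analysis;
    } catch {
      // Provider failed; try the next one.
    }
  }
  throw new Error("no provider returned a valid analysis");
}
```

Swapping a provider then means changing the array passed in, not the code that calls analyzeDocument().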

This containment is what lets you change providers without rewriting your application. It's what lets you A/B test prompts without touching business logic. It's what lets you respond to a 10x cost increase from one provider by routing traffic to another. It's what lets you migrate from OpenAI to Claude or Gemini or whatever comes next, on a Tuesday afternoon, without users noticing.

The teams that get AI in production right aren't the ones who write the cleverest prompts. They're the ones who build the boring infrastructure around the AI calls. The wrappers. The validators. The caches. The observability. The boring stuff is the reason the impressive stuff works.

What's next

Module 4 comes next week and that's the one I've been wanting to write since I started this series. RAG architecture, AI agents, and MCP — the skill stack that's actually moving the salary numbers right now in India. The shift from "AI that answers questions" to "AI that takes actions" is happening fast, and most teams haven't caught up yet.

If you're shipping AI features and haven't built the layers I described in this article, your cloud bill and your incident channel are about to teach you the same lessons I learned. The good news is you don't have to learn them by losing ₹40,000 like I did.

Curious — what's the highest unexpected cloud bill you've ever caught after the fact? The hindsight is always painful. Mine's now a story I tell juniors during onboarding.

Follow me for Module 4 next week.

Please visit my portfolio to learn more about me.

#GenAI #SystemDesign #BackendDevelopment #SoftwareArchitecture #AIEngineering
