Just 3adly: FinOps Beyond Cloud: Flagging Which LLM Path Runs

It’s 2026, and I pretend less and less that the work stops at merge. Plans still matter, but so does margin (revenue minus cost): if usage jumps, many users pile onto cheap tiers, or ARPU (average revenue per user) stays low, is your default technical path still affordable?

If you tilt product-minded, you often want costs and revenue to steer behavior—not just quarterly decks—and you ship guardrails (limits and alerts for spend and risk) plus metrics so typo fixes aren’t casually riding flagship AI tiers.

This post is a relaxed tour of one slice of that: treating cloud and model spend as something you design for, not something you discover on the invoice.

1. What a product-minded engineer optimizes beyond “merged”

Shipped is rarely the final step. The sharper test: multiply usage, push traffic through loss-leader tiers—does profitability still behave?
Checklist framing:
Costs and revenue should sway day-to-day decisions.
Use the cheapest good-enough path first; expensive models owe you justification.
Add observability (logs/metrics showing what paths ran in prod) early so invoices don’t become the dashboard.

2. A profitability-aware seam—not just “flip the rollout switch”

Cost-aware feature flagging is more than switching release trains. (Feature flags toggle behavior remotely without reinstalling everywhere.) Core question shifts to:
For this payer plus this job step, does calling the pricey model earn its keep right now?
Two halves:
Inputs (FinOps-style facts): tooling that exposes spend (AWS Pricing Calculator, OpenCost), quotas (vendor-enforced caps), subscription revenue—not only `beta_user=true`.
Outputs: same outward job, staged execution—offline libraries; lighter Gemini Flash-class SKU; heavier Pro-class SKU when needed; delay/batch—not one greedy lane dialing the richest model every hop. (SKU or "Stock Keeping Unit" is vendor shorthand for a priced product bundle—the name on the invoice.)
Treat it like authentication (auth—checking who acts) guarding expensive work—but margin sits beside permission.
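
To make the seam concrete, here is a minimal sketch of a lane chooser. Everything in it is an assumption for illustration: the function name `chooseLane`, the plan name `enterprise`, the budget fields, and the lane labels are hypothetical, not part of the demo server or any vendor API.

```javascript
/**
 * Pick an execution lane for one job step (hypothetical policy).
 * ctx: { plan, task, monthSpendUsd, monthBudgetUsd, urgent }
 * Returns one of: 'offline' | 'flash' | 'pro' | 'batch'.
 */
function chooseLane(ctx) {
    const { plan, task, monthSpendUsd, monthBudgetUsd, urgent } = ctx;
    // Tasks with a good-enough local implementation never leave the box.
    if (task === 'spellcheck') return 'offline';
    // Over budget: defer non-urgent work to a batch lane, degrade urgent work to offline.
    if (monthSpendUsd >= monthBudgetUsd) return urgent ? 'offline' : 'batch';
    // A higher-paying plan may justify the heavier tier; everyone else gets the lighter one.
    if (plan === 'enterprise' && task === 'summarize') return 'pro';
    return 'flash';
}

console.log(chooseLane({ plan: 'free', task: 'summarize', monthSpendUsd: 10, monthBudgetUsd: 100, urgent: false })); // flash
```

The point of the shape, rather than the specific thresholds, is that plan and spend are inputs to the same function that picks the model, so margin is checked on every call instead of once a quarter.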

3. Models answer prompts—they don’t run budgets

A reader might ask whether frontier models quietly route the simpler work to something cheaper.
Short answer: no. They optimize outputs within the SKU you bought; no accountants live there. You pick the SKU tier, and the model chases quality inside that sandbox. Toss a heavyweight tier at a petty job and it still replies, while the meter charges for requests, not vibes.
Operational decisions stay yours: routing (offline versus vendor lane), backoff (pause before retries when throttled), and hard-stop retry budgets. Models themselves won’t politely refund spend.
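
Backoff plus a hard-stop retry budget can be sketched in a few lines. This is an assumption-laden sketch, not the demo's code: `backoffDelays` and `callWithRetryBudget` are hypothetical helpers, the 500 ms base and 8 s cap are arbitrary, and throttling is detected by a numeric `status` of 429 on the thrown error.

```javascript
/** Delay schedule in ms: base * 2^attempt, capped. Pure, so it is easy to test. */
function backoffDelays(maxRetries, baseMs = 500, capMs = 8000) {
    return Array.from({ length: maxRetries }, (_, i) => Math.min(capMs, baseMs * 2 ** i));
}

/**
 * Call `fn`, retrying only when throttled, and stop once the budget is spent.
 * `isThrottled` and `sleep` are injectable so the sketch stays vendor-neutral.
 */
async function callWithRetryBudget(fn, options = {}) {
    const {
        maxRetries = 3,
        isThrottled = (err) => Boolean(err) && err.status === 429,
        sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
    } = options;
    const delays = backoffDelays(maxRetries);
    for (let attempt = 0; ; attempt++) {
        try {
            return await fn();
        } catch (err) {
            // Hard stop: non-throttle errors and exhausted budgets are rethrown.
            if (!isThrottled(err) || attempt >= maxRetries) throw err;
            await sleep(delays[attempt]);
        }
    }
}

console.log(backoffDelays(4)); // [ 500, 1000, 2000, 4000 ]
```

With `maxRetries = 3` the worst case is four vendor calls per request, which gives a known ceiling to multiply into a cost estimate.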

4. Total cost of ownership (TCO)—dashboards versus sloppy lanes

TCO includes quiet extras beyond billed tokens: wrong tier ladders, retry storms, and consolation “redo” generations after junk answers.
Log the routing branch and approximate token counts (tokens are what vendors bill on) for every request, so a stray heavyweight completion shows up on a dashboard before it shows up on the invoice. A procurement warning, with illustrative magnitudes only: junk routing has landed teams roughly two to four orders of magnitude (~100×–10,000×) hotter than guarded paths.
Prefer modest telemetry before invoices swell.
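
A modest version of that telemetry might look like the sketch below. The helper names are hypothetical, and the four-characters-per-token estimate is a rough rule of thumb for English text, not vendor metering; swap in the usage counts your provider returns when you have them.

```javascript
/** Rough token estimate (~4 characters per token); an assumption, not vendor metering. */
function approxTokens(text) {
    return Math.ceil(text.length / 4);
}

/** Build one structured log record per routed request. */
function routeLogEntry({ task, lane, inputText, outputText }) {
    return {
        ts: new Date().toISOString(),
        task,
        lane,
        approxTokensIn: approxTokens(inputText),
        approxTokensOut: approxTokens(outputText),
    };
}

// One JSON line per request is enough to chart lane mix and token volume later.
console.log(JSON.stringify(routeLogEntry({
    task: 'summarize',
    lane: 'offline',
    inputText: 'A short report body.',
    outputText: '- A short report body.',
})));
```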

5. From toy splitter to nearer-production routers

Toy setup uses HTTP `POST /v1/check` with JSON `{ text, task }` and tasks `spellcheck` or `summarize`.
When the environment variable `COST_OPTIMIZATION_ENABLED` is on, spellcheck runs through Typo.js (open-source English spellchecking) and summarize uses an extractive summary (it reuses existing sentences and invents nothing).
When it is off, Gemini 2.5 Flash (model string `gemini-2.5-flash`, which Google positions as a lighter tier) owns both passes.
Closer to prod, swap the simple “if-this-then-that” rules for nuanced routing plus subscription and billing knobs—still feeding FinOps signals.

6. Sampling experiment—methods first; numbers follow

This section lays out (1) what mechanically reran, (2) tables in later subsections, (3) how extrapolation pencils out—lab notebook plus spreadsheet honesty, not magic.

6.1 What reran—and what got skipped deliberately

Bench here means scripted automation issuing the same deterministic requests.
Ran: fabricated multi-section write-up (~290 words) with sequential jobs (spellcheck then summarize). Express route `POST /v1/check`. Flag `COST_OPTIMIZATION_ENABLED` jointly toggles offline Typo stack versus Gemini 2.5 Flash via `@google/genai`. Then a script prints timings plus playful dollar placeholders—not accounting truth.
Skipped: proportional live traffic mixtures, randomized A/B tests (A/B: compare flows across user subsets)—this isolate covers mechanism, not population behavior.
Driving question stays:
Holding document shape steady and accepting cheaper summaries, how many vendor completions vanish?
Extrapolation (spoken plainly):
Rough monthly dollars skipped ≈ (cloud completions you avoided) × (realistic dollars per completion).
Assume each workflow run is the spellcheck + summarize pair (two HTTP POSTs), so cost scales linearly with monthly runs. Hold that at two POSTs per run and multiply by how many drafts, audits, or refactors fire each month.

6.2 Head-to-head results (the sample)

Metric	Optimization ON	Optimization OFF
Requests	2	2
Offline routes	2	0
Cloud (Gemini)	0	2
Avg latency (ms)	~2378	~4788
Placeholder session total	$0.000000	$0.000230

The session total is only the sum of the demo knobs `COST_CLOUD_SPELLCHECK` + `COST_CLOUD_SUMMARY` in `server.js` ($0.00005 + $0.00018 by default): fine for a workshop, not a bill.

Plain-language takeaway for this slice: optimization on skipped every Gemini invocation that off made for these two tasks (0 vs 2 calls → 100% avoidance for this pair only). Tradeoff: extractive bullets and a full-file dictionary pass are cheap, but the spell step can land around a few seconds on a long draft (~4.7s on the first spellcheck request in my run), while the offline summary stayed single-digit ms.

6.3 Projecting to volume (drafts, edits, audits, refactors)

Illustrative token sketch—same Flash-style guesses baked into the benchmark (~2200 / 2000 in/out for the long corrective pass, ~2200 / 450 for exec summary, $0.15/M in and $0.60/M out; verify Gemini pricing). Think of a run as one time you execute spellcheck + summarize on a body—e.g. a new draft, a heavy edit, an audit pass that re-queues the pair, or a refactor that rewrites a section and re-runs the tooling.

Item	Optimization ON (this demo)	Optimization OFF (all Gemini for these two tasks)
Cloud calls / month @ 5k runs (2 POSTs per run)	0	10,000
Order-of-magnitude API $ / month	~$0	~$10.65
Avoided vs all-cloud at that volume (token model only)	~$10.65	—

Scale the multiplier: at 50k runs/month (stacked draft cycles, revision rounds, audit and compliance re-checks, refactors, anything that replays this two-step cloud path), the same all-cloud token math lands near ~$106/month for that slice alone, before you add any new AI feature. The cost-aware pattern is convincing here because the growth lever is obvious: more passes through the pipeline ⇒ more invocations ⇒ the same percentage of avoided calls buys proportionally more dollars as activity grows.
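
The token arithmetic behind those figures fits in a short script. The prices and token counts are the illustrative ones quoted in this section, not real metering, and the function names are mine.

```javascript
// Illustrative prices from this section, in dollars per million tokens.
const PRICE_IN_PER_M = 0.15;
const PRICE_OUT_PER_M = 0.60;

// Illustrative token guesses per call from this section.
const CALLS = [
    { name: 'spellcheck', tokensIn: 2200, tokensOut: 2000 },
    { name: 'summarize', tokensIn: 2200, tokensOut: 450 },
];

/** Dollars for one cloud call under the illustrative prices. */
function dollarsPerCall({ tokensIn, tokensOut }) {
    return (tokensIn * PRICE_IN_PER_M + tokensOut * PRICE_OUT_PER_M) / 1e6;
}

/** Dollars for one run, i.e. one spellcheck + summarize pair. */
function dollarsPerRun(calls = CALLS) {
    return calls.reduce((sum, call) => sum + dollarsPerCall(call), 0);
}

/** Linear projection: runs per month times dollars per run. */
function monthlyDollars(runsPerMonth, calls = CALLS) {
    return runsPerMonth * dollarsPerRun(calls);
}

console.log(dollarsPerRun().toFixed(5));       // 0.00213
console.log(monthlyDollars(5000).toFixed(2));  // 10.65
console.log(monthlyDollars(50000).toFixed(2)); // 106.50
```

Replacing the two constants and the token guesses with metered values is the whole exercise; the structure of the projection does not change.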

6.4 Where product growth quietly multiplies cost (features, not just users)

Users rarely stop at “spellcheck + summary.” Roadmaps add adjacent model tasks: e.g. AI glossary (“explain this acronym for a non-security exec”), keyword callouts next to the summary, a risk-language nudge, email-ready rewrite, or second-pass tightening. If each is implemented as another always-on Gemini call, you get multiplication: two cloud tasks become three, four, five—each time someone runs the flow—while routing stays an afterthought.

A cost-aware seam doesn’t mean shipping worse product; it means deciding per task (plan tier, cache, template, small model, batch, human review) instead of defaulting “new AI affordance ⇒ new flagship invocation.” The bench only models two tasks, but the projection mental model extends: every new task is a coefficient on monthly variable spend unless you fold it into the same router.
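
One way to see the coefficient idea is to give each task a cloud share, the fraction of runs that actually reach a vendor model. The sketch below is hypothetical throughout: the task list borrows names from the roadmap examples above, the flat $0.001 per call is a placeholder rather than a measured price, and the routed scenario simply assumes only the email rewrite still needs the cloud.

```javascript
/**
 * Monthly variable spend when each task hits the cloud on some fraction of runs.
 * cloudShare = 1 means always-on cloud; 0 means fully routed away (offline, cache, template).
 */
function monthlyVariableSpend(tasks, runsPerMonth) {
    return tasks.reduce((sum, t) => sum + runsPerMonth * t.cloudShare * t.dollarsPerCall, 0);
}

// Placeholder price per call; an assumption for illustration only.
const PLACEHOLDER_DOLLARS_PER_CALL = 0.001;
const taskNames = ['spellcheck', 'summarize', 'glossary', 'keywords', 'email-rewrite'];

// Scenario A: every task is an always-on cloud call.
const alwaysCloud = taskNames.map((name) => ({
    name,
    cloudShare: 1,
    dollarsPerCall: PLACEHOLDER_DOLLARS_PER_CALL,
}));

// Scenario B (assumed): the router keeps only the email rewrite on the cloud lane.
const routed = alwaysCloud.map((t) => ({ ...t, cloudShare: t.name === 'email-rewrite' ? 1 : 0 }));

console.log(monthlyVariableSpend(alwaysCloud, 50000).toFixed(2)); // 250.00
console.log(monthlyVariableSpend(routed, 50000).toFixed(2));      // 50.00
```

Adding a sixth always-on task raises scenario A by another full coefficient, while in scenario B it costs only as much as its cloud share.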

6.5 Why this is still a convincing case for the approach

The sample isolates mechanism: you can see exactly which HTTP paths hit the model when the flag flips—no mystery meat in “optimization.”
Volume makes small per-call numbers real: ~$0.00107/POST on the all-cloud token estimate at 10k calls/month is easy to shrug off until draft/edit/audit/refactor volume (and features) push you to 100k+ calls.
Feature creep is the hidden multiplier: routing discipline is how you ship more AI-shaped surface area without linear-to-cloud spend on every new button.

Please note: The playground code for server.js is at the bottom of this post. The post treats that run as a sample you can scale with your own monthly pipeline volume—how often teams hit spellcheck + summarize across drafts, edits, audits, refactors, and so on—and your task list. Dollar figures mix placeholder session costs and illustrative token math; absolute savings scale with users, calls per run, model tier, and how many new AI features stay cloud-default—plug in real metering before treating any number as financial guidance.

require('dotenv').config();
const express = require('express');
const { GoogleGenAI, ApiError } = require('@google/genai');
const Typo = require('typo-js');

const app = express();
app.use(express.json());

const dictionary = new Typo('en_US');

const genai = new GoogleGenAI({
    apiKey: process.env.GEMINI_API_KEY,
});

/** Placeholder $ per cloud call (tune for FinOps demos; real bills use metering). */
const COST_CLOUD_SPELLCHECK = Number(process.env.COST_CLOUD_SPELLCHECK) || 0.00005;
const COST_CLOUD_SUMMARY = Number(process.env.COST_CLOUD_SUMMARY) || 0.00018;

/** Correct spelling per word; preserves whitespace and punctuation (en_US). */
function correctWithTypo(text) {
    return text.replace(/\b[\w']+\b/g, (word) => {
        if (dictionary.check(word)) return word;
        const suggestion = dictionary.suggest(word)[0];
        return suggestion || word;
    });
}

/** Cheap path: lead sentences + pseudo-bullets (no API). */
function extractiveExecutiveSummary(text) {
    const t = text.trim().replace(/\s+/g, ' ');
    const sentences = t.split(/(?<=[.!?])\s+/).filter((s) => s.length > 15);
    const head = sentences.slice(0, 4).join(' ');
    const base =
        head.length >= 200 ? head : t.slice(0, Math.min(1200, t.length)) + (t.length > 1200 ? '…' : '');
    const lines = base
        .split(/(?<=[.!?])\s+/)
        .filter(Boolean)
        .slice(0, 5)
        .map((s) => `- ${s.trim()}`);
    return lines.join('\n');
}

/** Running placeholder total for this process; starts at 0 on every restart. */
let sessionTotalCost = 0;

function isTruthyEnv(name) {
    const v = process.env[name];
    if (v == null) return false;
    return /^(1|true|yes)$/i.test(String(v).trim());
}

/** Pull a readable message out of SDK errors (often `message` is stringified JSON). */
function geminiErrorDetail(err) {
    const msg = err && typeof err.message === 'string' ? err.message : String(err);
    try {
        const parsed = JSON.parse(msg);
        const inner = parsed && parsed.error ? parsed.error : parsed;
        if (inner && typeof inner.message === 'string') {
            return { summary: inner.message, code: inner.code, status: inner.status };
        }
    } catch (_) {
        /* use raw */
    }
    return { summary: msg };
}

app.post('/v1/check', async (req, res) => {
    const { text, task } = req.body ?? {};
    if (typeof text !== 'string' || !text.trim()) {
        return res.status(400).json({ error: 'Body must include non-empty string `text`.' });
    }
    if (task !== 'spellcheck' && task !== 'summarize') {
        return res.status(400).json({
            error: 'Body must include `task`: "spellcheck" | "summarize".',
        });
    }

    const isOptimizationOn = isTruthyEnv('COST_OPTIMIZATION_ENABLED');
    const words = text.trim().split(/\s+/);

    let output = '';
    let engine = '';
    let cost = 0;

    try {
        if (task === 'spellcheck') {
            if (isOptimizationOn) {
                // Both lengths use the same corrector; only the engine label differs.
                output = correctWithTypo(text);
                engine = words.length <= 5 ? 'Offline (Typo.js)' : 'Offline (Typo.js full document)';
                cost = 0;
            } else {
                const model = process.env.GEMINI_MODEL || 'gemini-2.5-flash';
                const response = await genai.models.generateContent({
                    model,
                    contents: [
                        'You correct spelling and obvious typos only. Preserve structure, headings, and meaning. Reply with the full corrected text only—no preamble or quotes.',
                        `Document:\n${text}`,
                    ].join('\n\n'),
                });
                const raw = response.text;
                if (raw == null || !String(raw).trim()) {
                    throw new Error('Empty response from model');
                }
                output = String(raw).trim();
                engine = `Cloud spellcheck (${model})`;
                cost = COST_CLOUD_SPELLCHECK;
            }
        } else {
            // summarize → executive summary
            if (isOptimizationOn) {
                output = extractiveExecutiveSummary(text);
                engine = 'Offline (extractive executive summary)';
                cost = 0;
            } else {
                const model = process.env.GEMINI_MODEL || 'gemini-2.5-flash';
                const response = await genai.models.generateContent({
                    model,
                    contents: [
                        'Condense the report into an executive summary for leadership: 3–5 bullet points, plain text, each line starting with "- ". Be factual; do not invent risks or metrics.',
                        `Report:\n${text}`,
                    ].join('\n\n'),
                });
                const raw = response.text;
                if (raw == null || !String(raw).trim()) {
                    throw new Error('Empty response from model');
                }
                output = String(raw).trim();
                engine = `Cloud summary (${model})`;
                cost = COST_CLOUD_SUMMARY;
            }
        }

        sessionTotalCost += cost;
        console.log(`[${new Date().toISOString()}] task=${task} Engine: ${engine} | Cost: $${cost}`);

        res.json({
            task,
            output,
            engine,
            stats: { sessionTotal: sessionTotalCost.toFixed(6) },
        });
    } catch (error) {
        if (error instanceof ApiError) {
            const { summary, code, status: bodyStatus } = geminiErrorDetail(error);
            const httpStatus =
                typeof error.status === 'number' && error.status >= 400 && error.status <= 599
                    ? error.status
                    : 502;
            console.error('[Gemini ApiError]', httpStatus, summary);
            const label =
                httpStatus === 429
                    ? 'Gemini quota or rate limit (check plan / AI Studio quotas)'
                    : 'Gemini API error';
            return res.status(httpStatus).json({
                error: label,
                details: summary,
                geminiCode: code,
                geminiStatus: bodyStatus,
            });
        }
        console.error('[Report pipeline]', error);
        res.status(500).json({ error: 'System Error', details: error.message });
    }
});

const PORT = Number(process.env.PORT) || 3000;

const server = app.listen(PORT);
server.once('listening', () => {
    console.log(`Report spellcheck + executive summary API on port ${PORT}`);
});
server.once('error', (err) => {
    console.error('Server failed to start:', err.code === 'EADDRINUSE' ? `port ${PORT} is already in use` : err.message);
    process.exit(1);
});

Tuesday, May 19, 2026