Is "cost per token" enough?
No. Cost per token is the wholesale price. The number a CFO can act on is cost per outcome (per ticket resolved, per document summarized, per decision automated).
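The translation from wholesale to actionable can be sketched in a few lines. All prices and volumes below are hypothetical placeholders, not benchmarks:

```python
# Illustrative: convert raw token spend into cost per outcome.
# Prices and volumes are hypothetical placeholder numbers.

def cost_per_outcome(input_tokens: int, output_tokens: int,
                     price_in_per_1m: float, price_out_per_1m: float,
                     outcomes: int) -> float:
    """Blend input/output token spend, then divide by outcomes delivered."""
    spend = (input_tokens / 1e6) * price_in_per_1m \
          + (output_tokens / 1e6) * price_out_per_1m
    return spend / outcomes

# e.g. 40M input + 8M output tokens at $3/$15 per 1M, 10,000 tickets resolved
print(round(cost_per_outcome(40_000_000, 8_000_000, 3.0, 15.0, 10_000), 4))  # -> 0.024
```

The per-token price never changed in that calculation; only the denominator makes it a number a CFO can act on.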
How do I forecast a brand-new feature?
Use a Fermi estimate based on expected DAU times calls per session times tokens per call. Then put a confidence interval on it that is wide enough to be honest. Refit weekly until you have one month of real data.
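The Fermi estimate above can be sketched with low/base/high scenarios standing in for the confidence interval. Every input here is a hypothetical placeholder; the point is the deliberately wide band:

```python
# Fermi forecast: DAU x calls per session x tokens per call, extended to a
# month. The low/high scenarios form the (wide) confidence interval.

def monthly_tokens(dau: int, calls_per_session: float, tokens_per_call: int) -> int:
    return dau * calls_per_session * tokens_per_call * 30  # ~30 days/month

scenarios = {
    "low":  monthly_tokens(5_000, 2, 1_500),
    "base": monthly_tokens(20_000, 4, 3_000),
    "high": monthly_tokens(50_000, 8, 6_000),
}
for name, toks in scenarios.items():
    print(f"{name}: {toks / 1e9:.2f}B tokens/month")
```

A band this wide (roughly 0.5B to 70B tokens here) feels uncomfortable, which is the honest signal; weekly refits against real traffic are what narrow it.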
Should I self-host?
Self-hosting is rarely cheaper at small or medium scale once you account for engineering time, GPU underutilization, and on-call load. It becomes interesting above roughly two billion tokens per month, or where data residency makes API consumption infeasible.
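A back-of-envelope version of that threshold, assuming self-hosting is dominated by fixed monthly cost (GPUs adjusted for utilization, plus engineering and on-call) and its marginal per-token cost is negligible next to the API price. All dollar figures are hypothetical:

```python
# Break-even: tokens/month at which a fixed self-hosting bill matches
# pay-as-you-go API spend. Dollar figures are hypothetical placeholders.

def break_even_tokens_per_month(fixed_monthly_cost: float,
                                api_price_per_1m: float) -> float:
    """Assumes self-hosting marginal cost per token is ~0 after fixed costs."""
    return fixed_monthly_cost / api_price_per_1m * 1e6

# e.g. $40k/month of GPU reservations at 50% utilization ($80k effective)
# plus $40k/month of engineering and on-call, vs a $60/1M blended API price
fixed = 40_000 / 0.5 + 40_000
print(break_even_tokens_per_month(fixed, 60.0) / 1e9)  # -> 2.0 (billion tokens/month)
```

Note how utilization sits in the denominator of the GPU term: halving utilization doubles the effective fixed cost, which is why the break-even point moves so easily.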
How do I justify the gateway investment?
Show the prompt cache hit rate it unlocks (typical first three months: 30 to 50 percent of input tokens removed) and the model routing savings (typically 20 to 40 percent). Both numbers are large, and the decision is reversible if the gateway underperforms.
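One way to stack those two effects, assuming caching removes a share of input-token spend and routing then trims a share of what remains. The spend and rates below are placeholders inside the ranges quoted above:

```python
# Hypothetical gateway business case: cache removes a fraction of input-token
# spend; cheaper-model routing then trims a fraction of the remainder.

def gateway_savings(monthly_spend: float, input_share: float,
                    cache_hit_rate: float, routing_rate: float) -> float:
    after_cache = monthly_spend - monthly_spend * input_share * cache_hit_rate
    return monthly_spend - after_cache * (1 - routing_rate)

# $100k/month, 70% of spend on input tokens, 40% cache hit, 30% routing savings
print(gateway_savings(100_000, 0.7, 0.4, 0.3))  # -> 49600.0
```

The two levers compound rather than add, so order the calculation the way the gateway actually applies them.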
Do I need a separate AI budget line?
Yes. Burying AI cost inside compute or SaaS prevents the CFO from doing capacity planning. Carve out a line, even if it sits inside R&D.
How do I know when to fine-tune?
Fine-tune when the input prompt is large and stable, the volume is high, and a smaller fine-tuned model can replace a larger model on the same eval. Run the math: training cost amortized over expected volume must beat the difference in inference cost.
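That amortization check fits in one function. The training cost, volume, and per-token prices below are hypothetical:

```python
# Fine-tuning payback: months until the training cost is recovered by the
# per-token price gap between the base and fine-tuned models. Numbers are
# hypothetical placeholders.

def fine_tune_payback_months(training_cost: float,
                             monthly_tokens: float,
                             base_price_per_1m: float,
                             ft_price_per_1m: float) -> float:
    monthly_saving = monthly_tokens / 1e6 * (base_price_per_1m - ft_price_per_1m)
    if monthly_saving <= 0:
        return float("inf")  # fine-tuning never pays back
    return training_cost / monthly_saving

# $8k training run, 500M tokens/month, $15/1M base vs $3/1M fine-tuned model
print(round(fine_tune_payback_months(8_000, 500_000_000, 15.0, 3.0), 2))  # -> 1.33
```

If the payback period is longer than the horizon over which the prompt stays stable, the fine-tune loses even when the monthly saving is positive.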
How do I handle model deprecation?
Build the eval before you build the feature. When the vendor announces a deprecation, run the eval against the replacement model and renegotiate the routing policy. Keep at least one fallback model approved at all times.
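A minimal sketch of that gate, assuming you already have an eval that scores a model between 0 and 1. The `run_eval` callable, model names, and threshold are hypothetical:

```python
# Deprecation gate sketch: candidate replacement models must pass the
# pre-built eval before entering the routing policy. Names are hypothetical.

THRESHOLD = 0.90  # minimum eval score to approve a model for production

def approve_replacements(run_eval, candidates: list) -> list:
    """Return the candidates that pass the eval; block if none do."""
    approved = [m for m in candidates if run_eval(m) >= THRESHOLD]
    if not approved:
        raise RuntimeError("No replacement passes the eval; block the migration.")
    return approved

scores = {"model-next": 0.93, "model-alt": 0.88}
print(approve_replacements(lambda m: scores[m], list(scores)))  # -> ['model-next']
```

The point of building the eval first is that this gate already exists on deprecation day; the announcement triggers a run, not a scramble.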