The Problem
AI inference invoices arrive as a single consolidated charge per provider per month. The institution cannot attribute spend to a product, customer cohort, or model variant. As agent deployment scales, the bill compounds in lockstep with no corresponding visibility, and any guardrail conversation devolves into a quarterly emergency rather than a control surface.
The Detection
If the answer to the question, "what was the cost-per-inference of the customer-support agent yesterday, by model variant, at the customer cohort level," is anything other than a number rendered in seconds, the institution is below Best on this capability.
Practice Spectrum
AI inference is one consolidated invoice from the provider. Nobody can tell you the cost-per-token of the customer-support agent versus the marketing copy generator.
API keys are split per project, but attribution stops there. Per-feature, per-customer, and per-model cost is unknown.
Every inference call carries a tag for product, model, and customer cohort. Per-feature unit economics are reported monthly.
Cost-per-inference is computed per call, streamed to a real-time ledger, and exposed to the product team. Budget guardrails fire at the agent level.
AI cost is allocated per token, per call, per customer, in real time, with model card lineage and carbon disclosure attached. The agent budget is itself a controlled object.
The Outcome
A per-call cost ledger that tags every inference with model, product, customer cohort, and agent identity. A live dashboard of cost-per-inference by domain. A budget guardrail policy that fires at the agent level, before runaway prompts compound into runaway invoices.
Cost delta
-12 to -28 percent inference spend within ninety days
Efficiency
+18 efficiency points (Score V2)
Value lift
+9 value points (Score V2)
Risk reduction
-11 risk points (Score V2)
Ship It
Step 01
List every product, agent, batch job, and ad-hoc surface that calls a hosted model. Capture the provider, the model identifier, the responsible team, and the current monthly invoice line. Treat any surface without a named owner as a control finding.
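One way to keep this inventory machine-checkable rather than a static spreadsheet is a flat table with an automated ownership check. A minimal Python sketch; the field names and sample rows are illustrative, not prescribed by the playbook:

```python
import csv
import io

# Illustrative inventory rows: one per inference surface.
SURFACES = [
    {"surface": "customer-support-agent", "provider": "openai",
     "model": "gpt-4o", "owner": "support-eng", "monthly_invoice_usd": 4100},
    {"surface": "marketing-copy-batch", "provider": "anthropic",
     "model": "claude-sonnet", "owner": "", "monthly_invoice_usd": 950},
]

def control_findings(surfaces):
    """Any surface without a named owner is a control finding."""
    return [s["surface"] for s in surfaces if not s["owner"].strip()]

def to_csv(surfaces):
    """Render the inventory as the CSV evidence artifact."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(surfaces[0]))
    writer.writeheader()
    writer.writerows(surfaces)
    return buf.getvalue()

print(control_findings(SURFACES))  # surfaces flagged for a missing owner
```

Running the check on every commit to the inventory file turns "no named owner" from a discovery into a build failure.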
Step 02
Issue a distinct API key, project, or service account per inference surface. Rotate any shared credentials. The aim is that every billable call can be traced to a single owner without log-side reconstruction.
gcloud projects create inference-customer-support \
  --organization=0123456789
gcloud billing projects link inference-customer-support \
  --billing-account=01ABCD-234EFG-567HIJ
Step 03
Adopt the UFMS:003:2026 inference annex. Every inference call must carry: model_id, product_id, customer_cohort, agent_id, and intent. Bake the tag injection into the SDK wrapper used by every service. Reject calls that omit any required tag at the wrapper level.
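The reject-at-the-wrapper rule can be sketched as a pre-call validator. The five tag names come from the annex as listed above; everything else (function names, the transport placeholder) is a hypothetical illustration, since the annex does not prescribe an implementation:

```python
REQUIRED_TAGS = {"model_id", "product_id", "customer_cohort", "agent_id", "intent"}

class MissingTagError(ValueError):
    """Raised when a call omits a required UFMS:003 tag."""

def validate_tags(tags: dict) -> dict:
    """Reject any call whose payload omits, or leaves empty, a required tag."""
    missing = REQUIRED_TAGS - {k for k, v in tags.items() if v}
    if missing:
        raise MissingTagError(f"call rejected, missing tags: {sorted(missing)}")
    return tags

def tagged_call(prompt: str, tags: dict, transport=lambda p: p):
    # `transport` stands in for the real provider SDK call.
    validate_tags(tags)
    return transport(prompt)
```

Because the check runs before the provider SDK is invoked, an untagged call never becomes a billable event.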
Step 04
Pipe every inference call (cost, tokens, latency, tags, model variant) to a BigQuery table partitioned by day. Avoid sampling. The ledger is the institutional source of truth and must be complete enough to defend against an auditor or a regulator.
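A ledger row can be assembled in the wrapper before it is streamed to the warehouse. A sketch assuming illustrative per-million-token prices; real prices come from the provider's rate card, and the tag fields follow the Step 03 annex:

```python
from datetime import datetime, timezone

# Illustrative prices in USD per million tokens, NOT real rate-card values.
PRICE_PER_MTOK = {"gpt-4o": {"input": 2.50, "output": 10.00}}

def ledger_row(model_id, input_tokens, output_tokens, tags):
    """Build one per-call ledger row: cost, tokens, timestamp, and tags."""
    p = PRICE_PER_MTOK[model_id]
    cost = (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "model_id": model_id,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": round(cost, 6),
        **tags,  # product_id, customer_cohort, agent_id, intent
    }
```

Emitting the row synchronously with the call, rather than reconstructing it from invoices, is what makes the ledger complete enough to defend.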
Step 05
Render cost-per-inference by domain, product, cohort, and model variant. Layer in carbon attribution from the GreenOps annex. The dashboard is the single page the AI/ML lead opens every morning.
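The dashboard's core metric is a group-by over the ledger. A minimal in-memory sketch over rows shaped like the Step 04 ledger; in production this would be a warehouse query, not application code:

```python
from collections import defaultdict

def cost_per_inference(rows):
    """Aggregate ledger rows into cost-per-inference by (product_id, model_id)."""
    totals = defaultdict(lambda: {"cost": 0.0, "calls": 0})
    for r in rows:
        key = (r["product_id"], r["model_id"])
        totals[key]["cost"] += r["cost_usd"]
        totals[key]["calls"] += 1
    return {k: v["cost"] / v["calls"] for k, v in totals.items()}
```

The same aggregation, keyed on customer_cohort or agent_id instead, yields the other dashboard cuts.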
Step 06
For every agent, declare a daily and monthly budget in code. The wrapper SDK halts further calls and pages the owner once the threshold is breached. Treat the guardrail itself as a controlled object with a documented exception path.
{
"agent_id": "support-l1",
"daily_budget_usd": 250,
"monthly_budget_usd": 6500,
"on_breach": "halt-and-page",
"owner": "ai-ml-lead@example.com"
}
Step 07
Schedule a thirty-minute weekly review of the top-five most expensive agents, the top-five fastest-growing agents, and the top-five lowest-utilised agents. Each review produces a documented decision: optimise, cap, or sunset.
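The three shortlists fall straight out of weekly per-agent aggregates of the ledger. A sketch; the aggregate shapes (spend and call-count dicts keyed by agent_id) are assumptions, not a prescribed interface:

```python
def review_shortlists(spend_this_week, spend_last_week, calls, n=5):
    """Produce the three weekly review shortlists from per-agent aggregates."""
    # Top-n most expensive agents by this week's spend.
    most_expensive = sorted(spend_this_week, key=spend_this_week.get, reverse=True)[:n]
    # Top-n fastest-growing agents by week-over-week spend delta.
    growth = {a: spend_this_week[a] - spend_last_week.get(a, 0.0)
              for a in spend_this_week}
    fastest_growing = sorted(growth, key=growth.get, reverse=True)[:n]
    # Top-n lowest-utilised agents by call count.
    utilisation = {a: calls.get(a, 0) for a in spend_this_week}
    lowest_utilised = sorted(utilisation, key=utilisation.get)[:n]
    return most_expensive, fastest_growing, lowest_utilised
```

Generating the shortlists mechanically keeps the thirty-minute meeting focused on the decision (optimise, cap, or sunset) rather than on data assembly.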
The Templates
json
Template
ufms-003-inference-schema.json
Reference schema for the five required inference tags, with example JSON payloads and validation rules.
yaml
Template
agent-budget-policy.yaml
YAML policy file describing per-agent daily and monthly budgets, breach behaviour, and notification routing.
sql
Template
inference-ledger.sql
CREATE TABLE statement for the per-call inference ledger, partitioned by day, clustered by product_id and model_id.
The Evidence
Inventory of inference surfaces
Markdown or CSV inventory listing every inference surface, owner, model, and current monthly invoice line.
Inference ledger schema
SQL DDL for the BigQuery ledger, plus a sample seven-day export to demonstrate completeness.
Agent budget policy file
YAML policy committed to a controlled repository with named approvers and a documented exception path.
Weekly review minutes
Four consecutive weekly review notes documenting decisions taken on the top inference surfaces.
The Impact
Adopters
The cohort sample is below the publish threshold (N<5). When we have at least five completions, this panel will surface the median score lift, median cost savings, and median time to complete from the IFO4 impact API.
Pair this with
AI Compute · Elite
Training runs are scheduled around GPU availability and engineer convenience, not grid carbon intensity.
Governance · Elite
Tag policies exist on paper.
SaaS · Best
SaaS subscriptions accumulate quietly through expense cards, individual purchase orders, and shadow IT.
Begin the playbook
Start the playbook, simulate the impact first, or take it to the community. Every move is logged.