The Problem
AI inference invoices arrive as a single consolidated charge per provider per month. The institution cannot attribute spend to a product, customer cohort, or model variant. As agent deployment scales, the bill compounds in lockstep with no corresponding visibility, and any guardrail conversation devolves into a quarterly emergency rather than a control surface.
The Detection
If the answer to the question, "what was the cost-per-inference of the customer-support agent yesterday, by model variant, at the customer cohort level," is anything other than a number rendered in seconds, the institution is below Best on this capability.
Practice Spectrum
AI inference is one consolidated invoice from the provider. Nobody can tell you the cost-per-token of the customer-support agent versus the marketing copy generator.
API keys are split per project, but attribution stops there. Per-feature, per-customer, and per-model cost is unknown.
Every inference call carries a tag for product, model, and customer cohort. Per-feature unit economics are reported monthly.
Cost-per-inference is computed per call, streamed to a real-time ledger, and exposed to the product team. Budget guardrails fire at the agent level.
AI cost is allocated per token, per call, per customer, in real time, with model card lineage and carbon disclosure attached. The agent budget is itself a controlled object.
The Outcome
A per-call cost ledger that tags every inference with model, product, customer cohort, and agent identity. A live dashboard of cost-per-inference by domain. A budget guardrail policy that fires at the agent level, before runaway prompts compound into runaway invoices.
Cost delta
-12 to -28 percent inference spend within ninety days
Efficiency
+18 efficiency points (Score V2)
Value lift
+9 value points (Score V2)
Risk reduction
-11 risk points (Score V2)
Ship It
Step 01
List every product, agent, batch job, and ad-hoc surface that calls a hosted model. Capture the provider, the model identifier, the responsible team, and the current monthly invoice line. Treat any surface without a named owner as a control finding.
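One way to keep this inventory machine-checkable rather than a static spreadsheet is a flat table with an automated ownership check. A minimal Python sketch; the field names and sample rows are illustrative, not prescribed by the playbook:

```python
import csv
import io

# Illustrative inventory rows: one per inference surface.
SURFACES = [
    {"surface": "customer-support-agent", "provider": "openai",
     "model": "gpt-4o", "owner": "support-eng", "monthly_invoice_usd": 4100},
    {"surface": "marketing-copy-batch", "provider": "anthropic",
     "model": "claude-sonnet", "owner": "", "monthly_invoice_usd": 950},
]

def control_findings(surfaces):
    """Any surface without a named owner is a control finding."""
    return [s["surface"] for s in surfaces if not s["owner"].strip()]

def to_csv(surfaces):
    """Render the inventory as the CSV evidence artifact."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(surfaces[0]))
    writer.writeheader()
    writer.writerows(surfaces)
    return buf.getvalue()

print(control_findings(SURFACES))  # surfaces flagged for a missing owner
```

Running the check on every commit to the inventory file turns "no named owner" from a discovery into a build failure.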
Step 02
Issue a distinct API key, project, or service account per inference surface. Rotate any shared credentials. The aim is that every billable call can be traced to a single owner without log-side reconstruction.
gcloud projects create inference-customer-support \
  --organization=0123456789
gcloud billing projects link inference-customer-support \
  --billing-account=01ABCD-234EFG-567HIJ
Step 03
Adopt the UFMS:003:2026 inference annex. Every inference call must carry: model_id, product_id, customer_cohort, agent_id, and intent. Bake the tag injection into the SDK wrapper used by every service. Reject calls that omit any required tag at the wrapper level.
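The reject-at-the-wrapper rule can be sketched as a pre-call validator. The five tag names come from the annex as listed above; everything else (function names, the transport placeholder) is a hypothetical illustration, since the annex does not prescribe an implementation:

```python
REQUIRED_TAGS = {"model_id", "product_id", "customer_cohort", "agent_id", "intent"}

class MissingTagError(ValueError):
    """Raised when a call omits a required UFMS:003 tag."""

def validate_tags(tags: dict) -> dict:
    """Reject any call whose payload omits, or leaves empty, a required tag."""
    missing = REQUIRED_TAGS - {k for k, v in tags.items() if v}
    if missing:
        raise MissingTagError(f"call rejected, missing tags: {sorted(missing)}")
    return tags

def tagged_call(prompt: str, tags: dict, transport=lambda p: p):
    # `transport` stands in for the real provider SDK call.
    validate_tags(tags)
    return transport(prompt)
```

Because the check runs before the provider SDK is invoked, an untagged call never becomes a billable event.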
Step 04
Pipe every inference call (cost, tokens, latency, tags, model variant) to a BigQuery table partitioned by day. Avoid sampling. The ledger is the institutional source of truth and must be complete enough to defend against an auditor or a regulator.
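A ledger row can be assembled in the wrapper before it is streamed to the warehouse. A sketch assuming illustrative per-million-token prices; real prices come from the provider's rate card, and the tag fields follow the Step 03 annex:

```python
from datetime import datetime, timezone

# Illustrative prices in USD per million tokens, NOT real rate-card values.
PRICE_PER_MTOK = {"gpt-4o": {"input": 2.50, "output": 10.00}}

def ledger_row(model_id, input_tokens, output_tokens, tags):
    """Build one per-call ledger row: cost, tokens, timestamp, and tags."""
    p = PRICE_PER_MTOK[model_id]
    cost = (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "model_id": model_id,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": round(cost, 6),
        **tags,  # product_id, customer_cohort, agent_id, intent
    }
```

Emitting the row synchronously with the call, rather than reconstructing it from invoices, is what makes the ledger complete enough to defend.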
Step 05
Render cost-per-inference by domain, product, cohort, and model variant. Layer in carbon attribution from the GreenOps annex. The dashboard is the single page the AI/ML lead opens every morning.
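The dashboard's core metric is a group-by over the ledger. A minimal in-memory sketch over rows shaped like the Step 04 ledger; in production this would be a warehouse query, not application code:

```python
from collections import defaultdict

def cost_per_inference(rows):
    """Aggregate ledger rows into cost-per-inference by (product_id, model_id)."""
    totals = defaultdict(lambda: {"cost": 0.0, "calls": 0})
    for r in rows:
        key = (r["product_id"], r["model_id"])
        totals[key]["cost"] += r["cost_usd"]
        totals[key]["calls"] += 1
    return {k: v["cost"] / v["calls"] for k, v in totals.items()}
```

The same aggregation, keyed on customer_cohort or agent_id instead, yields the other dashboard cuts.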
Step 06
For every agent, declare a daily and monthly budget in code. The wrapper SDK halts further calls and pages the owner once the threshold is breached. Treat the guardrail itself as a controlled object with a documented exception path.
{
"agent_id": "support-l1",
"daily_budget_usd": 250,
"monthly_budget_usd": 6500,
"on_breach": "halt-and-page",
"owner": "ai-ml-lead@example.com"
}
Step 07
Schedule a thirty-minute weekly review of the top-five most expensive agents, the top-five fastest-growing agents, and the top-five lowest-utilised agents. Each review produces a documented decision: optimise, cap, or sunset.
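The three shortlists fall straight out of weekly per-agent aggregates of the ledger. A sketch; the aggregate shapes (spend and call-count dicts keyed by agent_id) are assumptions, not a prescribed interface:

```python
def review_shortlists(spend_this_week, spend_last_week, calls, n=5):
    """Produce the three weekly review shortlists from per-agent aggregates."""
    # Top-n most expensive agents by this week's spend.
    most_expensive = sorted(spend_this_week, key=spend_this_week.get, reverse=True)[:n]
    # Top-n fastest-growing agents by week-over-week spend delta.
    growth = {a: spend_this_week[a] - spend_last_week.get(a, 0.0)
              for a in spend_this_week}
    fastest_growing = sorted(growth, key=growth.get, reverse=True)[:n]
    # Top-n lowest-utilised agents by call count.
    utilisation = {a: calls.get(a, 0) for a in spend_this_week}
    lowest_utilised = sorted(utilisation, key=utilisation.get)[:n]
    return most_expensive, fastest_growing, lowest_utilised
```

Generating the shortlists mechanically keeps the thirty-minute meeting focused on the decision (optimise, cap, or sunset) rather than on data assembly.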
The Templates
json
Template
ufms-003-inference-schema.json
Reference schema for the five required inference tags, with example JSON payloads and validation rules.
yaml
Template
agent-budget-policy.yaml
YAML policy file describing per-agent daily and monthly budgets, breach behaviour, and notification routing.
sql
Template
inference-ledger.sql
CREATE TABLE statement for the per-call inference ledger, partitioned by day, clustered by product_id and model_id.
The Evidence
Inventory of inference surfaces
Markdown or CSV inventory listing every inference surface, owner, model, and current monthly invoice line.
Inference ledger schema
SQL DDL for the BigQuery ledger, plus a sample seven-day export to demonstrate completeness.
Agent budget policy file
YAML policy committed to a controlled repository with named approvers and a documented exception path.
Weekly review minutes
Four consecutive weekly review notes documenting decisions taken on the top inference surfaces.
The Impact
Adopters
The cohort sample is below the publish threshold (N<5). When we have at least five completions, this panel will surface the median score lift, median cost savings, and median time to complete from the IFO4 impact API.
Pair this with
AI Compute · Elite
Training runs are scheduled around GPU availability and engineer convenience, not grid carbon intensity.
Governance · Elite
Tag policies exist on paper.
SaaS · Best
SaaS subscriptions accumulate quietly through expense cards, individual purchase orders, and shadow IT.
Begin the playbook
Start the playbook, simulate the impact first, or take it to the community. Every move is logged.