Should I use Spot for stateful workloads?
Generally no. Spot is appropriate for stateless or restartable jobs. Stateful databases and singletons should run on committed capacity. Mixed configurations are technically possible but rarely worth the operational cost.
How do I allocate shared services like ingress and DNS?
Allocate shared services proportional to namespace egress or pod count, then publish the allocation method on the chargeback report so teams can audit it. Hidden allocation methods erode trust faster than imperfect ones.
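As a minimal sketch of the proportional split described above, the following divides a shared bill across namespaces by their share of egress. The function name, the namespace names, and the dollar figures are all hypothetical; swap in pod counts as the weights if that is the method you publish.

```python
def allocate_shared_cost(shared_cost, weights):
    """Split shared_cost across namespaces in proportion to their weights.

    weights maps namespace -> egress GB (or pod count); both are assumptions
    about what your metering pipeline exports.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {ns: shared_cost * w / total for ns, w in weights.items()}


# Hypothetical monthly figures: $1,200 of ingress and DNS cost, egress in GB.
egress_gb = {"checkout": 600.0, "search": 300.0, "batch": 100.0}
allocation = allocate_shared_cost(1200.0, egress_gb)

for namespace, cost in allocation.items():
    print(f"{namespace}: ${cost:.2f}")
```

Printing the weights alongside the resulting dollar amounts on the report is one way to make the method auditable.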
OpenCost or a paid tool?
OpenCost is the right starting point. It is open source, vendor-neutral, and produces the same cost model that paid tools wrap. Move to a paid tool when you need multi-cluster aggregation, optimization recommendations, or business-unit reporting beyond what BigQuery joins can give you.
How do I size limits without breaking things?
Run VPA in recommendation mode for two weeks, then apply the p95 recommendation as the request and 1.5x p95 as the limit. Re-run quarterly.
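The arithmetic behind that rule can be sketched as follows. This is not the VPA API: it assumes you have exported a list of CPU values in millicores over the observation window, and it uses a nearest-rank percentile. The function names and the sample data are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Nearest-rank method: smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def size_container(cpu_millicores, limit_factor=1.5):
    """Return request = p95 and limit = limit_factor * p95, in millicores."""
    request = percentile(cpu_millicores, 95)
    limit = math.ceil(request * limit_factor)
    return {"request_m": request, "limit_m": limit}


# Hypothetical samples: 100m, 110m, ..., 290m.
usage = [100 + 10 * i for i in range(20)]
sizing = size_container(usage)
print(sizing)  # {'request_m': 280, 'limit_m': 420}
```

The same function works for memory if you pass bytes or MiB instead of millicores.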
What about Karpenter or Cluster Autoscaler?
Both work. Karpenter typically packs better and reacts faster on AWS; Cluster Autoscaler is the default elsewhere. The decision is operational, not financial. Either way, ensure the autoscaler is allowed to scale down aggressively and that idle nodes are torn down within ten minutes.
How do I justify GPU spend to the CFO?
Track cost per training run and cost per million inference tokens. Tie those numbers to a specific revenue model output. The CFO understands cost per outcome; they do not need to understand A100 versus H100.
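A minimal sketch of the two unit-cost metrics above, assuming you already know GPU hours per run, a blended hourly GPU rate, monthly inference spend, and tokens served. Every number and function name here is hypothetical and only illustrates the arithmetic.

```python
def cost_per_training_run(gpu_hours, hourly_rate):
    """Cost of one training run: GPU hours consumed times the blended hourly rate."""
    return gpu_hours * hourly_rate


def cost_per_million_tokens(inference_cost, tokens_served):
    """Inference spend divided by tokens served, expressed per million tokens."""
    if tokens_served <= 0:
        raise ValueError("tokens_served must be positive")
    return inference_cost / (tokens_served / 1_000_000)


# Hypothetical inputs: 512 GPU hours at $2.50/hour for one run;
# $18,000 of monthly inference spend serving 4.5 billion tokens.
run_cost = cost_per_training_run(gpu_hours=512, hourly_rate=2.50)
token_cost = cost_per_million_tokens(inference_cost=18_000.0, tokens_served=4_500_000_000)

print(f"cost per training run: ${run_cost:,.2f}")       # $1,280.00
print(f"cost per million tokens: ${token_cost:,.2f}")   # $4.00
```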
Do I need a chargeback or a showback model?
Start with showback. Show every team their bill and their efficiency for two months. Move to chargeback only if showback alone has not changed behaviour. Chargeback creates accounting friction; showback is enough most of the time.