How AI is reshaping cloud economics 

Kubernetes clusters are running at an average CPU utilization of 8%. Not 8% at night, not 8% on weekends, 8% across the full year, measured across tens of thousands of organisations on AWS, GCP, and Azure. This finding comes from Cast AI’s 2026 State of Kubernetes Optimization Benchmark Report, and shows that the majority of reserved compute in production is sitting idle while the cloud bill compounds.
 
Cloud-based infrastructure made provisioning radically easier. In doing so, it removed most of the friction that historically forced efficiency. Organisations no longer needed to plan capacity months in advance, justify hardware purchases, or live with the consequences of a bad forecast; they could just provision more. The result is a culture of overprovisioning baked into how engineering teams work. In practice, this means setting resource requests high at sprint planning, never revisiting them, and treating idle compute as an acceptable cost of reliability.
 
For most of the past decade, that tradeoff was manageable. The waste was real but diffuse, spread across CPU and memory allocations that were hard to measure and easy to ignore. FinOps emerged as a discipline to surface and manage that waste, which helped, but it largely treated the symptom rather than the cause: infrastructure is structurally difficult to optimise at the speed at which it changes.

AI workloads change the math

With AI infrastructure, the same structural problem gets a much higher cost multiplier. GPU compute is expensive by definition. A single H200 on AWS, for example, runs roughly $10 per hour on-demand, assuming the capacity is available. Cast AI’s report found that only 5% of GPUs are actively utilised, while the remaining 95% sit idle because of the mismatch between spiky demand patterns and static provisioning models. The financial implications are difficult to ignore.
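
To make the arithmetic concrete, here is a minimal Python sketch using the two figures quoted above (roughly $10 per hour and 5% utilisation). It assumes the 5% figure can be read as the fraction of paid GPU-hours that do useful work, which is a simplification, and the function names are illustrative rather than part of any tool.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def effective_cost_per_utilised_hour(hourly_rate: float, utilisation: float) -> float:
    """Price paid for each hour of useful GPU work when only a
    fraction of the paid hours is actually utilised."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in the range (0, 1]")
    return hourly_rate / utilisation


def annual_idle_spend(hourly_rate: float, utilisation: float, gpus: int = 1) -> float:
    """Yearly spend on GPU-hours that do no useful work, assuming the
    GPUs are provisioned around the clock (an assumption)."""
    return hourly_rate * (1 - utilisation) * gpus * HOURS_PER_YEAR


rate = 10.0         # USD per GPU-hour, the on-demand figure quoted above
utilisation = 0.05  # the 5% figure, read as a fraction of paid hours

print(f"Effective cost per utilised hour: ${effective_cost_per_utilised_hour(rate, utilisation):,.2f}")
print(f"Idle spend per GPU per year: ${annual_idle_spend(rate, utilisation):,.2f}")
```

Under these assumptions, each hour of useful GPU work effectively costs about $200, and a single always-on GPU accrues roughly $83,000 of idle spend per year.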
 
The demand profile exacerbates the issue. With AI inference, organisations cannot simply configure a dynamic replica count and consider the problem solved, because by default a single GPU is tied to a single model. Once capacity is exhausted, provisioning additional GPUs is rarely straightforward because of ongoing supply constraints and limited availability. Static provisioning, which was already the wrong model for Kubernetes, is a particularly poor fit for AI infrastructure, but it has become the default because there are no other easy options. This is one of the primary reasons why GPU sharing is gaining traction. For small models or large GPUs, for example, NVIDIA's Multi-Instance GPU (MIG) partitioning and other GPU sharing techniques are becoming increasingly common. GPU sharing does for GPUs what the Vertical and Horizontal Pod Autoscalers (VPA and HPA) do for CPU and memory.
 
Then there is token optimisation. The variance between model tiers can represent an order-of-magnitude difference in per-token pricing. Organisations that route every request to a top-tier model, even when cheaper tiers would meet the quality bar for most requests, are structurally overspending in a way that is almost impossible to detect without request-level cost attribution and SLO scoring. Traditional FinOps tooling, originally designed for cloud computing, has not kept pace with the costs of LLM APIs.
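
As an illustration of request-level routing against a quality bar, the Python sketch below sends each request to the cheapest tier whose quality score meets that request's minimum. The tier names, prices, quality scores, and request mix are all hypothetical values chosen for the example; in practice the scores would have to come from an organisation's own evaluations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    usd_per_million_tokens: float  # hypothetical price
    quality: float                 # hypothetical 0..1 score from offline evaluation


@dataclass(frozen=True)
class Request:
    task: str
    tokens: int
    min_quality: float  # quality bar (SLO) this request must meet


# Hypothetical tiers spanning roughly an order of magnitude in price.
TIERS = [
    Tier("small", 0.5, 0.70),
    Tier("mid", 3.0, 0.85),
    Tier("top", 15.0, 0.97),
]


def route(request: Request, tiers=TIERS) -> Tier:
    """Pick the cheapest tier that meets the request's quality bar,
    falling back to the highest-quality tier if none qualifies."""
    eligible = [t for t in tiers if t.quality >= request.min_quality]
    if not eligible:
        return max(tiers, key=lambda t: t.quality)
    return min(eligible, key=lambda t: t.usd_per_million_tokens)


def cost(request: Request, tier: Tier) -> float:
    """Cost in USD of serving one request on the given tier."""
    return request.tokens * tier.usd_per_million_tokens / 1_000_000


requests = [
    Request("classification", 2_000_000, 0.65),
    Request("summarisation", 5_000_000, 0.80),
    Request("complex reasoning", 1_000_000, 0.95),
]

top_tier = max(TIERS, key=lambda t: t.quality)
routed_cost = sum(cost(r, route(r)) for r in requests)
all_top_cost = sum(cost(r, top_tier) for r in requests)

for r in requests:
    print(f"{r.task}: routed to {route(r).name}, ${cost(r, route(r)):.2f}")
print(f"Routed total: ${routed_cost:.2f} vs all top-tier: ${all_top_cost:.2f}")
```

The per-request lines are the cost attribution; the two totals show how much of the spend depends on the routing policy rather than on the workload itself.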

Why the optimisation gap keeps growing

The gap between what organisations are spending and what they should be spending is not closing. It is widening, and there are structural reasons for this. Firstly, the feedback loop is too slow: by the time a cost anomaly surfaces in a dashboard or a quarterly review, the waste has already occurred, often for weeks. Cloud costs are a real-time problem being handled as a batch problem.
 
Secondly, the teams that understand workload behaviour do not own the cloud budget, and the teams that own the budget do not have the technical context to know which pods are overprovisioned. The optimisation work that sits between these two functions often lacks clear ownership and, as a result, receives insufficient attention. Traditional initiatives such as dashboards and more frequent reviews do little to address the structural gap. What ultimately closes it is shifting optimisation from periodic, manual intervention to continuous, automated adjustment.
 
Thirdly, the problem is genuinely complex. At scale, organisations are dealing with hundreds of nodes, multiple GPU instance types, and dozens of workloads with different demand profiles. Manual optimisation is not feasible: the surface area is too large and changes too fast for any team to cover with spreadsheets and monthly reviews.

The next phase: Efficiency as a financial imperative

The first phase of cloud adoption was about speed: deploy faster, scale faster, move faster. Infrastructure efficiency was a secondary concern, and many organisations still operate on that mental model.
 
The second phase, already underway for organisations with significant AI infrastructure, focuses on efficiency. Not because efficiency has become more virtuous, but because the cost of inefficiency has crossed a threshold where it materially impacts margins, capital allocation, and competitive position. The practical response is using agents to autonomously right-size resources, optimise instance selection, route workloads to appropriate compute tiers, and match GPU allocation to actual demand. 
 
There is an irony in this situation. AI workloads are generating the most significant cloud cost pressure the industry has seen, and AI-driven automation is the only practical way to address it at the scale modern infrastructure operates. Manual optimisation cannot keep pace with the rate of change, because the loop has to close continuously, not quarterly.

What this means for finance and engineering leadership

Engineering teams have historically been rewarded for shipping fast and maintaining reliability. Cost efficiency was typically not part of their responsibility, and was sometimes only a constraint applied after the fact. That incentive structure made sense when computing was cheap relative to the value of engineering velocity.
 
The organisations that will have an advantage in the next phase of AI adoption will close the loop between infrastructure decisions and financial outcomes fastest. This requires changing what engineering teams measure, what they are held accountable for, and how quickly the feedback between spend and behaviour reaches the people who can act on it. The tooling and automation have to catch up to the speed at which AI infrastructure costs are moving.

Token economics

The world is running on tokens: every prompt, every completion, and every agent loop consumes them. As AI adoption accelerates, token consumption is becoming the fastest-growing line item in cloud budgets, with almost none of the cost-visibility infrastructure that organisations spent the last decade building for CPU and memory.
 
The problem starts at the model tier. Frontier models from the major labs command premium pricing based on the assumption that workloads require their highest levels of performance and reasoning capability. In many cases, however, that level of sophistication is unnecessary. A large fraction of enterprise inference workloads (summarisation, classification, extraction, internal tooling, routing logic) are tasks where a well-tuned open-source model delivers equivalent results at a fraction of the per-token cost. Organisations defaulting every request to GPT-5 or Claude Opus, simply because those were used in initial demonstrations, are not making considered architectural decisions. Instead, they are introducing significant and often invisible inefficiencies at scale, largely because few organisations are measuring quality-adjusted cost on a per-request basis.
 
Rate-limiting compounds this issue further. When an organisation depends on a third-party API for inference, the provider controls the throughput ceiling. That ceiling does not care about traffic spikes, product launches, or end-of-quarter batch jobs. The natural response from finance and engineering leadership has been to ration tokens by setting quotas, gating access, and slow-rolling features. That response feels responsible, but it isn’t practical. It is the equivalent of telling an engineering team that their laptop batteries will last only five hours and can be charged once a day. Rationing does not solve the cost problem; it adds a productivity problem on top of it.
 
The actual answer is to find optimised tokens and run enough of them that rationing becomes irrelevant. Optimised tokens come from open-source models deployed on an organisation’s own infrastructure. The model quality gap between frontier APIs and open-source alternatives has narrowed dramatically. For a significant share of production workloads, it has effectively closed. When an organisation runs inference on its own GPUs, the marginal cost of an additional token approaches zero. Instead of paying per call, organisations pay for compute capacity, and if that capacity is well-utilised, the unit economics are categorically different.
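
The dependence of self-hosted unit economics on utilisation can be sketched in a few lines of Python. The $10 per hour rate is the figure quoted earlier in this article; the throughput of 2,500 tokens per second per GPU and the $3 per million tokens API price are hypothetical placeholders, not measurements, and the function names are illustrative.

```python
SECONDS_PER_HOUR = 3600


def self_hosted_cost_per_million(hourly_rate: float,
                                 tokens_per_second: float,
                                 utilisation: float) -> float:
    """Average USD cost per million tokens on a GPU paid for by the hour.

    tokens_per_second is the throughput while the GPU is busy;
    utilisation is the fraction of paid time it is busy."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in the range (0, 1]")
    tokens_per_hour = tokens_per_second * SECONDS_PER_HOUR * utilisation
    return hourly_rate / tokens_per_hour * 1_000_000


def breakeven_utilisation(hourly_rate: float,
                          tokens_per_second: float,
                          api_usd_per_million: float) -> float:
    """Utilisation above which self-hosting is cheaper per token than the API.
    A result greater than 1 means the API is cheaper at any utilisation."""
    max_tokens_per_hour = tokens_per_second * SECONDS_PER_HOUR
    return hourly_rate * 1_000_000 / (max_tokens_per_hour * api_usd_per_million)


HOURLY_RATE = 10.0        # USD per GPU-hour, figure quoted in this article
TOKENS_PER_SECOND = 2500  # hypothetical busy-time throughput per GPU
API_PRICE = 3.0           # hypothetical API price, USD per million tokens

for u in (0.05, 0.40, 0.80):
    c = self_hosted_cost_per_million(HOURLY_RATE, TOKENS_PER_SECOND, u)
    print(f"utilisation {u:.0%}: ${c:.2f} per million tokens")

b = breakeven_utilisation(HOURLY_RATE, TOKENS_PER_SECOND, API_PRICE)
print(f"break-even utilisation vs API: {b:.0%}")
```

With these placeholder inputs, the same GPU is far more expensive per token than the API at 5% utilisation and cheaper than it once utilisation passes roughly 37%, which is why the capacity has to be well-utilised for the economics to change.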

GPU fleets

The next challenge lies in GPU fleet management, where infrastructure utilisation becomes critical to the economics of AI deployment. A GPU sitting idle between inference requests is pure waste. The utilisation profiles for AI inference follow demand curves that are highly predictable at the fleet level, even when they are unpredictable at the request level: business hours in one region, quiet hours in another, and an overnight batch in a third.
 
An organisation with users across Asia, Europe, and the US is running three overlapping demand curves on the same underlying hardware. APAC peaks while the US sleeps. EU ramps up as APAC winds down. The US takes over as Europe goes dark. When treated as isolated capacity pools, each region appears underutilised for most of the day. Treated as a single fleet with autonomous workload distribution, the same GPUs run at dramatically higher utilisation across all three. Increasingly, organisations are adopting approaches such as GPU time-sharing and cross-fleet routing at the inference layer to dynamically match demand with available capacity in real time, reducing idle infrastructure and limiting the need for manual intervention. Platforms such as kimchi.dev are part of a broader wave of infrastructure tooling designed to support this type of automated inference optimisation.  
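
The effect of pooling the three regional curves can be shown with a toy model. In the Python sketch below, each region has a hypothetical demand profile (8 GPUs for a 10-hour business day, 1 GPU otherwise) offset by 8 hours from the next; the numbers are invented for illustration, and the model ignores network latency, data residency, and safety headroom.

```python
def regional_demand(start_hour_utc: int, busy_hours: int = 10,
                    peak: int = 8, base: int = 1) -> list[int]:
    """24 hourly GPU-demand values (indexed by UTC hour) for one region:
    `peak` GPUs during the business day, `base` GPUs otherwise."""
    demand = [base] * 24
    for offset in range(busy_hours):
        demand[(start_hour_utc + offset) % 24] = peak
    return demand


# Hypothetical business-day start times in UTC, 8 hours apart.
regions = {
    "APAC": regional_demand(0),
    "EU": regional_demand(8),
    "US": regional_demand(16),
}

# Isolated pools: every region must be sized for its own peak.
isolated_capacity = sum(max(curve) for curve in regions.values())

# Single fleet: capacity is sized for the peak of the combined curve.
fleet_curve = [sum(hour) for hour in zip(*regions.values())]
pooled_capacity = max(fleet_curve)

total_gpu_hours = sum(fleet_curve)
isolated_utilisation = total_gpu_hours / (isolated_capacity * 24)
pooled_utilisation = total_gpu_hours / (pooled_capacity * 24)

print(f"Isolated pools: {isolated_capacity} GPUs, {isolated_utilisation:.0%} utilised")
print(f"Single fleet:   {pooled_capacity} GPUs, {pooled_utilisation:.0%} utilised")
```

In this toy model the same daily demand is served by 17 GPUs instead of 24, because the combined curve peaks only during the short windows when two business days overlap.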

Latency misunderstandings

One objection that comes up consistently in these conversations is latency. The argument goes that inference workloads are latency-sensitive and that shared, distributed GPU fleets cannot meet the response time requirements of production applications. This is largely a misunderstanding of what matters in inference. The metric that drives user experience and throughput economics is not time-to-first-token but the total number of tokens generated per second across the fleet. An infrastructure that prioritises throughput over individual request latency serves more users, processes more workloads, and costs less per output token. The teams optimising for a 100-millisecond first-token response on a lightly loaded dedicated GPU are solving the wrong problem while leaving most of their compute idle.

Token FinOps

Token FinOps is the chapter that has not been written yet. There are no established frameworks for tracking quality-adjusted cost per request, no standard benchmarks for open-source versus frontier model substitution rates, and no tooling for continuous token routing optimisation analogous to what exists for CPU and memory. That gap is closing, but slowly. The organisations that build these capabilities now, combining open-source model deployment, on-premises inference, and autonomous GPU fleet management, are positioning themselves ahead of a cost curve that will only steepen as AI workload volumes grow. Those that fail to do so risk constraining AI adoption through cost pressures while seeing limited gains in productivity and operational efficiency.
Laurent Gil, Founder and President at Cast AI
