Inference scarcity drives 20x IPO oversubscription and $6T capital reallocation across compute stack

2026-06-09 14:38

The structural deficit that Sequoia's David Cahn identified in 2023 was never filled on the training side. It is being resolved on the inference side, and the market has begun pricing that shift only in recent weeks. With Nvidia restructuring its financial reporting around service tokens and Cerebras completing a public offering that was 20x oversubscribed, the debate over where the bottleneck lies is settled; the strategic question is now where value settles within the compute stack once inference becomes scarce. Cahn originally framed the '$2 trillion problem': for every dollar spent on a GPU, another dollar goes to powering the data center, which implies that annual GPU CapEx must generate roughly $2 trillion in revenue to recoup the capital. Even under generous revenue assumptions, a hole of more than $1.25 trillion remained between investment and what end customers actually paid, a sign that GPUs were being built ahead of real demand. By 2024, as the large vendors' CapEx ballooned, Cahn restated the gap as the '$6 trillion problem,' and the bearish case converged on one chain of logic: overbuilding leads to oversupply, and oversupply burns capital. The answer to that ledger gap was never going to come from training; it comes from inference.
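The gap arithmetic above can be restated as a short sketch. It uses only the two figures quoted in the text; the implied end-customer revenue is derived from them rather than stated in the source, and because the hole is described as "over" $1.25 trillion, the result should be read as an upper bound.

```python
# Figures quoted in the text, in trillions of USD.
REQUIRED_REVENUE = 2.0   # revenue needed to recoup GPU and data center spend
FUNDING_GAP = 1.25       # quoted shortfall ("over $1.25 trillion")


def implied_revenue(required: float, gap: float) -> float:
    """End-customer revenue implied by a required revenue and a shortfall."""
    return required - gap


def coverage_ratio(required: float, gap: float) -> float:
    """Share of the required revenue that is actually covered."""
    return implied_revenue(required, gap) / required


revenue = implied_revenue(REQUIRED_REVENUE, FUNDING_GAP)
coverage = coverage_ratio(REQUIRED_REVENUE, FUNDING_GAP)
print(f"implied end-customer revenue: at most ${revenue:.2f}T")
print(f"share of required revenue covered: at most {coverage:.1%}")
```

On these numbers, end customers were covering well under half of what the build-out needed to earn back.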

Cerebras' IPO on Thursday sent a clear market signal: the offering was 20x oversubscribed and priced at nearly double the initially set level. The demand was not speculation on a 'next Nvidia killer.' It reflected the realization that the true AI bottleneck is inference, not training. Cerebras' flagship architecture enables extremely fast inference, and that excites Wall Street because the inference market is recurring and expands with usage. Training happens once; inference never stops. Every time an agent performs a task or a model answers a query, compute is consumed. J.P. Morgan estimates the inference market at 10 to 50 times the size of the training market. As machines begin executing tasks assigned by other machines, a scenario known as agentic expansion, inference demand scales with compute itself rather than with the number of users. Data compiled by Woofun AI indicates that this shift fundamentally alters the economics of the sector, moving it from one-time capital expenditure to recurring operational cost.

Nvidia's latest quarterly earnings call provided confirmation from the top of the industry chain, with Jensen Huang making the implicit explicit: AI demand is growing exponentially because agentic AI has arrived. Mainstream AI has moved from one-shot inference to reasoning, and now to agents that call tools on their own and orchestrate tasks. Huang stated that 'tokens are now profitable,' asserting that in the AI era compute equals revenue and profit. That judgment reframes the whole industry: training is a one-time cost, inference is a recurring cost, and the bottleneck now sits in inference. Nvidia reflected this in its financial reporting by disclosing two platforms, Data Center and Edge Computing. The Data Center segment generated approximately $75 billion for the quarter, up 92% year over year, split into Hyperscale at about $38 billion and ACIE (AI Cloud and Industry Enterprise) at around $37 billion. The new Edge Computing line reached $6.4 billion, up 29% year over year, and covers agentic and physical AI running on endpoints such as PCs, robots, and cars. Although Edge currently accounts for less than 8% of total revenue, Nvidia elevated it to a 'second platform,' signaling that inference is splitting into a cloud front and an endpoint front as AI needs to see, move, and act in the physical world.
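The segment figures quoted above can be cross-checked with a few lines of arithmetic. This is a sketch on the rounded numbers in the text; treating Data Center plus Edge as total revenue is an assumption, since the text gives no total, and the year-ago values are derived from the quoted growth rates rather than reported.

```python
# Segment revenue quoted in the text, in billions of USD (all approximate).
HYPERSCALE = 38.0
ACIE = 37.0
DATA_CENTER = 75.0
EDGE = 6.4

# Assumption: total revenue is the sum of the two disclosed platforms.
total = DATA_CENTER + EDGE
edge_share = EDGE / total


def prior_year(current: float, yoy_growth: float) -> float:
    """Year-ago value implied by a current value and a year-over-year growth rate."""
    return current / (1.0 + yoy_growth)


data_center_prior = prior_year(DATA_CENTER, 0.92)
edge_prior = prior_year(EDGE, 0.29)

print(f"sub-segments sum to Data Center: {HYPERSCALE + ACIE == DATA_CENTER}")
print(f"Edge share of total: {edge_share:.1%}")
print(f"implied year-ago Data Center: ${data_center_prior:.1f}B")
print(f"implied year-ago Edge: ${edge_prior:.2f}B")
```

Under that assumption the Edge share comes out just below 8%, consistent with the figure in the text.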

The roadmap follows the same logic. Vera Rubin is set to ship in the third quarter with inference throughput of up to 35X that of Blackwell. Huang also put a new $200 billion TAM on the Vera CPU, which is designed for agentic workloads, and expects leading model companies to transition fully on day one. With the world's most valuable company reorganizing its financial disclosures around the 'service token,' the bottleneck debate is settled, and the open question is who captures value when inference becomes scarce.

This analysis focuses on cloud inference, meaning API token services provided on rented data center GPUs. That is distinct from endpoint inference, which runs on local chips such as Nvidia's Jetson or RTX and bypasses the rental stack.

Anthropic is the canary in the coal mine. Usage far exceeded pre-provisioned capacity, leading to complaints of 'lobotomized' models, rate limiting, and compressed context windows. The fix required raw compute: in May 2026, Anthropic took over the entire Colossus 1 data center from SpaceX, with 220k+ Nvidia GPUs and 300+ MW dedicated solely to inference. Woofun AI notes that this capacity unlocked a series of quota changes, signaling a shift in how inference is priced and managed.

On May 6, Anthropic doubled the five-hour limit for Claude Code, removed peak-hour throttling, and significantly raised the API rate limit for Opus. On May 13, it raised the weekly limit for Claude Code by another 50%, valid until July 13. Then, starting June 15, Anthropic carved agentic and programmatic usage out of flat subscriptions and placed it in a separate metered credit pool, billed at API prices, at $20 to $200 monthly. That move distills the whole argument: agents consume inference far faster than flat subscriptions were designed for, so pricing has to track the underlying recurring cost.

Every AI application sits on a supply chain that runs from the TSMC fab to an API endpoint, and most companies own only one layer of it. Nvidia owns silicon, CoreWeave owns bare metal, Together AI owns inference optimization, and OpenRouter owns model API routing.

Hyperbolic is the exception. It launched its on-demand GPU marketplace in June 2025 and passed 200,000 developers within its first months. Its architecture is unusual: it does not own a single GPU, sourcing cards instead from neoclouds and data centers including CoreWeave, Lambda Labs, and Nebius. Owning no hardware works as a moat, because Hyperbolic sees in real time who is buying which GPU, at what price, and when, and can identify oversupply and demand spikes before they reach the market.

Hyperbolic's moat is this multi-cloud aggregation: it stitches fragmented capacity from dozens of independent clouds into one standardized pool, so developers can rent the cheapest available GPUs anywhere without negotiating with each operator. The more clouds it connects to, the deeper the market becomes. The team is exploring how to use this data to model the GPU price curve and eventually deploy proprietary capital to smooth supply and demand, acting as a market maker for physical compute, though that goal is still at an early stage. For now, the compounding factor is the aggregation layer itself: connecting more clouds increases aggregated supply, which deepens the market with real-time pricing data, which in turn enables smarter routing and longer-term pricing models. That cycle attracts more developers and more clouds, creating a network effect no other company is attempting. Hyperbolic is the only entity spanning the GPU rental layer, the deployment layer, and the model API layer at the same time.

Venice is the clearest expression of the inference economy at the application layer, and a useful contrast to Hyperbolic's position. It is a privacy-first inference application that offers OpenAI-compatible APIs and consumer subscriptions and routes requests to about 75 models, two-thirds of which are open-source or self-hosted.

Venice has no meaningful compute of its own. It rents from undisclosed GPU partners and from confidential computing suppliers such as NEAR AI Cloud and Phala, so its true cost of revenue is inference compute, not SaaS hosting. What Venice sells is privacy: it wraps commoditized inference in a layer of assurance, with no data retention, no use of data for training, and anonymized requests, and with part of the workload running in a TEE. The underlying compute is a commodity, and the premium is the privacy packaging. That assurance comes in tiers. For open-source models on self-controlled or TEE GPUs, it is close to end-to-end confidential computing; for anonymous pass-through to closed-source models like Claude or GPT, privacy is limited to de-identification. Venice's gross profit equals the subscription price minus the inference cost it pays downstream, so its margins are constrained by frontier-model pass-through pricing. The token design encodes this demand: VVV is used for staking and platform access, and DIEM is an inference credit, with each DIEM roughly equivalent to $1 worth of compute per day. Paid subscriptions trigger programmatic buybacks and burns of VVV, while emissions decrease on a fixed schedule from 6M to 3M per month by July 1. Approximately $103,000 was burned in April and May, climbing toward $110,000 in June, well below the $200,000 monthly threshold. Woofun AI analysis suggests that the publicly circulated '$70 million ARR' figure is likely subscription renewals mistaken for net new ACV, and that a defensible observable range is closer to $6 million to $15 million ARR.
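The margin identity and the token figures above can be put into a small sketch. The subscription price and inference costs in the example calls are hypothetical placeholders, not Venice's actual numbers; only the DIEM value of roughly $1 of compute per day, the $110,000 June burn, and the $200,000 threshold come from the text.

```python
def gross_profit(subscription_price: float, inference_cost: float) -> float:
    """Gross profit per subscriber: price minus inference cost paid downstream."""
    return subscription_price - inference_cost


def gross_margin(subscription_price: float, inference_cost: float) -> float:
    """Gross profit as a fraction of the subscription price."""
    return gross_profit(subscription_price, inference_cost) / subscription_price


def diem_daily_compute_usd(diem_held: float, usd_per_diem: float = 1.0) -> float:
    """Daily compute budget in USD, using the text's rough $1 per DIEM per day."""
    return diem_held * usd_per_diem


# Hypothetical subscriber: $20 per month, with two pass-through cost scenarios.
for cost in (12.0, 16.0):
    print(f"inference cost ${cost:.0f}: margin {gross_margin(20.0, cost):.0%}")

# Burn figures quoted in the text, in USD per month.
JUNE_BURN = 110_000
BURN_THRESHOLD = 200_000
print(f"June burn vs threshold: {JUNE_BURN / BURN_THRESHOLD:.0%}")
```

The loop illustrates the constraint the text describes: when the pass-through cost rises and the subscription price stays flat, the margin narrows one for one.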

Beneath the disputed ARR figures, the traction is real: around 136,000 wallet addresses, approximately 9.9 million website visits per month (about 330,000 per day), and new Pro subscriptions hovering around 1,400 per day. Venice is a real business, but it runs on thin margins constrained by the compute it buys. That is precisely why Hyperbolic sits one layer above it: if Venice is the gas station, Hyperbolic is the refinery. Venice buys compute from the same constrained supply everyone else relies on, while Hyperbolic aggregates and standardizes that fragmented supply and sells it to Venice and similar players. As inference demand grows, value accrues not only to the applications that consume compute but also to the layers that aggregate and route it and that collect the cost of revenue those applications pay.

Nvidia has restructured its reporting around the 'service token,' Cerebras' IPO showed that the market recognizes inference as the bottleneck, and Anthropic's scramble for capacity shows that the shortage is real. Agentic and physical AI will amplify demand by several orders of magnitude across cloud and edge. This closes the loop on the '$6 trillion problem' from another angle: Cahn's bearish logic, that overbuilding leads to oversupply, is likely to be validated.

Oversupply, however, is the ideal market for asset-light aggregation. When GPU prices fall and supply fragments across dozens of clouds, the player that holds no hardware but routes each workload to the cheapest available card earns the spread, while operators holding depreciating GPUs take the losses. Hyperbolic is long oversupply, not short. The ultimate winner will not be the company with the most GPUs, but the one that can tell you which GPUs are available where and at what price, and route each workload to wherever it runs at the lowest cost. Hyperbolic is building that company: pure software, three layers deep, designed to be the aggregation layer for inference compute.
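The routing-and-spread idea described above can be illustrated with a minimal sketch. It is not Hyperbolic's implementation: the `Offer` type, the provider names, the prices, and the selection rule (lowest hourly price among offers with enough free cards) are all hypothetical assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Offer:
    """One provider's listing for a GPU model (hypothetical structure)."""
    provider: str
    gpu_model: str
    price_per_hour: float  # USD per GPU-hour
    available: int         # number of free GPUs


def cheapest_offer(offers: list[Offer], gpu_model: str, gpus_needed: int) -> Optional[Offer]:
    """Return the lowest-priced offer that can fit the whole workload, if any."""
    candidates = [
        o for o in offers
        if o.gpu_model == gpu_model and o.available >= gpus_needed
    ]
    return min(candidates, key=lambda o: o.price_per_hour, default=None)


def hourly_spread(offer: Offer, customer_price_per_hour: float, gpus: int) -> float:
    """Aggregator's hourly spread: customer price minus supplier price, per GPU."""
    return (customer_price_per_hour - offer.price_per_hour) * gpus


# Hypothetical listings from three made-up clouds.
OFFERS = [
    Offer("cloud-a", "H100", 2.40, 16),
    Offer("cloud-b", "H100", 2.10, 4),   # cheapest, but too few free cards for 8
    Offer("cloud-c", "H100", 2.25, 32),
]

best = cheapest_offer(OFFERS, "H100", gpus_needed=8)
if best is not None:
    spread = hourly_spread(best, customer_price_per_hour=2.60, gpus=8)
    print(f"route to {best.provider} at ${best.price_per_hour:.2f}/GPU-hour")
    print(f"hourly spread: ${spread:.2f}")
```

A real router would also need to weigh interconnect, region, reliability, and the option of splitting a job across providers; this sketch deliberately ignores those to keep the spread logic visible.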

Disclaimer: Views are the author's own and do not represent the platform. Do not reproduce without permission. Content is for reference only, not investment advice. Trade at your own risk.
