The gloves came off Tuesday at VB Transform 2025 as alternative chip makers directly challenged Nvidia's dominance narrative during a panel on inference, exposing a fundamental contradiction: How can AI inference be a commoditized "factory" and still command 70% gross margins?
Jonathan Ross, CEO of Groq, didn't mince words when discussing Nvidia's carefully crafted messaging. "AI factory is just a marketing way to make AI sound less scary," Ross said during the panel. Sean Lie, CTO of competitor Cerebras, was equally direct: "I don't think Nvidia minds having all of the service providers fighting it out for every last penny while they're sitting there comfortable with 70 points."
Hundreds of billions of dollars in infrastructure investment and the future architecture of enterprise AI are at stake. For CISOs and AI leaders currently locked in weekly negotiations with OpenAI and other providers for more capacity, the panel exposed uncomfortable truths about why their AI initiatives keep hitting roadblocks.
The capacity crisis nobody talks about
"Anyone who's actually a big user of these gen AI models knows that you can go to OpenAI, or whoever it is, and they won't actually be able to serve you enough tokens," explained Dylan Patel, founder of SemiAnalysis. "There are weekly meetings between some of the biggest AI users and their model providers to try to convince them to allocate more capacity. Then there are weekly meetings between those model providers and their hardware suppliers."
Panel participants also pointed to the token shortage as exposing a fundamental flaw in the factory analogy. Traditional manufacturing responds to demand signals by adding capacity. But when enterprises require 10 times more inference capacity, they discover the supply chain can't flex. GPUs carry two-year lead times. Data centers need permits and power agreements. The infrastructure wasn't built for exponential scaling, forcing providers to ration access through API limits.
According to Patel, Anthropic jumped from $2 billion to $3 billion in ARR in just six months. Cursor went from essentially zero to $500 million ARR. OpenAI crossed $10 billion. Yet enterprises still can't get the tokens they need.
Why 'factory' thinking breaks AI economics
Jensen Huang's "AI factory" concept implies standardization, commoditization and efficiency gains that drive down costs. But the panel laid out three fundamental ways the metaphor breaks down:
First, inference isn't uniform. "Even today, for inference of, say, DeepSeek, there's various providers along the curve of kind of how fast they provide at what cost," Patel noted. DeepSeek serves its own model at the lowest cost but delivers only 20 tokens per second. "Nobody wants to use a model at 20 tokens a second. I talk faster than 20 tokens a second."
Second, quality varies wildly. Ross drew a historical parallel to Standard Oil: "When Standard Oil started, oil had varying quality. You could buy oil from one vendor and it might set your house on fire." Today's AI inference market faces similar quality differences, with providers using various techniques to cut costs that inadvertently compromise output quality.
Third, and most critically, the economics are inverted. "One of the things that's unusual about AI is that you can spend more to get better results," Ross explained. "You can't just have a software application, say, I'm going to spend twice as much to host my software, and applications get better."
When Ross mentioned that Mark Zuckerberg praised Groq for being "the only ones who launched it with the full quality," he inadvertently revealed the industry's quality crisis. This wasn't just recognition. It was an indictment of every other provider cutting corners.
Ross spelled out the mechanics: "A lot of people do a lot of tricks to reduce the quality, not intentionally, but to lower their cost, increase their speed." The techniques sound technical, but the impact is simple. Quantization reduces precision. Pruning removes parameters. Each optimization degrades model performance in ways enterprises may not detect until production fails.
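To make the trade-off concrete, here is a minimal sketch, assuming naive post-training int8 quantization of a randomly generated weight matrix rather than any provider's actual serving stack. It shows where the precision Ross refers to is given up: weights get rounded onto an 8-bit grid, and some numerical fidelity never comes back.

```python
import numpy as np

# Illustrative only: naive symmetric int8 post-training quantization of a
# synthetic weight matrix, not any inference provider's real pipeline.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=(4096, 4096)).astype(np.float32)

# Map float32 weights onto 8-bit integers with a single per-tensor scale.
scale = np.abs(weights).max() / 127.0
quantized = np.round(weights / scale).astype(np.int8)

# Dequantize and measure the rounding error the lower precision introduces.
reconstructed = quantized.astype(np.float32) * scale
mean_abs_error = float(np.abs(weights - reconstructed).mean())
print(f"per-tensor scale: {scale:.6f}")
print(f"mean absolute weight error: {mean_abs_error:.6f}")
```

In a real model that rounding error compounds across billions of parameters and many layers, which is why the degradation can stay invisible until an application fails in production.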
The Standard Oil parallel Ross drew illuminates the stakes. Today's inference market faces the same quality variance problem. Providers betting that enterprises won't notice the difference between 95% and 100% accuracy are betting against companies like Meta that have the sophistication to measure the degradation.
This creates immediate imperatives for enterprise buyers.
- Establish quality benchmarks before selecting providers.
- Audit current inference partners for undisclosed optimizations.
- Accept that premium pricing for full model fidelity is now a permanent market feature. The era of assuming functional equivalence across inference providers ended when Zuckerberg called out the difference.
The $1 million token paradox
The most revealing moment came when the panel discussed pricing. Lie highlighted an uncomfortable truth for the industry: "If those million tokens are as valuable as we believe they can be, right? That's not about moving words. You don't charge $1 for moving words. I pay my lawyer $800 an hour to write a two-page memo."
This observation cuts to the heart of AI's price discovery problem. The industry is racing to drive token costs below $1.50 per million while claiming those same tokens will transform every aspect of business. The panel's implicit consensus: the math doesn't add up.
"Pretty much everyone is spending, like all of these fast-growing startups, the amount that they're spending on tokens as a service almost matches their revenue one to one," Ross revealed. This 1:1 ratio of AI token spend to revenue represents an unsustainable business model, one that panel participants contend the "factory" narrative conveniently ignores.
Performance changes everything
Cerebras and Groq aren't just competing on price; they're competing on performance, fundamentally changing what's possible in terms of inference speed. "With the wafer-scale technology that we've built, we're enabling 10 times, sometimes 50 times, faster performance than even the fastest GPUs today," Lie said.
This isn't an incremental improvement. It enables entirely new use cases. "We have customers who have agentic workflows that might take 40 minutes, and they want these things to run in real time," Lie explained. "These things just aren't even possible, even if you're willing to pay top dollar."
The speed differential creates a bifurcated market that defies factory standardization. Enterprises that need real-time inference for customer-facing applications can't use the same infrastructure as those running overnight batch processes.
The real bottleneck: power and data centers
While everyone focuses on chip supply, the panel pointed to the actual constraint throttling AI deployment. "Data center capacity is a big problem. You can't really find data center space in the U.S.," Patel said. "Power is a big problem."
The infrastructure challenge goes beyond chip manufacturing to fundamental resource constraints. As Patel explained, "TSMC in Taiwan is able to make over $200 million worth of chips, right? It's not even… it's the speed at which they scale up is ridiculous."
But chip manufacturing means nothing without the infrastructure around it. "The reason we see these massive Middle East deals, and partially why both of these companies have big presences in the Middle East, is it's power," Patel said. The global scramble for compute has enterprises "going around the world to get wherever power does exist, wherever data center capacity exists, wherever there are electricians who can build these electrical systems."
Google's 'success disaster' becomes everyone's reality
Ross shared a telling anecdote from Google's history: "There was a term that became very popular at Google in 2015 called success disaster. Some of the teams had built AI applications that began to work better than human beings for the first time, and the demand for compute was so high, they were going to need to double or triple the global data center footprint quickly."
That pattern now repeats across every enterprise AI deployment. Applications either fail to gain traction or experience hockey-stick growth that immediately hits infrastructure limits. There is no middle ground, no smooth scaling curve of the kind factory economics would predict.
What this means for enterprise AI strategy
For CIOs, CISOs and AI leaders, the panel's revelations demand strategic recalibration:
Capacity planning requires new models. Traditional IT forecasting assumes linear growth. AI workloads break that assumption. When successful applications increase token consumption by 30% monthly, annual capacity plans become obsolete within quarters (the quick calculation after these recommendations shows how fast that compounds). Enterprises must shift from static procurement cycles to dynamic capacity management. Build contracts with burst provisions. Monitor usage weekly, not quarterly. Accept that AI scaling patterns resemble viral adoption curves, not traditional enterprise software rollouts.
Speed premiums are permanent. The idea that inference will commoditize to uniform pricing ignores the enormous performance gaps between providers. Enterprises need to budget for speed where it matters.
Architecture beats optimization. Groq and Cerebras aren't winning by doing GPUs better. They're winning by rethinking the fundamental architecture of AI compute. Enterprises that bet everything on GPU-based infrastructure may find themselves stuck in the slow lane.
Power infrastructure is strategic. The constraint isn't chips or software but kilowatts and cooling. Smart enterprises are already locking in power capacity and data center space for 2026 and beyond.
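The capacity-planning point above rests on simple compounding. A minimal sketch, assuming the 30% month-over-month growth figure holds steady (real workloads will vary), shows how quickly an annual plan falls behind:

```python
# Back-of-the-envelope compounding at 30% month-over-month token growth.
monthly_growth = 1.30
for months in (3, 6, 12):
    factor = monthly_growth ** months
    print(f"after {months:2d} months: {factor:5.1f}x baseline token consumption")

# after  3 months:   2.2x baseline token consumption
# after  6 months:   4.8x baseline token consumption
# after 12 months:  23.3x baseline token consumption
```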
The infrastructure reality enterprises can't ignore
The panel surfaced a fundamental truth: the AI factory metaphor isn't just wrong, it's dangerous. Enterprises building strategies around commodity inference pricing and standardized delivery are planning for a market that doesn't exist.
The real market operates on three brutal realities.
- Capacity scarcity creates power inversions, where providers dictate terms and enterprises beg for allocations.
- Quality variance, the difference between 95% and 100% accuracy, determines whether your AI applications succeed or fail catastrophically.
- Infrastructure constraints, not technology, set the binding limits on AI transformation.
The path forward for CISOs and AI leaders requires abandoning factory thinking entirely. Lock in power capacity now. Audit inference providers for hidden quality degradation. Build vendor relationships based on architectural advantages, not marginal cost savings. Most critically, accept that paying 70% margins for reliable, high-quality inference may be your smartest investment.
The alternative chip makers at Transform didn't just challenge Nvidia's narrative. They made clear that enterprises face a choice: pay for quality and performance, or join the weekly negotiation meetings. The panel's consensus was unambiguous: success requires matching specific workloads to appropriate infrastructure rather than pursuing one-size-fits-all solutions.