When OpenAI went down in December, one of TrueFoundry’s customers faced a crisis that had nothing to do with chatbots or content generation. The company uses large language models to help refill prescriptions. Every second of downtime meant thousands of dollars in lost revenue, and patients who couldn’t access their medications on time.
TrueFoundry, an enterprise AI infrastructure company, announced Wednesday a new product called TrueFailover designed to prevent exactly that scenario. The system automatically detects when AI providers experience outages, slowdowns, or quality degradation, then seamlessly reroutes traffic to backup models and regions before users notice anything went wrong.
"The challenge is that in the AI world, failover is not that simple," said Nikunj Bajaj, co-founder and chief executive of TrueFoundry, in an exclusive interview with VentureBeat. "When you move from one model to another, you also have to consider things like output quality, latency, and whether the prompt even works the same way. In many cases, the prompt needs to be adjusted in real-time to prevent results from degrading. That isn’t something most teams are set up to manage manually."
The announcement arrives at a pivotal moment for enterprise AI adoption. Companies have moved far beyond experimentation. AI now powers prescription refills at pharmacies, generates sales proposals, assists software developers, and handles customer support inquiries. When these systems fail, the consequences ripple through entire organizations.
Why enterprise AI systems remain dangerously dependent on single providers
Large language models from OpenAI, Anthropic, Google, and other providers have become essential infrastructure for thousands of businesses. But unlike traditional cloud services from Amazon Web Services or Microsoft Azure, which offer robust uptime guarantees backed by decades of operational experience, AI providers operate complex, resource-intensive systems that remain susceptible to unexpected failures.
"Major LLM providers experience outages, slowdowns, or latency spikes every few weeks or months, and we regularly see the downstream impact on businesses that rely on a single provider," Bajaj told VentureBeat.
The December OpenAI outage that affected TrueFoundry's pharmacy customer illustrates the stakes. "At their scale, even seconds of downtime can translate into thousands of dollars in lost revenue," Bajaj explained. "Beyond the economic impact, there’s also a human consequence when patients can’t access prescriptions on time. Because this customer had our failover solution in place, they were able to reroute requests to another model provider within minutes of detecting the outage. Without that setup, recovery would likely have taken hours."
The problem extends beyond complete outages. Partial failures, where a model slows down or produces lower-quality responses without going fully offline, can quietly destroy user experience and violate service-level agreements. These "slow but technically up" scenarios often prove more damaging than dramatic crashes because they evade traditional monitoring systems while steadily eroding performance.
Inside the technology that keeps AI applications online when providers fail
TrueFailover operates as a resilience layer on top of TrueFoundry's AI Gateway, which already processes more than 10 billion requests per month for Fortune 1000 companies. The system weaves together several interconnected capabilities into a unified safety net for enterprise AI.
At its core, the product enables multi-model failover by allowing enterprises to define primary and backup models across providers. If OpenAI becomes unavailable, traffic automatically shifts to Anthropic, Google's Gemini, Mistral, or self-hosted alternatives. The routing happens transparently, without requiring application teams to rewrite code or manually intervene.
The system extends this protection across geographic boundaries through multi-region and multi-cloud resilience. By distributing AI endpoints across zones and cloud providers, health-based routing can detect problems in specific regions and divert traffic to healthy alternatives. What would otherwise become a global incident transforms into an invisible infrastructure adjustment that users never perceive.
Perhaps most critically, TrueFailover employs degradation-aware routing that continuously monitors latency, error rates, and quality signals. "We look at a combination of signals that together indicate when a model's performance is starting to degrade," Bajaj explained. "Large language models are shared resources. Providers run the same model instance across many customers, so when demand spikes for one user or workload, it can affect everyone else using that model."
The system watches for rising response times, increasing error rates, and patterns suggesting instability. "Individually, none of these signals tell the full story," Bajaj said. "But taken together, they allow us to detect early signs that a model is slowing down or becoming unreliable. Those signals feed into an AI-driven system that can decide when and how to reroute traffic before users experience a noticeable drop in quality."
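One simple way to picture how several weak signals combine into a single routing decision is a rolling health score. The Python sketch below is an assumption-laden illustration, not the AI-driven system Bajaj describes: the `HealthWindow` class, the weights, the baseline latency, and the threshold are all made up for the example.

```python
from collections import deque
from statistics import mean


class HealthWindow:
    """Rolling window of recent requests for one model endpoint."""

    def __init__(self, size: int = 50, baseline_latency_s: float = 1.0):
        self.latencies = deque(maxlen=size)   # seconds per request
        self.errors = deque(maxlen=size)      # 1 = failed, 0 = succeeded
        self.baseline_latency_s = baseline_latency_s

    def record(self, latency_s: float, ok: bool) -> None:
        self.latencies.append(latency_s)
        self.errors.append(0 if ok else 1)

    def score(self) -> float:
        """Combine slowdown and error rate into one score (0 = healthy)."""
        if not self.latencies:
            return 0.0
        # Fractional slowdown relative to the expected latency, never negative.
        slowdown = max(0.0, mean(self.latencies) / self.baseline_latency_s - 1.0)
        error_rate = mean(self.errors)
        # Assumed weights: errors count more heavily than slowness.
        return 0.5 * slowdown + 2.0 * error_rate

    def degraded(self, threshold: float = 1.0) -> bool:
        """True when the combined score crosses the (assumed) threshold."""
        return self.score() >= threshold
```

With these illustrative weights, a model that is twice as slow as usual, or one that fails 30 percent of requests, stays below the threshold on its own; the two together cross it, mirroring the point that no single signal tells the full story.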
Strategic caching rounds out the protection by shielding providers from sudden traffic spikes and preventing rate-limit cascades during high-demand periods. This allows systems to absorb demand surges and provider limits without brownouts or throttling surprises.
The approach represents a fundamental shift in how enterprises should think about AI reliability. "TrueFailover is designed to handle that complexity automatically," Bajaj said. "It continuously monitors how models behave across many customers and use cases, looks for early warning signs like rising latency, and takes action before things break. Most individual enterprises don’t have that kind of visibility because they’re only able to see their own systems."
The engineering challenge of switching models without sacrificing output quality
One of the thorniest challenges in AI failover involves maintaining consistent output quality when switching between models. A prompt optimized for GPT-5 may produce different results on Claude or Gemini. TrueFoundry addresses this through several mechanisms that balance speed against precision.
"Some teams rely on the fact that large models have become good enough that small differences in prompts don’t materially affect the output," Bajaj explained. "In those cases, switching from one provider to another can happen with some visible impact. That's not ideal, but some teams choose to do it."
More sophisticated implementations maintain provider-specific prompts for the same application. "When traffic shifts from one model to another, the prompt shifts with it," Bajaj said. "In that case, failover is not just switching models. It’s switching to a configuration that has already been tested."
TrueFailover automates this process. The system dynamically routes requests and adjusts prompts based on which model handles the query, keeping quality within acceptable ranges without manual intervention. The key, Bajaj emphasized, is that "failover is planned, not reactive. The logic, prompts, and guardrails are defined ahead of time, which is why end users often don’t notice when a switch happens."
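The "prompt shifts with the model" pattern can be sketched in a few lines of Python. This is a hedged illustration rather than TrueFailover's actual mechanism: the provider labels, the prompt templates, and `build_request` are hypothetical, and the templates stand in for prompts a team would have tested against each model in advance.

```python
# Pre-tested prompt templates keyed by provider label (all hypothetical).
PROMPTS = {
    "provider_a": (
        "You are a pharmacy assistant. Answer in one short sentence.\n\n"
        "Question: {question}"
    ),
    "provider_b": (
        "<role>pharmacy assistant</role>\n"
        "<rules>one short sentence</rules>\n"
        "<question>{question}</question>"
    ),
}


def build_request(provider: str, question: str) -> dict:
    """Pair the chosen provider with the prompt that was tested for it."""
    template = PROMPTS.get(provider)
    if template is None:
        # Refuse to improvise: only providers with a tested prompt are usable.
        raise KeyError(f"no tested prompt for provider {provider!r}")
    return {"provider": provider, "prompt": template.format(question=question)}
```

Raising an error for a provider without a tested template reflects the "planned, not reactive" point: a failover target is only usable if its configuration was defined ahead of time.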
Importantly, many failover scenarios don’t require changing providers at all. "It can be routing traffic from the same model in one region to another region, such as from the East Coast to the West Coast, where no prompt changes are required," Bajaj noted. This geographic flexibility provides a first line of defense before more complex cross-provider switches become necessary.
How regulated industries can use AI failover without compromising compliance
For enterprises in healthcare, financial services, and other regulated sectors, the prospect of AI traffic automatically routing to different providers raises immediate compliance concerns. Patient data can’t simply flow to whichever model happens to be available. Financial records require strict controls over where they travel. TrueFoundry built explicit guardrails to address these constraints.
"TrueFailover will never route data to a model or provider that an enterprise has not explicitly approved," Bajaj said. "Everything is controlled through an admin configuration layer where teams set clear guardrails upfront."
Enterprises define exactly which models qualify for failover, which providers can receive traffic, and even which regions or model categories (such as closed-source versus open-source) are acceptable. Once these rules take effect, TrueFailover operates only within them.
"If a model is not on the approved list, it’s simply not an option for routing," Bajaj emphasized. "There is no scenario where traffic is automatically sent somewhere unexpected. The idea is to give teams full control over compliance and data boundaries, while still allowing the system to respond quickly when something goes wrong. That way, reliability improves without compromising security or regulatory requirements."
This design reflects lessons learned from TrueFoundry's existing enterprise deployments. A Fortune 50 healthcare company already uses the platform to handle more than 500 million IVR calls annually through an agentic AI system. That customer required the ability to run workloads across both cloud and on-premise infrastructure while maintaining strict data residency controls, exactly the kind of hybrid environment where failover policies must be precisely defined.
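The approval rules described in this section amount to an allowlist filter applied before any routing decision. The Python sketch below shows that idea under stated assumptions: `Policy`, `Candidate`, `eligible_targets`, and every field name are hypothetical and do not reflect TrueFoundry's admin configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One possible failover destination (all fields are illustrative)."""
    provider: str
    model: str
    region: str
    open_source: bool


@dataclass(frozen=True)
class Policy:
    """Guardrails an admin sets up front; anything not listed is excluded."""
    approved_models: frozenset
    approved_providers: frozenset
    approved_regions: frozenset
    allow_open_source: bool = False

    def permits(self, c: Candidate) -> bool:
        return (
            c.model in self.approved_models
            and c.provider in self.approved_providers
            and c.region in self.approved_regions
            and (self.allow_open_source or not c.open_source)
        )


def eligible_targets(policy: Policy, candidates: list) -> list:
    """Keep only the candidates the policy explicitly allows."""
    return [c for c in candidates if policy.permits(c)]
```

Filtering first and routing second means an unapproved model never enters the pool a router chooses from, which matches the statement that such a model is "simply not an option."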
Where automatic failover can’t help and what enterprises must plan for
TrueFoundry acknowledges that TrueFailover can’t solve every reliability problem. The system operates within the guardrails enterprises configure, and those configurations determine what protection is possible.
"If a team allows failover from a large, high-capacity model to a much smaller model without adjusting prompts or expectations, TrueFailover can’t guarantee the same output quality," Bajaj explained. "The system can route traffic, but it can’t make a smaller model behave like a larger one without appropriate configuration."
Infrastructure constraints also limit protection. If an enterprise hosts its own models and they all run on the same GPU cluster, TrueFailover can’t help when that infrastructure fails. "When there is no alternate infrastructure available, there’s nothing to fail over to," Bajaj said.
The question of simultaneous multi-provider failures often surfaces in enterprise risk discussions. Bajaj argues this scenario, while theoretically possible, rarely matches reality. "In practice, 'going down' usually doesn’t mean an entire provider is offline across all models and regions," he explained. "What happens much more often is a slowdown or disruption in a specific model or region because of traffic spikes or capacity issues."
When that occurs, failover can happen at multiple levels: from on-premise to cloud, cloud to on-premise, one region to another, one model to another, and even within the same provider before switching providers entirely. "That alone makes it unlikely that everything fails at once," Bajaj said. "The key point is that reliability is built on layers of redundancy. The more providers, regions, and models that are included in the guardrails, the smaller the chance that users experience a complete outage."
A startup that built its platform inside Fortune 500 AI deployments
TrueFoundry has established itself as infrastructure for some of the world's largest AI deployments, providing crucial context for its failover ambitions. The company raised $19 million in Series A funding in February 2025, led by Intel Capital with participation from Eniac Ventures, Peak XV Partners, and Jump Capital. Angel investors including Gokul Rajaram and Mohit Aron also joined the round, bringing total funding to $21 million.
The San Francisco-based company was founded in 2021 by Bajaj and co-founders Abhishek Choudhary and Anuraag Gutgutia, all former Meta engineers who met as classmates at IIT Kharagpur. Initially focused on accelerating machine learning deployments, TrueFoundry pivoted to support generative AI capabilities as the technology went mainstream in 2023.
The company's customer roster demonstrates enterprise-scale adoption that few AI infrastructure startups can match. Nvidia employs TrueFoundry to build multi-agent systems that optimize GPU cluster utilization across data centers worldwide, a use case where even small improvements in utilization translate into substantial business impact given the insatiable demand for GPU capacity. Adopt AI routes more than 15 million requests and 40 billion input tokens through TrueFoundry's AI Gateway to power its enterprise agentic workflows.
Gaming company Games 24×7 serves machine learning models to more than 100 million users through the platform at scales exceeding 200 requests per second. Digital adoption platform Whatfix migrated to a microservices architecture on TrueFoundry, reducing its release cycle sixfold and cutting testing time by 40 percent.
TrueFoundry currently reports more than 30 paid customers worldwide and has indicated it exceeded $1.5 million in annual recurring revenue last year while quadrupling its customer base. The company manages more than 1,000 clusters for machine learning workloads across its client base.
TrueFailover will be offered as an add-on module on top of the existing TrueFoundry AI Gateway and platform, with pricing following a usage-based model tied to traffic volume along with the number of users, models, providers, and regions involved. An early access program for design partners opens in the coming weeks.
Why traditional cloud uptime guarantees may never apply to AI providers
Enterprise technology buyers have long demanded uptime commitments from infrastructure providers. Amazon Web Services, Microsoft Azure, and Google Cloud all offer service-level agreements with financial penalties for failures. Will AI providers eventually face similar expectations?
Bajaj sees fundamental constraints that make traditional SLAs difficult to achieve in the current generation of AI infrastructure. "Most foundational LLMs today operate as shared resources, which is what enables the standard pricing you see publicly advertised," he explained. "Providers do offer higher uptime commitments, but that usually means dedicated capacity or reserved infrastructure, and the cost increases significantly."
Even with substantial budgets, enterprises face usage quotas that create unexpected exposure. "If traffic spikes beyond those limits, requests can still spill back into shared infrastructure," Bajaj said. "That makes it hard to achieve the kind of hard guarantees enterprises are used to with cloud providers."
The economics of running large language models create additional barriers that may persist for years. "LLMs are still extremely complex and expensive to run. They require massive infrastructure and energy, and we don’t expect a near-term future where most companies run multiple, fully dedicated model instances just to guarantee uptime."
This reality drives demand for solutions like TrueFailover that provide resilience regardless of what individual providers can promise. "Enterprises are realizing that reliability can’t come from the model provider alone," Bajaj said. "It requires additional layers of protection to handle the realities of how these systems operate today."
The new calculus for companies that built AI into critical business processes
The timing of TrueFoundry's announcement reflects a fundamental shift in how enterprises use AI, and in what they stand to lose when it fails. What began as internal experimentation has evolved into customer-facing applications where disruptions directly affect revenue and reputation.
"Many enterprises experimented with Gen AI and agentic systems in the past, and production use cases were largely internal-facing," Bajaj observed. "There was no immediate impact on their top line or the public perception of the business."
That era has ended. "Now that these enterprises have launched public-facing applications, where both the top line and public perception can be impacted if an outage occurs, the stakes are much higher than they were even six months ago. That's why we’re seeing more and more attention on this now."
For companies that have woven AI into critical business processes, from prescription refills to customer support to sales operations, the calculus has changed entirely. The question is no longer which model performs best on benchmarks or which provider offers the most compelling features. The question that now keeps technology leaders awake is far simpler and far more urgent: what happens when the AI disappears at the worst possible moment?
Somewhere, a pharmacist is filling a prescription. A customer support agent is resolving a complaint. A sales team is generating a proposal for a deal that closes tomorrow. All of them depend on AI systems that depend on providers that, despite their scale and sophistication, still go dark without warning.
TrueFoundry is betting that enterprises will pay handsomely to ensure those moments of darkness never reach the people who matter most: their customers.