It seems that nearly every week in the two years since ChatGPT launched, new large language models (LLMs) have been released by rival labs or by OpenAI itself. Enterprises are hard pressed to keep up with the pace of change, let alone understand how to adapt to it: which of these new models should they adopt, if any, to power their workflows and the custom AI agents they’re building to carry them out?

Help has arrived: AI application observability startup Raindrop has launched Experiments, a new analytics feature the company describes as the first A/B testing suite designed specifically for enterprise AI agents. It lets companies see and compare how updating agents to new underlying models, or changing their instructions and tool access, will affect their performance with real end users.

The release extends Raindrop’s existing observability tools, giving developers and teams a way to see how their agents behave and evolve in real-world conditions.

With Experiments, teams can track how changes, such as a new tool, prompt, model update, or full pipeline refactor, affect AI performance across millions of user interactions. The new feature is available now for users on Raindrop’s Pro subscription plan ($350 monthly) at raindrop.ai.
A Data-Driven Lens on Agent Development

Raindrop co-founder and chief technology officer Ben Hylak noted in a product announcement video (above) that Experiments helps teams see “how really anything changed,” including tool usage, user intents, and issue rates, and to explore differences by demographic factors such as language. The goal is to make model iteration more transparent and measurable.

The Experiments interface presents results visually, showing when an experiment performs better or worse than its baseline. Increases in negative signals might indicate higher task failure or partial code output, while improvements in positive signals could reflect more complete responses or better user experiences.

By making this data easy to interpret, Raindrop encourages AI teams to approach agent iteration with the same rigor as modern software deployment: tracking outcomes, sharing insights, and addressing regressions before they compound.
Background: From AI Observability to Experimentation

Raindrop’s launch of Experiments builds on the company’s foundation as one of the first AI-native observability platforms, designed to help enterprises monitor and understand how their generative AI systems behave in production.

As VentureBeat reported earlier this year, the company (initially known as Dawn AI) emerged to address what Hylak, a former Apple human interface designer, called the “black box problem” of AI performance, helping teams catch failures “as they happen and explain to enterprises what went wrong and why.”

At the time, Hylak described how “AI products fail constantly, in ways both hilarious and terrifying,” noting that unlike traditional software, which throws clear exceptions, “AI products fail silently.” Raindrop’s original platform focused on detecting these silent failures by analyzing signals such as user feedback, task failures, refusals, and other conversational anomalies across millions of daily events.

The company’s co-founders, Hylak, Alexis Gauba, and Zubin Singh Koticha, built Raindrop after encountering firsthand the difficulty of debugging AI systems in production.

“We started by building AI products, not infrastructure,” Hylak told VentureBeat. “But pretty quickly, we saw that to grow anything serious, we needed tooling to understand AI behavior, and that tooling didn’t exist.”

With Experiments, Raindrop extends that same mission from detecting failures to measuring improvements. The new tool turns observability data into actionable comparisons, letting enterprises test whether changes to their models, prompts, or pipelines actually make their AI agents better, or just different.
Solving the “Evals Pass, Agents Fail” Problem

Traditional evaluation frameworks, while useful for benchmarking, rarely capture the unpredictable behavior of AI agents operating in dynamic environments.

As Raindrop co-founder Alexis Gauba explained in her LinkedIn announcement, “Traditional evals don’t really answer this question. They’re great unit tests, but you can’t predict your user’s actions and your agent is running for hours, calling hundreds of tools.”

Gauba said the company repeatedly heard a common frustration from teams: “Evals pass, agents fail.”

Experiments is meant to close that gap by showing what actually changes when developers ship updates to their systems.

The tool allows side-by-side comparisons of models, tools, intents, or properties, surfacing measurable differences in behavior and performance.
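To make that concrete, here is a minimal, hypothetical Python sketch of what such a side-by-side comparison boils down to: splitting production events into a baseline and a candidate cohort and computing per-signal rate deltas. The event fields and signal names are illustrative assumptions, not Raindrop’s API or schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class AgentEvent:
    variant: str   # cohort label, e.g. "baseline" or "new-model" (assumed field)
    signal: str    # observed signal, e.g. "task_failure", "user_frustration", "ok"


def signal_rates(events: list[AgentEvent], variant: str) -> dict[str, float]:
    """Share of a cohort's events that carry each signal."""
    cohort = [e for e in events if e.variant == variant]
    counts = Counter(e.signal for e in cohort)
    total = len(cohort) or 1
    return {sig: n / total for sig, n in counts.items()}


def compare(events: list[AgentEvent], baseline: str, candidate: str) -> dict[str, float]:
    """Per-signal delta; positive values mean the candidate shows more of that signal."""
    base = signal_rates(events, baseline)
    cand = signal_rates(events, candidate)
    return {sig: cand.get(sig, 0.0) - base.get(sig, 0.0) for sig in set(base) | set(cand)}


# Example: a rise in compare(events, "baseline", "new-model")["task_failure"]
# flags a regression worth investigating before the rollout continues.
```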
Designed for Real-World AI Behavior

In the announcement video, Raindrop described Experiments as a way to “compare anything and measure how your agent’s behavior actually changed in production across millions of real interactions.”

The platform helps users spot issues such as task failure spikes, forgetting, or new tools that trigger unexpected errors.

It can also be used in reverse: starting from a known problem, such as an “agent stuck in a loop,” and tracing back to which model, tool, or flag is driving it.

From there, developers can dive into detailed traces to find the root cause and ship a fix quickly.

Each experiment provides a visual breakdown of metrics like tool usage frequency, error rates, conversation duration, and response length.

Users can click on any comparison to access the underlying event data, giving them a clear view of how agent behavior changed over time. Shared links make it easy to collaborate with teammates or report findings.
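The reverse workflow described above, starting from a known problem and tracing it back to a model, tool, or flag, is essentially an over-representation analysis. The sketch below is purely illustrative (the dictionary keys are assumptions, not Raindrop’s event schema): it ranks the values of one dimension by how much more often they appear among failing events than in overall traffic.

```python
from collections import Counter


def overrepresented(failing: list[dict], all_events: list[dict], key: str) -> list[tuple[str, float]]:
    """Rank values of `key` (e.g. "model", "tool", "flag") by lift:
    their share among failures divided by their share in overall traffic."""
    fail_counts = Counter(e[key] for e in failing)
    all_counts = Counter(e[key] for e in all_events)
    n_fail = len(failing) or 1
    n_all = len(all_events) or 1
    lift = {
        value: (fail_counts[value] / n_fail) / ((all_counts[value] / n_all) or 1e-9)
        for value in fail_counts
    }
    return sorted(lift.items(), key=lambda kv: kv[1], reverse=True)


# Example: overrepresented(stuck_in_loop_events, all_events, key="model")
# surfaces the model most associated with the "agent stuck in a loop" issue.
```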
Integration, Scalability, and Accuracy

According to Hylak, Experiments integrates directly with “the feature flag platforms companies know and love (like Statsig!)” and is designed to work seamlessly with existing telemetry and analytics pipelines.

For companies without those integrations, it can still compare performance over time, such as yesterday versus today, without additional setup.

Hylak said teams typically need around 2,000 users per day to produce statistically meaningful results.

To ensure the accuracy of comparisons, Experiments monitors for sample size adequacy and alerts users if a test lacks enough data to draw valid conclusions.
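The sample-size question is standard statistics rather than anything Raindrop-specific; one simple way to reason about it is a two-proportion z-test on a rate such as task failure. The sketch below is an assumption about the general method, not the company’s implementation, and it illustrates why cohorts on the order of the 2,000 users per day Hylak mentions can detect differences of a few percentage points, while much smaller cohorts often cannot.

```python
from math import sqrt
from statistics import NormalDist


def failure_rate_significant(fail_a: int, n_a: int, fail_b: int, n_b: int,
                             alpha: float = 0.05) -> bool:
    """Two-proportion z-test: is the difference in failure rates between
    cohorts A and B statistically significant at level alpha?"""
    p_a, p_b = fail_a / n_a, fail_b / n_b
    pooled = (fail_a + fail_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return False
    z = abs(p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(z))
    return p_value < alpha


# With ~2,000 users per cohort, a jump in task-failure rate from 10% to 13%
# clears the 5% significance bar; with 200 users per cohort it usually does not.
print(failure_rate_significant(200, 2000, 260, 2000))  # True
print(failure_rate_significant(20, 200, 26, 200))      # False
```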
“We obsess over making sure metrics like Task Failure and User Frustration are metrics that you’d wake up an on-call engineer for,” Hylak explained. He added that teams can drill into the specific conversations or events that drive these metrics, ensuring transparency behind every aggregate number.
Security and Data Protection

Raindrop operates as a cloud-hosted platform but also offers on-premise personally identifiable information (PII) redaction for enterprises that need more control.

Hylak said the company is SOC 2 compliant and has launched a PII Guard feature that uses AI to automatically remove sensitive information from stored data. “We take protecting customer data very seriously,” he emphasized.
Pricing and Plans

Experiments is part of Raindrop’s Pro plan, which costs $350 per month or $0.0007 per interaction. The Pro tier also includes deep analysis tools, issue clustering, custom issue tracking, and semantic search capabilities.

Raindrop’s Starter plan, at $65 per month or $0.001 per interaction, offers core analytics including issue detection, user feedback signals, Slack alerts, and user tracking. Both plans come with a 14-day free trial.

Larger organizations can opt for an Enterprise plan with custom pricing and advanced features like SSO login, custom alerts, integrations, edge PII redaction, and priority support.
Continuous Improvement for AI Systems

With Experiments, Raindrop positions itself at the intersection of AI analytics and software observability. Its focus on “measuring reality,” as stated in the product video, reflects a broader push within the industry toward accountability and transparency in AI operations.

Rather than relying solely on offline benchmarks, Raindrop’s approach emphasizes real user data and contextual understanding. The company hopes this will allow AI developers to move faster, identify root causes sooner, and ship better-performing models with confidence.