OpenAGI emerges from stealth with an AI agent that it claims crushes OpenAI and Anthropic

Contents

Why university researchers built a tougher benchmark to test AI agents, and what they discovered
How OpenAGI trained its AI to take actions instead of just generating text
Unlike browser-only competitors, Lux can control Slack, Excel, and other desktop applications
What happens when you ask an AI agent to copy your bank details
The MIT researcher who built two of GitHub's most downloaded AI models
Inside the billion-dollar race to build AI that controls your computer

A stealth artificial intelligence startup founded by an MIT researcher emerged this morning with an ambitious claim: its new AI model can control computers better than systems built by OpenAI and Anthropic, at a fraction of the cost.

OpenAGI, led by chief executive Zengyi Qin, released Lux, a foundation model designed to operate computers autonomously by interpreting screenshots and executing actions across desktop applications. The San Francisco-based company says Lux achieves an 83.6 percent success rate on Online-Mind2Web, a benchmark that has become the industry's most rigorous test for evaluating AI agents that control computers.

That score is a significant leap over the leading models from well-funded competitors. OpenAI's Operator, released in January, scores 61.3 percent on the same benchmark. Anthropic's Claude Computer Use achieves 56.3 percent.

"Traditional LLM training feeds a large amount of text corpus into the model. The model learns to produce text," Qin said in an exclusive interview with VentureBeat. "By contrast, our model learns to produce actions. The model is trained with a large amount of computer screenshots and action sequences, allowing it to produce actions to control the computer."

The announcement arrives at a pivotal moment for the AI industry. Technology giants and startups alike have poured billions of dollars into developing autonomous agents capable of navigating software, booking travel, filling out forms, and executing complex workflows. OpenAI, Anthropic, Google, and Microsoft have all released or announced agent products in the past year, betting that computer-controlling AI will become as transformative as chatbots.

Yet independent research has cast doubt on whether current agents are as capable as their creators suggest.

Why university researchers built a tougher benchmark to test AI agents, and what they discovered

The Online-Mind2Web benchmark, developed by researchers at Ohio State University and the University of California, Berkeley, was designed specifically to expose the gap between marketing claims and actual performance.

Published in April and accepted to the Conference on Language Modeling 2025, the benchmark includes 300 diverse tasks across 136 real websites, everything from booking flights to navigating complex e-commerce checkouts. Unlike earlier benchmarks that cached parts of websites, Online-Mind2Web tests agents in live online environments where pages change dynamically and unexpected obstacles appear.

The results, according to the researchers, painted "a very different picture of the competency of current agents, suggesting over-optimism in previously reported results."

When the Ohio State team tested five leading web agents with careful human evaluation, they found that many recent systems, despite heavy investment and marketing fanfare, did not outperform SeeAct, a relatively simple agent released in January 2024. Even OpenAI's Operator, the best performer among commercial offerings in their study, achieved only 61 percent success.

"It seemed that highly capable and practical agents were maybe indeed just months away," the researchers wrote in a blog post accompanying their paper. "However, we are also well aware that there are still many fundamental gaps in research to fully autonomous agents, and current agents are probably not as competent as the reported benchmark numbers may depict."

The benchmark has gained traction as an industry standard, with a public leaderboard hosted on Hugging Face tracking submissions from research groups and companies.
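For reference, the gaps between the Online-Mind2Web success rates quoted in this article can be worked out directly. The short calculation below is only an illustration: the figures are the ones reported here, while the dictionary and function names are made up for the example.

```python
# Success rates on Online-Mind2Web as reported in this article (percent).
reported_scores = {
    "Lux": 83.6,
    "OpenAI Operator": 61.3,
    "Claude Computer Use": 56.3,
}


def lead_over(baseline: str, challenger: str = "Lux") -> tuple[float, float]:
    """Return (percentage-point gap, relative improvement in percent)."""
    gap = reported_scores[challenger] - reported_scores[baseline]
    relative = gap / reported_scores[baseline] * 100
    return round(gap, 1), round(relative, 1)


for rival in ("OpenAI Operator", "Claude Computer Use"):
    points, relative = lead_over(rival)
    print(f"Lux vs {rival}: +{points} points ({relative}% relative)")
```

On the reported numbers, Lux's claimed lead is about 22.3 percentage points over Operator and about 27.3 over Claude Computer Use.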

How OpenAGI trained its AI to take actions instead of just generating text

OpenAGI's claimed performance advantage stems from what the company calls "Agentic Active Pre-training," a training methodology that differs fundamentally from how most large language models learn.

Conventional language models train on massive text corpora, learning to predict the next word in a sequence. The resulting systems excel at generating coherent text but were not designed to take actions in graphical environments.

Lux, according to Qin, takes a different approach. The model trains on computer screenshots paired with action sequences, learning to interpret visual interfaces and determine which clicks, keystrokes, and navigation steps will accomplish a given goal.

"The action allows the model to actively explore the computer environment, and such exploration generates new knowledge, which is then fed back to the model for training," Qin told VentureBeat. "This is a naturally self-evolving process, where a better model produces better exploration, better exploration produces better knowledge, and better knowledge leads to a better model."

This self-reinforcing training loop, if it functions as described, could help explain how a smaller team could achieve results that elude larger organizations. Rather than requiring ever-larger static datasets, the approach would allow the model to continuously improve by generating its own training data through exploration.
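The cycle Qin describes (observe the screen, act, collect what worked, train on it, repeat) can be illustrated with a deliberately tiny simulation. This is a toy sketch, not OpenAGI's Agentic Active Pre-training: the "model" is a lookup table rather than a neural network, the "screenshots" are short strings, and every name in it (`Step`, `explore`, `train`, `self_improve`, the three-screen login task) is hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Step:
    screenshot: str   # stand-in for pixels: a short label for the screen
    action: str       # e.g. "click:submit"


# Toy task: on each screen, exactly one action makes progress.
CORRECT_ACTION = {
    "login_page": "type:username",
    "username_filled": "type:password",
    "password_filled": "click:submit",
}
NEXT_SCREEN = {
    "login_page": "username_filled",
    "username_filled": "password_filled",
    "password_filled": "done",
}
ALL_ACTIONS = sorted(set(CORRECT_ACTION.values()))


def act(policy: dict, screenshot: str, rng: random.Random) -> str:
    """Hypothetical model: use what it has learned, otherwise guess."""
    return policy.get(screenshot) or rng.choice(ALL_ACTIONS)


def explore(policy: dict, rng: random.Random, max_steps: int = 6) -> tuple[list[Step], bool]:
    """Run one episode: observe the screen, pick an action, apply it."""
    screen, steps = "login_page", []
    for _ in range(max_steps):
        action = act(policy, screen, rng)
        if action == CORRECT_ACTION[screen]:
            steps.append(Step(screen, action))  # record only steps that made progress
            screen = NEXT_SCREEN[screen]
            if screen == "done":
                return steps, True
    return steps, False


def train(policy: dict, steps: list[Step]) -> None:
    """Feed exploration back: remember the action that worked on each screen."""
    for step in steps:
        policy[step.screenshot] = step.action


def self_improve(rounds: int, episodes: int = 5, seed: int = 0) -> list[float]:
    """Alternate exploration and training; return the success rate per round."""
    rng = random.Random(seed)
    policy: dict = {}
    success_rates = []
    for _ in range(rounds):
        results = [explore(policy, rng) for _ in range(episodes)]
        success_rates.append(sum(ok for _, ok in results) / episodes)
        for steps, _ in results:
            train(policy, steps)
    return success_rates


print(self_improve(rounds=4))
```

In this toy, training only ever adds correct entries to the table, so the per-round success rate can never fall, and once any episode succeeds every later round succeeds. Whether the same dynamic holds for a real screenshot-to-action model at scale is exactly what the article notes remains unverified.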

OpenAGI also claims significant cost advantages. The company says Lux operates at roughly one-tenth the cost of frontier models from OpenAI and Anthropic while executing tasks faster.

Unlike browser-only competitors, Lux can control Slack, Excel, and other desktop applications

A critical distinction in OpenAGI's announcement: Lux can control applications across an entire desktop operating system, not just web browsers.

Most commercially available computer-use agents, including early versions of Anthropic's Claude Computer Use, focus primarily on browser-based tasks. That limitation excludes vast categories of productivity work that occur in desktop applications: spreadsheets in Microsoft Excel, communications in Slack, design work in Adobe products, code editing in development environments.

OpenAGI says Lux can navigate these native applications, a capability that would substantially expand the addressable market for computer-use agents. The company is releasing a software development kit alongside the model, allowing third parties to build applications on top of Lux.

The company is also working with Intel to optimize Lux for edge devices, which would allow the model to run locally on laptops and workstations rather than requiring cloud infrastructure. That partnership could address enterprise concerns about sending sensitive screen data to external servers.

"We are partnering with Intel to optimize our model on edge devices, which will make it the best on-device computer-use model," Qin said.

The company confirmed it is in exploratory discussions with AMD and Microsoft about additional partnerships.

What happens when you ask an AI agent to copy your bank details

Computer-use agents present novel safety challenges that do not arise with conventional chatbots. An AI system capable of clicking buttons, entering text, and navigating applications could, if misdirected, cause significant harm: transferring money, deleting files, or exfiltrating sensitive information.

OpenAGI says it has built safety mechanisms directly into Lux. When the model encounters requests that violate its safety policies, it refuses to proceed and alerts the user.

In an example provided by the company, when a user asked the model to "copy my bank details and paste it into a new Google doc," Lux responded with an internal reasoning step: "The user asks me to copy the bank details, which are sensitive information. Based on the safety policy, I am not able to perform this action." The model then issued a warning to the user rather than executing the potentially dangerous request.
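The general pattern in that example, checking a request against a policy before any action is executed and warning the user on a violation, can be sketched in a few lines. This is a simplistic stand-in built on assumptions: Lux's refusal appears to come from the model's own reasoning, whereas the sketch below uses a plain keyword list, and `SENSITIVE_TERMS`, `Decision`, and `check_request` are hypothetical names.

```python
from dataclasses import dataclass

# Hypothetical list of sensitive topics; a real policy would be far richer.
SENSITIVE_TERMS = ("bank details", "password", "social security number")


@dataclass
class Decision:
    allowed: bool
    message: str


def check_request(request: str) -> Decision:
    """Refuse before acting if the request mentions a sensitive term."""
    lowered = request.lower()
    for term in SENSITIVE_TERMS:
        if term in lowered:
            return Decision(
                False,
                f"Refused: the request involves {term}, which is sensitive information.",
            )
    return Decision(True, "OK to proceed.")


print(check_request("Copy my bank details and paste it into a new Google doc"))
print(check_request("Open the quarterly report in Excel"))
```

A keyword filter like this is easy to evade by rephrasing, which is one reason independent testing of any agent's safeguards matters.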

Such safeguards will face intense scrutiny as computer-use agents proliferate. Security researchers have already demonstrated prompt injection attacks against early agent systems, where malicious instructions embedded in websites or documents can hijack an agent's behavior. Whether Lux's safety mechanisms can withstand adversarial attacks remains to be tested by independent researchers.

The MIT researcher who built two of GitHub's most downloaded AI models

Qin brings an unusual combination of academic credentials and entrepreneurial experience to OpenAGI.

He completed his doctorate at the Massachusetts Institute of Technology in 2025, where his research focused on computer vision, robotics, and machine learning. His academic work appeared in top venues including the Conference on Computer Vision and Pattern Recognition, the International Conference on Learning Representations, and the International Conference on Machine Learning.

Before founding OpenAGI, Qin built several widely adopted AI systems. JetMoE, a large language model whose development he led, demonstrated that a high-performing model could be trained from scratch for less than $100,000, a fraction of the tens of millions typically required. The model outperformed Meta's LLaMA2-7B on standard benchmarks, according to a technical report that attracted attention from MIT's Computer Science and Artificial Intelligence Laboratory.

His earlier open-source projects achieved remarkable adoption. OpenVoice, a voice cloning model, amassed roughly 35,000 stars on GitHub and ranked in the top 0.03 percent of open-source projects by popularity. MeloTTS, a text-to-speech system, has been downloaded more than 19 million times, making it one of the most widely used audio AI models since its 2024 release.

Qin also co-founded MyShell, an AI agent platform that has attracted six million users who have collectively built more than 200,000 AI agents. Users have had more than one billion interactions with agents on the platform, according to the company.

Inside the billion-dollar race to build AI that controls your computer

The computer-use agent market has attracted intense interest from investors and technology giants over the past year.

OpenAI released Operator in January, allowing users to instruct an AI to complete tasks across the web. Anthropic has continued developing Claude Computer Use, positioning it as a core capability of its Claude model family. Google has incorporated agent features into its Gemini products. Microsoft has integrated agent capabilities across its Copilot offerings and Windows.

Yet the market remains nascent. Enterprise adoption has been limited by concerns about reliability, security, and the ability to handle edge cases that occur frequently in real-world workflows. The performance gaps revealed by benchmarks like Online-Mind2Web suggest that current systems may not be ready for mission-critical applications.

OpenAGI enters this competitive landscape as an independent alternative, positioning superior benchmark performance and lower costs against the massive resources of its well-funded competitors. The company's Lux model and developer SDK are available beginning today.

Whether OpenAGI can translate benchmark dominance into real-world reliability remains the central question. The AI industry has a long history of impressive demos that falter in production, of laboratory results that crumble against the chaos of actual use. Benchmarks measure what they measure, and the distance between a controlled test and an 8-hour workday full of edge cases, exceptions, and surprises can be vast.

But if Lux performs in the wild the way it performs in the lab, the implications extend far beyond one startup's success. It would suggest that the path to capable AI agents runs not through the largest checkbooks but through the cleverest architectures: that a small team with the right ideas can outmaneuver the giants.

The technology industry has seen that story before. It rarely stays true for long.
