Fresh off releasing the latest version of its Olmo foundation model, the Allen Institute for AI (Ai2) launched its open-source video model, Molmo 2, on Tuesday, aiming to show that smaller, open models can be viable options for enterprises focused on video understanding and analysis.
In a press release, the company said Molmo 2 “takes Molmo’s strengths in grounded vision and expands them to video and multi-image understanding,” a capability that has largely been dominated by larger proprietary models.
Ai2 launched three variants of Molmo 2:
- Molmo 2 8B, a Qwen 3-based model that Ai2 describes as its “best overall model for video grounding and QA”
- Molmo 2 4B, designed for more efficient deployments
- Molmo 2-O 7B, built on the Olmo model
Molmo 2 supports single-image and multi-image inputs, as well as video clips of varying lengths, enabling tasks such as video grounding, tracking, and question answering.
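If Molmo 2 follows the Hugging Face loading pattern of the original Molmo checkpoints, a grounded multi-image query would look roughly like the sketch below. The model ID allenai/Molmo-2-8B, the frame filenames, and the processor and generation calls are assumptions carried over from the original Molmo model card, not confirmed details of the new release.

```python
# Minimal sketch: querying a Molmo 2 checkpoint via Hugging Face transformers.
# Assumes Molmo 2 reuses the loading/processing pattern of the original Molmo;
# the model ID below is a placeholder, not a confirmed repository name.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

MODEL_ID = "allenai/Molmo-2-8B"  # hypothetical identifier

processor = AutoProcessor.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# Multi-image input: a few frames sampled from a clip, plus a grounded question.
frames = [Image.open(f"frame_{i:03d}.jpg") for i in range(4)]
inputs = processor.process(images=frames, text="Point to the person holding the ball.")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=256, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
# Decode only the newly generated tokens.
generated = output[0, inputs["input_ids"].size(1):]
print(processor.tokenizer.decode(generated, skip_special_tokens=True))
```

In the original Molmo, a “point to …” prompt returns coordinates in the generated text; the sketch assumes that grounding behavior carries over to Molmo 2.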
“One of our core design goals was to close a significant gap in open models: grounding,” Ai2 said in its press release.
The company first launched the Molmo family of open multimodal models last year, beginning with images. Ai2 said Molmo 2 surpasses earlier versions in accuracy, temporal understanding, and pixel-level grounding, and in some cases performs competitively with larger models such as Google’s Gemini 3.
How Molmo 2 compares
Despite their smaller size, the Molmo 2 models outperformed Gemini 3 Pro and other open-weight competitors on video tracking benchmarks.
For image and multi-image reasoning, Ai2 said Molmo 2 8B “leads all open-weight models, with the 4B variant close behind.” The 8B and 4B models also showed strong performance in the open-weight Elo human preference evaluation, though Ai2 noted that larger proprietary models continue to lead that benchmark overall.
But Molmo 2’s biggest gains are in video grounding and video counting, where it outscores comparable open-weight models.
“These results highlight both progress and remaining headroom — video grounding is still hard, and no model yet reaches 40% accuracy,” Ai2 said, referring to current benchmarks.
Many video models, such as Google’s Veo 3.1 and OpenAI’s Sora, tend to be very large. Molmo 2 targets a different tradeoff: smaller, open models optimized for grounding and analysis rather than video generation.