Meta’s new flagship AI language model family, Llama 4, arrived suddenly over the weekend, with the parent company of Facebook, Instagram, WhatsApp and Quest VR (among other services and products) revealing not one, not two, but three versions, all upgraded to be more powerful and performant using the popular Mixture-of-Experts architecture and a new training method involving fixed hyperparameters, known as MetaP.
All three are also equipped with massive context windows: the amount of information an AI language model can handle in a single input/output exchange with a user or tool.
But following Saturday’s surprise announcement and public release of two of those models for download and use (the smaller-parameter Llama 4 Scout and the mid-tier Llama 4 Maverick), the response from the AI community on social media has been less than adoring.
Llama 4 sparks confusion and criticism among AI users
An unverified post on the North American Chinese-language community forum 1point3acres made its way over to the r/LocalLlama subreddit on Reddit, alleging to be from a researcher at Meta’s GenAI organization who claimed that the model performed poorly on third-party benchmarks internally and that company leadership “suggested blending test sets from various benchmarks during the post-training process, aiming to meet the targets across various metrics and produce a ‘presentable’ result.”
The post was met with skepticism from the community as to its authenticity, and a VentureBeat email to a Meta spokesperson has not yet received a reply.
But other users found reasons to doubt the benchmarks regardless.
“At this point, I highly suspect Meta bungled up something in the released weights … if not, they should lay off everyone who worked on this and then use the money to acquire Nous,” commented @cto_junior on X, in reference to an independent user test showing Llama 4 Maverick’s poor performance (16%) on a benchmark known as aider polyglot, which runs a model through 225 coding tasks. That’s well below the performance of comparably sized, older models such as DeepSeek V3 and Claude 3.7 Sonnet.
Referencing the 10 million-token context window Meta boasted for Llama 4 Scout, AI PhD and author Andriy Burkov wrote on X, in part: “The declared 10M context is virtual because no model was trained on prompts longer than 256k tokens. This means that if you send more than 256k tokens to it, you will get low-quality output most of the time.”
Also on the r/LocalLlama subreddit, user Dr_Karminski wrote that “I’m incredibly disappointed with Llama-4,” and demonstrated its poor performance compared to DeepSeek’s non-reasoning V3 model on coding tasks such as simulating balls bouncing around a heptagon.
Former Meta researcher and current AI2 (Allen Institute for Artificial Intelligence) Senior Research Scientist Nathan Lambert took to his Interconnects Substack blog on Monday to point out that a benchmark comparison Meta posted to its own Llama download site, pitting Llama 4 Maverick against other models based on cost-to-performance on the third-party head-to-head comparison tool LMArena ELO (aka Chatbot Arena), actually used a different version of Llama 4 Maverick than the one the company had made publicly available, one “optimized for conversationality.”

As Lambert wrote: “Sneaky. The results below are fake, and it is a major slight to Meta’s community to not release the model they used to create their major marketing push. We’ve seen many open models that come around to maximize on ChatBotArena while destroying the model’s performance on important skills like math or code.”
Lambert went on to note that while this particular model on the arena was “tanking the technical reputation of the release because its character is juvenile,” including lots of emojis and frivolous emotive conversation, “The actual model on other hosting providers is quite smart and has a reasonable tone!”
In response to the torrent of criticism and accusations of benchmark cooking, Meta’s VP and Head of GenAI Ahmad Al-Dahle took to X to state:
“We’re glad to start getting Llama 4 in all your hands. We’re already hearing lots of great results people are getting with these models.
That said, we’re also hearing some reports of mixed quality across different services. Since we dropped the models as soon as they were ready, we expect it’ll take several days for all the public implementations to get dialed in. We’ll keep working through our bug fixes and onboarding partners.
We’ve also heard claims that we trained on test sets; that’s simply not true and we would never do that. Our best understanding is that the variable quality people are seeing is due to needing to stabilize implementations.
We believe the Llama 4 models are a significant advancement and we’re looking forward to working with the community to unlock their value.”
Yet even that response was met with many complaints of poor performance and calls for further information, such as more technical documentation outlining the Llama 4 models and their training processes, as well as more questions about why this release, compared to all prior Llama releases, was particularly riddled with issues.
It also comes on the heels of the departure of Meta’s VP of Research Joelle Pineau, who worked in the adjacent Meta Fundamental AI Research (FAIR) organization and announced on LinkedIn last week that she was leaving the company with “nothing but admiration and deep gratitude for each of my managers.” Pineau, it should be noted, also promoted the release of the Llama 4 model family this weekend.
Llama 4 continues to spread to other inference providers with mixed results, but it’s safe to say the initial release of the model family has not been a slam dunk with the AI community.
And the upcoming Meta LlamaCon on April 29, the first celebration and gathering for third-party developers of the model family, will likely have much fodder for discussion. We’ll be tracking it all; stay tuned.