
- Hugging Face’s Thomas Wolf says it is getting harder to tell which AI model is the best as traditional AI benchmarks become saturated. Going forward, Wolf said, the AI industry may rely on two new benchmarking approaches: agency-based and use-case-specific.
Thomas Wolf, co-founder and chief scientist at Hugging Face, thinks we may need new ways to measure AI models.
Wolf told the audience at Brainstorm AI in London that as AI models get more advanced, it is becoming increasingly difficult to tell which one performs best.
“It’s getting hard to tell what the best model is,” he said, pointing to the nominal differences between recent releases from OpenAI and Google. “They all seem to be, actually, very close.”
“The world of benchmarks has evolved a lot. We used to have this very academic benchmark where we basically measured the knowledge of the model. I think the most famous was MMLU (Massive Multitask Language Understanding), which was basically a set of graduate-level or PhD-level questions that the model had to answer,” he said. “These benchmarks are basically all saturated right now.”
Over the past year, a growing chorus of voices from academia, industry, and policy has claimed that common AI benchmarks, such as MMLU, GLUE, and HellaSwag, have reached saturation, can be gamed, and no longer reflect real-world utility.
In February, researchers at the European Commission’s Joint Research Centre published a paper titled “Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation,” which found “systemic flaws in current benchmarking practices,” including misaligned incentives, construct-validity failures, gaming of results, and data contamination.
Going into 2025, Wolf said, the AI industry should rely on two main kinds of benchmarks: one assessing the agency of models, where LLMs are expected to carry out tasks, and another tailored to each specific use case.
Hugging Face is already working on the latter.
The company’s new program, “Your Bench,” aims to help users determine which model to use for a particular task. Users feed a few documents into the program, which automatically generates a benchmark specific to that type of work; users can then run the benchmark against different models to see which one is best for their use case.
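To make the idea concrete, here is a minimal sketch of a document-derived benchmark, not the actual Your Bench pipeline: a strong model derives question-answer pairs from a user’s own documents, and each candidate model is then scored against those pairs. The model IDs, prompts, input file, and the naive substring grader are all illustrative assumptions.

```python
# Sketch of a document-derived benchmark (NOT the actual Your Bench pipeline).
# Model IDs, prompts, and the substring grader are illustrative assumptions.
from huggingface_hub import InferenceClient

GENERATOR = "meta-llama/Llama-3.1-70B-Instruct"  # assumed question-writer model
CANDIDATES = ["google/gemma-2-9b-it", "mistralai/Mistral-7B-Instruct-v0.3"]

def make_questions(documents: list[str], n: int = 5) -> list[dict]:
    """Ask a strong model to derive Q&A pairs from the user's own documents."""
    client = InferenceClient(GENERATOR)
    pairs = []
    for doc in documents:
        prompt = (
            f"Write {n} factual questions answerable only from the text below, "
            "one per line, in the form 'Q: ... || A: ...'.\n\n" + doc
        )
        reply = client.chat_completion(
            messages=[{"role": "user", "content": prompt}], max_tokens=512
        ).choices[0].message.content
        for line in reply.splitlines():
            if "||" in line:
                q, a = line.split("||", 1)
                pairs.append({"question": q.split(":", 1)[-1].strip(),
                              "answer": a.split(":", 1)[-1].strip()})
    return pairs

def score(model_id: str, pairs: list[dict]) -> float:
    """Crude grader: does the candidate's reply contain the reference answer?"""
    client = InferenceClient(model_id)
    hits = 0
    for pair in pairs:
        reply = client.chat_completion(
            messages=[{"role": "user", "content": pair["question"]}],
            max_tokens=128,
        ).choices[0].message.content
        hits += pair["answer"].lower() in reply.lower()
    return hits / len(pairs)

if __name__ == "__main__":
    docs = [open("my_document.txt").read()]  # hypothetical user document
    bench = make_questions(docs)
    for model_id in CANDIDATES:
        print(f"{model_id}: {score(model_id, bench):.0%}")
```

A real pipeline would use a stronger grader (for example, an LLM judge rather than substring matching), but the shape is the same: the benchmark is built from the user’s own material rather than a fixed academic question set.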
“Just because these models all perform the same on this academic benchmark doesn’t really mean that they’re all exactly the same,” Wolf said.
Open-source’s ‘ChatGPT moment’
Founded by Wolf, Clément Delangue, and Julien Chaumond in 2016, Hugging Face has long been a champion of open-source AI.
Often called the GitHub of machine learning, the company provides an open-source platform that lets developers, researchers, and enterprises build, share, and deploy machine-learning models, datasets, and applications at scale. Users can also browse models and datasets that others have uploaded.
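For a sense of what that sharing looks like in practice, the short snippet below uses the huggingface_hub client library to browse and fetch publicly shared artifacts; the filter tag and repository names are illustrative examples, not a prescribed workflow.

```python
# Browse and download shared models from the Hugging Face Hub.
# The filter tag and repo/file names below are illustrative examples.
from huggingface_hub import hf_hub_download, list_models

# List the five most-downloaded text-classification models on the Hub.
for model in list_models(filter="text-classification",
                         sort="downloads", direction=-1, limit=5):
    print(model.id, model.downloads)

# Download a single file from a public model repository to the local cache.
path = hf_hub_download(repo_id="distilbert-base-uncased", filename="config.json")
print("cached at:", path)
```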
Wolf told the Brainstorm AI audience that Hugging Face’s “business model is really aligned with open source” and that the company’s “goal is to have the maximum number of people participating in this kind of open community and sharing models.”
Wolf predicted that open-source AI would continue to thrive, especially after the success of DeepSeek earlier this year.
After its launch late last year, the Chinese-made AI model DeepSeek R1 sent shockwaves through the AI world when testers found that it matched or even outperformed American closed-source AI models.
Wolf said DeepSeek was a “ChatGPT moment” for open-source AI.
“Just like ChatGPT was the moment the whole world discovered AI, DeepSeek was the moment the whole world discovered there was kind of this open community,” he said.
This story was originally featured on Fortune.com