
Did xAI lie about Grok 3’s benchmarks?


Debates over AI benchmarks, and how AI labs report them, are spilling out into public view.

This week, an OpenAI employee accused Elon Musk’s AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. Igor Babuschkin, one of xAI’s co-founders, insisted that the company was in the right.

The truth lies somewhere in between.

In a post on xAI’s blog, the company published a graph showing Grok 3’s performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME’s validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model’s math ability.

xAI’s graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI’s best available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI’s graph did not include o3-mini-high’s AIME 2025 score at “cons@64.”

What is cons@64, you might ask? It is short for “consensus@64,” and it essentially gives a model 64 attempts at each problem in a benchmark, taking the most frequently generated answer to each problem as the final answer. As you can imagine, cons@64 tends to boost models’ benchmark scores, and omitting it from a graph can make one model appear to surpass another when in reality that is not the case.
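
Mechanically, cons@64 is just repeated sampling plus a majority vote. Below is a minimal, illustrative sketch of how such a score could be computed; `consensus_at_k`, `sample_answer`, and the toy model are hypothetical names invented for this example, not xAI’s or OpenAI’s actual evaluation code.

```python
import random
from collections import Counter

def consensus_at_k(sample_answer, problems, k=64):
    """Score a model by majority vote over k sampled answers per problem.

    sample_answer: callable taking a problem string and returning one
        answer string (a stand-in for a single model API call).
    problems: dict mapping each problem statement to its reference answer.
    """
    correct = 0
    for problem, reference in problems.items():
        # Draw k independent samples from the model for this problem.
        samples = [sample_answer(problem) for _ in range(k)]
        # The consensus answer is whichever answer appeared most often.
        consensus, _count = Counter(samples).most_common(1)[0]
        if consensus == reference:
            correct += 1
    return correct / len(problems)

def toy_model(problem):
    # A fake "model" that answers correctly 70% of the time.
    return "42" if random.random() < 0.7 else "7"

# With k=1 (the "@1" setting) the toy model scores about 0.7 on average;
# with k=64 the majority vote almost always lands on "42", so the score
# is nearly always 1.0. That is the boost the chart dispute is about.
print(consensus_at_k(toy_model, {"What is 6 x 7?": "42"}, k=1))
print(consensus_at_k(toy_model, {"What is 6 x 7?": "42"}, k=64))
```

The gap between those two numbers is exactly why comparing one model’s cons@64 score against another model’s @1 score is misleading.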

Grok 3 Reasoning Beta’s and Grok 3 mini Reasoning’s AIME 2025 scores at “@1,” meaning the first score the models got on the benchmark, fall below o3-mini-high’s score. Grok 3 Reasoning Beta also trails ever so slightly behind OpenAI’s o1 model set to “medium” computing. Yet xAI is advertising Grok 3 as “the smartest AI in the world.”

Babuschkin argued on X that OpenAI has published similarly misleading benchmark charts in the past, albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more “accurate” graph showing nearly every model’s performance at cons@64:

But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models’ limitations, and about their strengths.




