Debates about AI benchmarks, and how they are reported by AI labs, are spilling out into public view.
This week, an OpenAI employee accused Elon Musk's AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of xAI's co-founders, Igor Babuschkin, insisted that the company was in the right.
The truth is somewhere in between.
In a post on xAI's blog, the company published a graph showing Grok 3's performance on AIME 2025, a collection of challenging math questions from a recently administered invitational mathematics exam. Some experts have questioned AIME's validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model's math ability.
xAI's graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI's best available model, o3-mini-high, on AIME 2025. But OpenAI employees on X quickly pointed out that xAI's graph didn't include o3-mini-high's AIME 2025 score at "cons@64."
What is cons@64, you might ask? Well, it's short for "consensus@64," and it basically gives a model 64 tries to answer each problem in a benchmark, then takes the answers it generated most frequently as the final answers. As you can imagine, cons@64 tends to boost models' benchmark scores quite a bit, and omitting it from a graph can make it look as though one model surpasses another when in reality that isn't the case.
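To make the difference concrete, here is a minimal sketch of how a single problem might be scored under "@1" versus consensus@64. The function names, the sample answers, and the 64-sample distribution are all hypothetical illustrations, not xAI's or OpenAI's actual evaluation code.

```python
from collections import Counter

def pass_at_1(sample_answers, correct_answer):
    """Score one problem at "@1": only the model's first sampled answer counts."""
    return sample_answers[0] == correct_answer

def cons_at_k(sample_answers, correct_answer):
    """Score one problem under consensus@k: take all k sampled answers
    and treat the most frequently generated one as the final answer."""
    most_common_answer, _ = Counter(sample_answers).most_common(1)[0]
    return most_common_answer == correct_answer

# Hypothetical 64 samples for one problem: the model's first try is wrong,
# but a plurality of its 64 tries converge on the right answer.
samples = ["40"] + ["42"] * 40 + ["40"] * 23

print(pass_at_1(samples, "42"))  # False: the first sample was wrong
print(cons_at_k(samples, "42"))  # True: majority vote lands on "42"
```

The same model, on the same problem, fails at "@1" but succeeds at cons@64, which is why comparing one model's consensus score against another model's single-attempt score can be misleading.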
Grok 3 Reasoning Beta's and Grok 3 mini Reasoning's scores for AIME 2025 at "@1" (the first score the models got on the benchmark) fall below o3-mini-high's score. Grok 3 Reasoning Beta also trails ever so slightly behind OpenAI's o1 model set to "medium" computing. Yet xAI is advertising Grok 3 as "the world's smartest AI."
Babuschkin argued on X that OpenAI has published similarly misleading benchmark charts in the past, albeit charts comparing the performance of its own models. A more neutral party in the debate put together a "more accurate" graph showing nearly every model's performance at cons@64:
Funny how some people see my plot as an attack on OpenAI and others as an attack on Grok, while in reality it's DeepSeek propaganda
(I actually believe Grok looks good there, and OpenAI's TTC chicanery behind o3-mini-*high*-pass@1 deserves more scrutiny.) https://t.co/djqljpcjh8 pic.twitter.com/3wh8fouf – Teortaxes ▶️ (DeepSeek Twitter 🐋 die-hard fan 2023 – ∞) (@teortaxesTex) February 20, 2025
But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models' limitations, as well as their strengths.