The Memo - Special edition - xAI Grok 4 - Jul/2025
Grok 4 Heavy (multiple agents with tool use) scores HLE=44.4% and GPQA=88.9%
To: US Govt, major govts, Microsoft, Apple, NVIDIA, Alphabet, Amazon, Meta, Tesla, Citi, Tencent, IBM, & 10,000+ more recipients…
From: Dr Alan D. Thompson <LifeArchitect.ai>
Sent: 9/Jul/2025
Subject: The Memo - AI that matters, as it happens, in plain English
AGI: 94%
ASI: 0/50 (no expected movement until post-AGI)
xAI announces Grok 4
Once again, we have this out to The Memo readers within just a few minutes of the model's announcement.
A few days ago in my mid-2025 report (The sky is delivering, Jun/2025), I wrote:
It is likely that any model with a primary score at >50% on HLE, and a secondary score at >90% on GPQA is an ASI system, though no large language model system met these criteria as of mid-2025.
As of today, 9/Jul/2025, these criteria are very close to being met.

I would estimate the Grok 4 model to be around 5T parameters trained on 80T tokens (estimate only, centre point, based on workings in my GPT-5 and Grok papers). It is likely a mixture-of-experts model. ‘Grok 4 Heavy’ means multiple Grok 4 agents with tool use working together to answer each test question.
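As a rough sanity check on those numbers, the widely used approximation for dense-model training compute is C ≈ 6 × N × D (N parameters, D training tokens). Here is a minimal Python sketch using my estimates above and an assumed per-GPU throughput; neither is a disclosed figure, and the cross-check leans on the 300,000-H100 cluster figure discussed further below.

# Back-of-envelope pre-training compute via the common approximation
# C ≈ 6 * N * D (N = parameters, D = training tokens).
# All inputs are estimates from this memo, not disclosed figures.
N = 5e12           # ~5T parameters (estimate, centre point)
D = 80e12          # ~80T training tokens (estimate)
C = 6 * N * D      # ≈ 2.4e27 FLOPs

# Rough cross-check against the 300,000-H100 figure, assuming each H100
# sustains ~4e14 FLOP/s (~40% utilisation of its ~1e15 FLOP/s BF16 peak;
# an assumption, not a measured value).
gpus = 3e5
flops_per_gpu = 4e14
days = C / (gpus * flops_per_gpu) / 86_400
print(f"{C:.1e} FLOPs, roughly {days:.0f} days on 300,000 H100s")

# For a mixture-of-experts model only the active parameters per token
# count toward compute, so the true pre-training figure would be lower.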
Grok 4 Heavy on HLE=44.4%
Grok 4 Heavy on GPQA=88.9%
Grok 4 Heavy also scored 100% on the AIME25 math benchmark. Those questions and answers are very new, published online on 6/Feb/2025.
Introduced as ‘the smartest AI in the world,’ and assuming that all benchmark scores are based on fair and genuine training and testing, I believe that Grok 4 Heavy should be considered proto-ASI. Read more: https://lifearchitect.ai/asi/
The testing team must have been working down to the wire: the Vending-Bench result shown in the livestream is from a model called grok-4-0709 (that’s today!). It now sits at #1 on Vending-Bench, scoring more than 2.2× higher than Claude Opus 4, the previous state of the art.
Grok 4 was trained with compute equivalent to 300,000 NVIDIA H100 GPUs. Rather than significantly increasing pre-training compute, priority was given to reinforcement learning (RL) compute. As a reminder, there are two main approaches used to train large language models (a toy sketch contrasting them follows this list):
Pretraining: a bit like imitation learning, or watch and repeat.
Reinforcement learning (RL): more like trial-and-error learning.
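To make the contrast concrete, here is a deliberately toy Python sketch in which the “model” is just a lookup table from context to a preferred next token. This is an illustration of the two training signals only, not how an LLM is actually trained.

import random

# Toy contrast between the two training signals. The "model" is just a
# lookup table from context to a preferred next token.
vocab = ["mat", "dog", "moon"]

def pretraining_step(model, context, observed_next):
    # Imitation ("watch and repeat"): copy the next token seen in data.
    model[context] = observed_next

def rl_step(model, context, reward_fn):
    # Trial and error: sample an answer, keep it only if it earns reward.
    guess = random.choice(vocab)
    if reward_fn(context, guess) > 0:
        model[context] = guess

model = {}
pretraining_step(model, "the cat sat on the", "mat")

# The RL signal only says how good an attempt was, not what the right
# answer is, so the model needs many trials.
reward = lambda ctx, tok: 1 if tok == "mat" else 0
for _ in range(20):
    rl_step(model, "the cat sat on the", reward)
print(model)   # {'the cat sat on the': 'mat'}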
In The Memo edition 30/Jan/2025, we noted that RL mirrors the way prodigy children grow: by exploring, failing, and adapting. OpenAI described a visualization similar to the one I created and published in The Memo edition 28/May/2025:

Note that the significantly increased performance of Grok 4 (especially Grok 4 Heavy) comes from both the increased RL training and the increased test-time reasoning (‘thinking’) compute, where the model thinks for many minutes before providing a response.
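One simple way to picture Heavy-style parallel test-time compute is the sketch below: several agents answer the same question and their answers are aggregated by majority vote. The ask_model function is a hypothetical placeholder for a real model API call, and xAI has not published Grok 4 Heavy’s actual collaboration mechanism; majority voting is just the simplest aggregation scheme.

from collections import Counter

# Sketch of parallel test-time compute: ask several agents the same
# question and aggregate their answers by majority vote.
# ask_model is a hypothetical placeholder for a real model API call.

def ask_model(question, seed):
    # A real call would sample with temperature > 0 so that different
    # seeds explore different reasoning paths; this stub just varies.
    return ["42", "42", "41"][seed % 3]

def heavy_answer(question, n_agents=8):
    answers = [ask_model(question, seed) for seed in range(n_agents)]
    return Counter(answers).most_common(1)[0][0]

print(heavy_answer("What is 6 x 7?"))   # -> "42"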
See Grok 4 on the Models Table: https://lifearchitect.ai/models-table/
Watch the Grok 4 launch video: https://x.com/xai/status/1943158495588815072
Read my independent paper on Grok (Feb/2025): https://lifearchitect.ai/whats-in-grok/
Try Grok 4: https://grok.com/?referrer=x#subscribe
I livestreamed my analysis of this model (link):
All my very best,
Alan
LifeArchitect.ai