The Memo - Special edition - xAI Grok 4 - Jul/2025
Grok 4 Heavy (multiple agents with tool use) scores HLE=44.4% and GPQA=88.9%
To: US Govt, major govts, Microsoft, Apple, NVIDIA, Alphabet, Amazon, Meta, Tesla, Citi, Tencent, IBM, & 10,000+ more recipients…
From: Dr Alan D. Thompson <LifeArchitect.ai>
Sent: 9/Jul/2025
Subject: The Memo - AI that matters, as it happens, in plain English
AGI: 94%
ASI: 0/50 (no expected movement until post-AGI)
xAI announces Grok 4
Once again, we have this out to The Memo readers within just a few minutes of the model's announcement.
A few days ago in my mid-2025 report (The sky is delivering, Jun/2025), I wrote:
It is likely that any model with a primary score at >50% on HLE, and a secondary score at >90% on GPQA is an ASI system, though no large language model system met these criteria as of mid-2025.
As of today, 9/Jul/2025, these criteria are very close to being met.

I would estimate the Grok 4 model to be around 5T parameters trained on 80T tokens (estimate only, centre point, based on workings in my GPT-5 and Grok papers). It is likely a mixture-of-experts model. ‘Grok 4 Heavy’ means multiple Grok 4 agents with tool use working together to answer each test question.
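As a rough sanity check on those numbers, the widely used approximation for dense-model training compute is C ≈ 6 × N × D (N parameters, D training tokens). Here is a minimal Python sketch using my estimates above and an assumed per-GPU throughput; neither is a disclosed figure, and the cross-check leans on the 300,000-H100 cluster figure discussed further below.

# Back-of-envelope pre-training compute via the common approximation
# C ≈ 6 * N * D (N = parameters, D = training tokens).
# All inputs are estimates from this memo, not disclosed figures.
N = 5e12           # ~5T parameters (estimate, centre point)
D = 80e12          # ~80T training tokens (estimate)
C = 6 * N * D      # ≈ 2.4e27 FLOPs

# Rough cross-check against the 300,000-H100 figure, assuming each H100
# sustains ~4e14 FLOP/s (~40% utilisation of its ~1e15 FLOP/s BF16 peak;
# an assumption, not a measured value).
gpus = 3e5
flops_per_gpu = 4e14
days = C / (gpus * flops_per_gpu) / 86_400
print(f"{C:.1e} FLOPs, roughly {days:.0f} days on 300,000 H100s")

# For a mixture-of-experts model only the active parameters per token
# count toward compute, so the true pre-training figure would be lower.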
Grok 4 Heavy on HLE=44.4%
Grok 4 Heavy on GPQA=88.9%
Grok 4 Heavy also scored 100% on the AIME25 math benchmark. Those questions and answers are very new, published online on 6/Feb/2025.
Introduced as ‘the smartest AI in the world,’ and assuming that all benchmark scores are based on fair and genuine training and testing, I believe that Grok 4 Heavy should be considered proto-ASI. Read more: https://lifearchitect.ai/asi/
The testing team must have been working down to the wire: the Vending-Bench result shown in the livestream is from a model called grok-4-0709 (that’s today!). It now sits at #1 on Vending-Bench, scoring more than 2.2× higher than Claude Opus 4, the previous state of the art.
Grok 4 was trained with compute equivalent to 300,000 NVIDIA H100 GPUs. Rather than significantly increasing pre-training compute, priority was given to reinforcement learning (RL) compute. As a reminder, there are two main approaches used to train large language models (a toy sketch contrasting them follows this list):
Pretraining: a bit like imitation learning, or watch and repeat.
Reinforcement learning (RL): more like trial-and-error learning.
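To make the contrast concrete, here is a deliberately toy Python sketch in which the “model” is just a lookup table from context to a preferred next token. This is an illustration of the two training signals only, not how an LLM is actually trained.

import random

# Toy contrast between the two training signals. The "model" is just a
# lookup table from context to a preferred next token.
vocab = ["mat", "dog", "moon"]

def pretraining_step(model, context, observed_next):
    # Imitation ("watch and repeat"): copy the next token seen in data.
    model[context] = observed_next

def rl_step(model, context, reward_fn):
    # Trial and error: sample an answer, keep it only if it earns reward.
    guess = random.choice(vocab)
    if reward_fn(context, guess) > 0:
        model[context] = guess

model = {}
pretraining_step(model, "the cat sat on the", "mat")

# The RL signal only says how good an attempt was, not what the right
# answer is, so the model needs many trials.
reward = lambda ctx, tok: 1 if tok == "mat" else 0
for _ in range(20):
    rl_step(model, "the cat sat on the", reward)
print(model)   # {'the cat sat on the': 'mat'}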
In The Memo edition 30/Jan/2025, we noted that RL mirrors the way prodigy children grow: by exploring, failing, and adapting. OpenAI described a visualization similar to the one I created and published in The Memo edition 28/May/2025:

Note that the significantly increased performance of Grok 4 (especially Grok 4 Heavy) comes from both the increased RL training and the increased test-time reasoning (‘thinking’) compute, where the model thinks for many minutes before providing a response.
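One simple way to picture Heavy-style parallel test-time compute is the sketch below: several agents answer the same question and their answers are aggregated by majority vote. The ask_model function is a hypothetical placeholder for a real model API call, and xAI has not published Grok 4 Heavy’s actual collaboration mechanism; majority voting is just the simplest aggregation scheme.

from collections import Counter

# Sketch of parallel test-time compute: ask several agents the same
# question and aggregate their answers by majority vote.
# ask_model is a hypothetical placeholder for a real model API call.

def ask_model(question, seed):
    # A real call would sample with temperature > 0 so that different
    # seeds explore different reasoning paths; this stub just varies.
    return ["42", "42", "41"][seed % 3]

def heavy_answer(question, n_agents=8):
    answers = [ask_model(question, seed) for seed in range(n_agents)]
    return Counter(answers).most_common(1)[0][0]

print(heavy_answer("What is 6 x 7?"))   # -> "42"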
See Grok 4 on the Models Table: https://lifearchitect.ai/models-table/
Watch the Grok 4 launch video: https://x.com/xai/status/1943158495588815072
Read my independent paper on Grok (Feb/2025): https://lifearchitect.ai/whats-in-grok/
Try Grok 4: https://grok.com/?referrer=x#subscribe
I livestreamed my analysis of this model (link):
All my very best,
Alan
LifeArchitect.ai