To: US Govt, major govts, Microsoft, Apple, NVIDIA, Alphabet, Amazon, Meta, Tesla, Citi, Tencent, IBM, & 10,000+ more recipients…
From: Dr Alan D. Thompson <LifeArchitect.ai>
Sent: 12/Sep/2024
Subject: The Memo - AI that matters, as it happens, in plain English
AGI: 76% ➜ 81%
The BIG Stuff
OpenAI releases o1
Once again, we have this out to The Memo readers within just a few hours of model release.
Key points:
This is an extended reasoning model that ‘thinks’ before responding. The ‘thinking’ sometimes takes 20-30 seconds.
New highest MMLU score (o1=92.3 vs Claude 3.5S=88.7).
New highest GPQA score (o1=78.3 vs Claude 3.5S=67.2).
My initial testing shows this model outperforms all other models, and hits benchmark ceilings.
OpenAI o1 (reasoning model) consistently scores 100% on all ALPrompts. These are hardened prompts designed for frontier models. I hadn't expected the 2024 H2 version to be solved for a long time (before o1, no LLM tested as of Sep/2024 had scored more than 2/5 on this prompt). I will be re-evaluating my life's work...
The model also hits the ‘uncontroversially correct’ ceilings on major benchmarks (GPQA Extended ceiling is 74%, MMLU ceiling is about 90%).
GPQA Diamond=78.3.
MMLU=90.8 for o1-preview, 92.3 for the final model.
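For readers who want to check the arithmetic behind these numbers, here is a minimal sketch in Python. It uses only the scores and ceilings quoted in this edition; the `scores` table and the `margins` helper are names made up for illustration. One caveat: the 74% ceiling is quoted for GPQA Extended while the 78.3 score is for GPQA Diamond, so that comparison is indicative rather than exact.

```python
# Scores quoted in this edition, in percent.
# "ceiling" is the approximate 'uncontroversially correct' ceiling given above
# (assumption: the GPQA Extended ceiling is applied to the GPQA Diamond score).
scores = {
    "MMLU": {"o1": 92.3, "Claude 3.5S": 88.7, "ceiling": 90.0},
    "GPQA": {"o1": 78.3, "Claude 3.5S": 67.2, "ceiling": 74.0},
}


def margins(row):
    """Return (o1 lead over Claude 3.5S, o1 distance above ceiling) in points."""
    lead = round(row["o1"] - row["Claude 3.5S"], 1)
    over_ceiling = round(row["o1"] - row["ceiling"], 1)
    return lead, over_ceiling


for name, row in scores.items():
    lead, over = margins(row)
    print(f"{name}: o1 leads by {lead} points and sits {over} points above the ceiling")
```

On these figures, the lead over Claude 3.5S is much wider on GPQA than on MMLU, and both o1 scores sit above the quoted ceilings.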
Here’s a visualization of the distance between o1 and other models on major benchmarks. Note that there is nowhere left to go at the top; AI has now hit the human-comprehensible ceiling across standardized testing for 'smarts':
Read the announcement: https://openai.com/index/introducing-openai-o1-preview/
Read the system card (no arch details): https://openai.com/index/openai-o1-system-card/
See also my Models table and Timeline.
You can use it immediately within ChatGPT Plus (paid, login):
Playground: https://chatgpt.com/?model=o1-preview
I’ll be livestreaming about this model about an hour after this email goes out (link):
I’d like to invite you to gift a subscription to someone in your world who needs AI that matters, as it happens, in plain English:
All my very best,
Alan
LifeArchitect.ai