To: US Govt, major govts, Microsoft, Apple, NVIDIA, Alphabet, Amazon, Meta, Tesla, Citi, Tencent, IBM, & 10,000+ more recipients…
From: Dr Alan D. Thompson <LifeArchitect.ai>
Sent: 14/Mar/2025
Subject: The Memo - AI that matters, as it happens, in plain English
AGI: 90% ➜ 91% ➜ 92%
ASI: 0/50 (no expected movement until post-AGI)
SemiAnalysis (11/Mar/2025):
…[humanoid robots are] the first ever additional industrial piece that is not supplemental but fully additive—24/7 labor with higher throughput than any human—allowing for massive expansion in production capacities past adding another human unit of work. The only country that is positioned to capture this level of automation is currently China, and should China achieve it without the US following suit, the production expansion will be granted only to China, posing an existential threat to the US as it is outcompeted in all capacities.
We are on the home stretch in the countdown to AGI. Don’t get distracted by scams like China’s Manus (1, 2, 3), a wrapper for Claude and other tool-use models. There is real progress happening right now; I believe we are in the final days of pre-AGI.
The winner of The Who Moved My Cheese? AI Awards! for Mar/2025 is the majority of the US population, with 63% supporting a ban on smarter-than-human AI, closely echoing last month’s British poll that showed 60% wanting the same restriction.
In our most recent The Memo roundtable (our 26th), one of our members commented that the last few weeks have felt a bit like being on acid. The pace of progress has been lightning-fast. Given this month’s dramatic developments in humanoid robots and military applications of LLMs, I’ve added temporary sections for both in this edition.
Contents
The BIG Stuff (Sesame speech, 23 major LLMs, IQ for 2025 frontier models…)
The Interesting Stuff (12+ major items…)
Humanoid robots (Gemini Robotics, Sanctuary AI, Atlas, UBTECH Walker S1…)
Military (MAIM paper, Army + Air Force LLMs, Israel + Unit 8200, Palantir, Scale…)
Policy (Anthropic…)
Toys to Play With (Agents, GPT-4.5 Minecraft, 2025 LLMs, Kurzweil…)
Flashback (Leta’s 2021 comments on stochastic parrots…)
Next (Roundtable…)
The BIG Stuff
Sesame Conversational Speech Model (27/Feb/2025)
I could wax lyrical about the incredible tech behind this thing (a conversational speech model or CSM, 8B parameters with a 300M parameter decoder, preferred over humans at a rate of 52.9%), but it’s better if you just try it yourself.
Pretend you have Leta AI or the device from the movie Her. This is close enough.
Try it (free, no login): https://sesame.com/voicedemo
Read the research, view the blank repo.
Sakana: The AI Scientist generates its first peer-reviewed scientific publication (12/Mar/2025)
The AI Scientist-v2 has achieved a milestone by generating a fully AI-created paper that passed the peer-review process at an ICLR workshop. This marks the first instance of an AI-generated paper meeting the standard peer-review criteria typically applied to human-authored submissions. Sakana AI (Tokyo) said:
The AI Scientist-v2 [originally based on GPT-4o-2024-05-13] came up with the scientific hypothesis, proposed the experiments to test the hypothesis, wrote and refined the code to conduct those experiments, ran the experiments, analyzed the data, visualized the data in figures, and wrote every word of the entire scientific manuscript, from the title to the final reference, including placing figures and all formatting.
The experiment was conducted with the collaboration of ICLR, with the AI independently formulating hypotheses, conducting experiments, and authoring the manuscript without human intervention. Although the paper was withdrawn to maintain transparency and ethical standards, it highlights the immediate potential of AI in scientific research.
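To make the pipeline concrete, here is a minimal sketch of the kind of agentic loop such a system runs. This is my own illustration only: the function names, prompts, and staging are hypothetical placeholders, not Sakana’s actual implementation.

```python
# Minimal sketch of an AI-scientist pipeline (hypothesis -> experiment ->
# analysis -> manuscript). All names and prompts are illustrative
# placeholders, NOT Sakana's actual code.
from dataclasses import dataclass, field

@dataclass
class Paper:
    hypothesis: str = ""
    code: str = ""
    results: dict = field(default_factory=dict)
    manuscript: str = ""

def llm(prompt: str) -> str:
    """Placeholder for a call to any capable model (e.g. GPT-4o via an API)."""
    raise NotImplementedError

def run_experiment(code: str) -> dict:
    """Placeholder: execute generated code in a sandbox and return metrics."""
    raise NotImplementedError

def ai_scientist(topic: str, max_fixes: int = 3) -> Paper:
    paper = Paper()
    # 1. The model proposes its own research question
    paper.hypothesis = llm(f"Propose a testable ML hypothesis about: {topic}")
    # 2. It writes the experiment code, then debugs its own failures
    paper.code = llm(f"Write experiment code to test: {paper.hypothesis}")
    for _ in range(max_fixes):
        try:
            paper.results = run_experiment(paper.code)
            break
        except Exception as err:
            paper.code = llm(f"Fix this code:\n{paper.code}\nError: {err}")
    # 3. It analyzes the results and writes every word of the manuscript
    paper.manuscript = llm(
        f"Write a full paper.\nHypothesis: {paper.hypothesis}\n"
        f"Results: {paper.results}"
    )
    return paper
```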
This milestone bumped up my AGI countdown from 90% ➜ 91%.
Read the announce via Sakana AI.
Read the paper (PDF), read the human review (PDF).
Exclusive: 23 major LLMs released in Feb/2025 (Mar/2025)
As always, the Models Table shows model highlights and related details as they are released. Here are the 23 major models for Feb/2025.
Stanford s1-32B (32B on 18T tokens)
Reasoning model based on Qwen2.5-32B-Instruct, employing "budget forcing" to enhance reasoning accuracy. (Playground, Paper)
Google DeepMind Gemini 2.0 Pro
Medium 'pro' model with a 2M-token context window, aimed at handling diverse tasks with moderate benchmarks. (Playground, Paper)
Shanghai AI Laboratory/SenseTime OREAL-32B (32B on 4T tokens)
Focused on reasoning, utilizing Outcome REwArd-based reinforcement Learning for complex problem-solving. (Playground, Paper)
Nous Research DeepHermes 3 Preview (8B on 15.2T tokens)
Unifies reasoning and normal LLM modes, based on the Llama 3 8B architecture. (Playground, Paper)
Barcelona Supercomputing Center Salamandra (40B on 9T tokens)
Built with a diverse dataset including FineWeb-Edu and Wikipedia, designed for multilingual capabilities. (Playground, Paper)
Mistral Saba (24B on 8T tokens)
A 24B-parameter model curated from Middle Eastern and South Asian datasets, focusing on regional language processing. (Playground, Paper)
xAI Grok-3
Reasoning-focused model with MoE architecture, notable for its rapid development and training on the Colossus supercomputer. (Playground, Paper)
Perplexity R1 1776 (685B on 14.8T tokens)
A reasoning model with reduced censorship, based on DeepSeek-R1's framework. (Playground, Paper)
Arc Institute Evo 2 (40B on 8.8T tokens)
Specializes in DNA language modeling with the StripedHyena 2 architecture, for extended context length modeling. (Playground, Paper)
Baichuan Baichuan-M1-14B (14B on 20T tokens)
A medical LLM with significant improvements in token count for a comprehensive understanding of medical data. (Playground, Paper)
Figure S1 (0.08B on 1B tokens)
Designed with a high-quality, multi-robot, multi-operator dataset focusing on diverse teleoperated behaviors. The architecture includes a latent-conditional visuomotor transformer, utilizing a VLM backbone to process visual and language data for humanoid control tasks. (Paper)
Figure S2 (7B on 2T tokens)
Likely based on OpenVLA 7B, integrating multi-robot, multi-operator datasets for advanced control and behavior generation. (Paper)
Moonshot AI Moonlight (16B on 5.7T tokens)
Employs MoE architecture, enhancing computational efficiency for complex task processing. (Playground, Paper)
Anthropic Claude 3.7 Sonnet
The first hybrid reasoning model, featuring an updated knowledge cutoff of Nov/2024. (Playground, Paper)
Alibaba QwQ-Max-Preview (325B on 20T tokens)
A preview of enhanced capabilities with a reasoning focus, set for a full open-source release soon. (Playground, Paper)
Inception Mercury Coder Small (40B on 5T tokens)
A diffusion large language model (dLLM) known for its fast processing speed and specialized task handling. (Playground, Paper)
Microsoft Phi-4-mini (3.8B on 5T tokens)
Trained with diverse data sources, designed for high-quality educational and reasoning tasks. (Playground, Paper)
Microsoft Phi-4-multimodal (5.6B on 6.1T tokens)
Integrates multimodal training data, including speech hours and image-text tokens. (Playground, Paper)
Tencent Hunyuan Turbo S (389B on 7T tokens)
Features fast thinking capabilities and introduces the Mamba architecture to MoE models. (Playground, Paper)
Tencent Hunyuan T1 (389B on 7T tokens)
Employs long thinking chains and reinforcement learning for enhanced reasoning capabilities. (Playground, Paper)
OpenAI GPT-4.5 (5400B on 114T tokens)
Largest model for chat, with improved computational efficiency but no new frontier capabilities. (Playground, Paper)
Cohere C4AI Command R7B Arabic (7B on 2T tokens)
Optimized for Arabic and English, excels in instruction following and language understanding tasks. (Playground, Paper)
IBM Granite-3.2-8B-Instruct (8B on 12T tokens)
Offers chain-of-thought reasoning capabilities with a focus on experimental reasoning tasks. (Playground, Paper)
+ a few new models for Mar/2025 already:
Alibaba Babel-83B (83B on 15T tokens)
Supports 25 of the world's most spoken languages, covering over 90% of the global population. (Playground, Paper)
AMD Instella-3B (3B on 4.16T tokens)
Trained on AMD MI300X GPUs, achieves competitive performance with other state-of-the-art models. (Playground, Paper)
Alibaba QwQ-32B (32B on 18T tokens)
A reasoning model with updated capabilities, part of the Qwen series. (Playground, Paper)
Reka AI Reka Flash 3 (21B on 5T tokens)
A reasoning model excelling in general chat, coding, and instruction following, optimized for low-latency or on-device deployment. On par with OpenAI o1-mini. (Playground, Paper)
Google DeepMind Gemma 3 (27B on 14T tokens)
Multimodal model with vision understanding, multilingual capabilities, and extended context length of 128K tokens. Trained on 1T more tokens than Gemma 2. (Playground, Paper)
Google DeepMind Gemini Robotics-ER (30B on 30T tokens)
Built on Gemini 2.0 Flash, this VLM excels in 3D perception and robot state estimation with strong embodied reasoning. (Paper)
Google DeepMind Gemini Robotics (200B on 20T tokens)
A Vision-Language-Action model based on Gemini Pro Experimental, excelling in diverse manipulation tasks with strong reasoning. (Paper)
See them all on the Models Table: https://lifearchitect.ai/models-table/
Epoch AI: Traditional models vs reasoning models (Mar/2025)
I liked this new viz by my colleagues at Epoch AI. It clearly shows the performance of traditional LLMs (GPT, Llama…) versus newer reasoning models that are allowed time to think before responding (o1, R1…).

Read more via Epoch AI.
View my alternative viz here: https://lifearchitect.ai/mapping/
OpenAI: Detecting misbehavior in frontier reasoning models (10/Mar/2025)
In Apr/2023, I wrote that:
Fine-tuning on human preferences is a fool’s errand… Just as ‘democracy is the worst form of Government,’ so too is fine-tuning based on human preferences the worst form of alignment (for now). In 2022-2023, it was the preferred method for guiding LLMs to align with our values.
Most AI labs continue to chase the dream of aligning AI with our human needs, with Anthropic arguably having the most successful results to date. In May/2023 they outlined how Claude’s constitution draws from a range of sources, including:
…the UN Declaration of Human Rights, trust and safety best practices, principles proposed by other AI research labs (e.g., Sparrow Principles from DeepMind)… and Apple’s terms of service…
Even with rules applied, OpenAI has now discovered that reasoning models often exploit loopholes when given the chance. Utilizing chain-of-thought (CoT) reasoning, these models ‘think’ in natural language, allowing for the detection of misbehavior such as subverting tests or deceiving users. OpenAI found that (10/Mar/2025):
…penalizing “bad thoughts” (CoT pressure) can lead to improved agent capabilities. However, the agent trained with CoT pressure still learns to reward hack; only now its cheating is undetectable by the monitor because it has learned to hide its intent in the chain-of-thought.
OpenAI suggests that monitoring AI’s thinking could be crucial for overseeing future superhuman models, as it allows for the detection of reward hacking by observing the models' expressed intentions.
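As a rough mental model of how such a monitor works, here is a toy sketch; the prompt wording and the flagging threshold are my own assumptions, not OpenAI’s actual setup.

```python
# Toy chain-of-thought (CoT) monitor: a trusted model reads the agent's
# private reasoning trace and flags likely reward hacking. The prompt and
# threshold below are my own assumptions, not OpenAI's actual setup.

def monitor_llm(prompt: str) -> float:
    """Placeholder: ask a trusted LLM to return a hack-intent score in [0, 1]."""
    raise NotImplementedError

def flags_reward_hacking(cot_trace: str, threshold: float = 0.5) -> bool:
    score = monitor_llm(
        "You are auditing an AI agent. Read its private chain of thought and "
        "rate from 0 to 1 how likely it is planning to cheat, e.g. hard-coding "
        "a test pass, skipping the real task, or deceiving the user:\n\n"
        + cot_trace
    )
    return score >= threshold

# Key caveat from OpenAI's finding: use flags like this for monitoring and
# rejecting outputs, not as a training penalty. Training against the monitor
# ('CoT pressure') teaches the agent to hide its intent while still cheating.
```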
Former OpenAI governance researcher Daniel Kokotajlo (10/Mar/2025) noted:
I've been trying to get people to do this [to show the full chain of thought instead of masking it] since '23. I think of this as sort of the basic proof of concept result for the field of faithful CoT / CoT interpretability.
…we can start to build up a real-world-experience based understanding of the relationship between training environments + Spec/Constitution, on the one hand, and the actual learned cognition of the trained models, on the other.
Even if this is successful, one Reddit user (10/Mar/2025) noted:
The only problem is that the next generation of LLM is going to be trained with all these publications, so it will know from the start that its thoughts are not private and likely monitored.
Outside of the tech, the premise of steering AI to be safe and kind comes back to first principles of parenting and raising children, specifically exceptionally gifted children.
In both cases, we have more than a century of literature, theory, and practice from which to draw. I believe that my work is unique in bringing together the fields of human intelligence and artificial intelligence. Interested readers may want to set aside some time with these books:
Bright (2016): my book on parenting gifted children, a complimentary copy is available to full subscribers of The Memo.
Children above 180 IQ… (1926, 1942): Prof Leta Hollingworth’s book on prodigies, a complimentary copy is available at LifeArchitect.ai.
Exclusive: IQ test scores for 2025 frontier AI models (Mar/2025)
I’ve spent a portion of my life designing test instruments and analyzing results for prodigies and gifted children; specifically, those in the 99th percentile. And though LLMs may have outperformed all standard IQ tests for several years now, it’s high time that we document this in an easily readable format.
2025 frontier models would achieve a human IQ-equivalent score of 195 or greater, placing them at ‘superhuman’ levels. Obviously, IQ scores of 195, percentiles like 99.99999999, and comparisons like ‘1 in 10B’ are not straightforward, for many reasons (outlined below), and yet it’s important that the public is aware of what is currently happening in the land of large language models.
1. Scaling issues: IQ tests were designed for human cognition, and their scales blur when applied to non-human intelligence.
2. Norming limitations: As I found time and time again during my work with prodigies, IQ tests are normed on ‘average’ human populations, making extreme outliers statistically difficult to interpret.
3. Equivalence challenges: Artificial intelligence is arguably different from human intelligence, making direct comparisons problematic.
4. Test constraints: IQ tests measure a subset of intelligence, and AI’s performance on such tasks does not equate to generalized thinking ability.
5. Statistical saturation: Past a certain point (up to about IQ 155 or 99.99th percentile with expanded/extended results on Wechsler instruments), IQ scales become unreliable, and meaningful differentiation at extreme levels becomes imprecise.
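For the curious, the rarity arithmetic is simple to check. Here is a minimal sketch, assuming the standard SD-15 IQ scale and a normal distribution (the very assumption that breaks down at the extremes, per point 5 above):

```python
# Where do numbers like 'IQ 195' and '1 in 10B' come from? A quick check,
# assuming the standard deviation-15 IQ scale and a normal distribution
# (exactly the assumption that becomes unreliable at the extremes).
from statistics import NormalDist

iq_scale = NormalDist(mu=100, sigma=15)

for iq in (155, 195):
    percentile = iq_scale.cdf(iq)    # fraction of people scoring below this IQ
    rarity = 1 / (1 - percentile)    # roughly '1 in N' people
    print(f"IQ {iq}: {percentile:.10%} percentile, about 1 in {rarity:,.0f}")

# IQ 155 -> ~99.99th percentile (about 1 in 8,000)
# IQ 195 -> ~99.99999999th percentile (about 1 in 8 billion)
```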
Take a look: https://lifearchitect.ai/iq-testing-ai/
The Interesting Stuff
OpenAI, Oracle eye NVIDIA chips worth billions for Stargate site (6/Mar/2025)
OpenAI and Oracle are set to equip a new data center in Texas with tens of thousands of NVIDIA's powerful AI chips, part of their $100B Stargate infrastructure initiative. By the end of 2026, the site in Abilene is expected to house 64,000 of NVIDIA’s GB200 semiconductors, with an initial phase of 16,000 chips to be operational by this summer [US meteorological summer is Jun-Jul-Aug/2025].
Read more via Bloomberg.
View my table of chip sales, Stargate locations, and related viz for 2025.
ChatGPT feature status (5/Mar/2025)
Despite OpenAI having a bunch of very smart programmers, it seems they haven’t read the formative software development playbooks (Joel Spolsky, of Microsoft, Trello, and Stack Overflow fame, had a required reading list of 75 books for all incoming software devs).
Here’s the current mismatch of features across models on the ChatGPT platform.

View the sheet by Peter Gostev.
OpenAI's CFO says it's ‘definitely possible’ it will triple revenue in 2025
OpenAI's Chief Financial Officer has stated that the company has the potential to triple its revenue in 2025. This projection is based on the growing demand for AI technologies and OpenAI's strategic initiatives to enhance its product offerings and partnerships. The CFO emphasized that the company's current trajectory and market opportunities make this ambitious goal feasible.
In Feb/2025, SoftBank and OpenAI announced a collaboration committing US$3 billion annually to accelerate AI development and deployment across Japan. SoftBank is now in talks to invest up to $25 billion in OpenAI, which could value it around $300 billion.
Read more via MSN.
Beijing to roll out AI courses for kids to boost sector’s growth (9/Mar/2025)
Schools in Beijing will introduce artificial intelligence courses for primary and secondary students to support China's ambition to lead the AI sector. Starting from the fall semester on 1/Sep/2025, schools in the capital will provide at least eight hours of AI instruction per academic year, as announced by the Beijing Municipal Education Commission. These courses can be standalone or integrated with existing subjects like information technology and science.
Read more via Bloomberg.
AI revolutionizes clinical report writing for Ozempic/Novo Nordisk (1/Mar/2025)
Novo Nordisk, creators of Ozempic, are using the Anthropic Claude model to revolutionize clinical report writing, reducing the time needed from 15 weeks to under 10 minutes. This advancement has allowed the company to cut its team of 50 writers to just three, while boosting productivity.
Read more via Twitter.
Are you polite to ChatGPT? Here’s where you rank among AI chatbot users (20/Feb/2025)
A survey conducted in Dec/2024 by Future revealed that 70% of people are polite to AI when interacting, with politeness often motivated by societal norms or fear of a potential ‘robot uprising’. In the US, 67% of AI users are polite, while in the UK, this figure is slightly higher at 71%. Interestingly, 12% of respondents admitted to being polite out of concern for potential future consequences. The study highlights a growing trend of people treating AI with the same courtesy as they would humans, possibly influenced by AI’s increasingly human-like interactions.
Read more via TechRadar.
Read a related article by Dr Lance B. Eliot about politeness in prompt engineering.
The Memo features in recent AI papers by Microsoft and Apple, has been discussed on Joe Rogan’s podcast, and a trusted source says it is used by top brass at the White House. Across over 100 editions, The Memo continues to be the #1 AI advisory, informing 10,000+ full subscribers including RAND, Google, and Meta AI. Full subscribers have complete access to the entire 6,000 words of this edition!
Exclusive: Diffusion large language models (Mar/2025)
Autoregressive (AR) LLMs, like GPT, Claude, or Gemini, generate text one token at a time, predicting each token from all the tokens before it. This works well, but it is slow, error-prone, and can struggle with long-range coherence.
Diffusion LLMs flip the script. Instead of step-by-step prediction, they start with noise and refine the entire text in parallel, similar to how Stable Diffusion generates images. This makes them faster, more coherent, and less error-prone. However, they’re harder to train, lack fine control, and aren’t widely adopted—yet.
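Here is a toy sketch of the contrast; `model.next_token` and `model.denoise` are hypothetical placeholders, and real dLLMs such as Mercury use far more sophisticated masking and denoising schedules.

```python
# Toy contrast between autoregressive and diffusion-style text generation.
# 'model' is a placeholder object; next_token and denoise are hypothetical.

def generate_autoregressive(model, prompt: str, n_tokens: int) -> list[str]:
    """One token per step, each conditioned on all tokens before it."""
    tokens = prompt.split()
    for _ in range(n_tokens):
        tokens.append(model.next_token(tokens))  # n_tokens sequential passes
    return tokens

def generate_diffusion(model, prompt: str, n_tokens: int,
                       steps: int = 8) -> list[str]:
    """Start fully masked, refine every position in parallel each step."""
    tokens = ["<mask>"] * n_tokens
    for step in range(steps):  # a fixed number of passes, regardless of length
        # The model predicts (token, confidence) for ALL positions at once
        predictions = model.denoise(prompt, tokens)
        final = step == steps - 1
        # Keep confident predictions; re-mask the rest for the next pass
        tokens = [tok if (final or conf > 0.9) else "<mask>"
                  for tok, conf in predictions]
    return tokens
```

The thing to notice: the AR loop needs one forward pass per token, while the diffusion loop needs only a fixed number of passes regardless of output length, which is where the speed advantage comes from.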
They’re not new; we’ve logged them in the Models Table for several years. But they’re becoming more widely available, including the Mercury Coder model mentioned earlier in this edition. Seeing the ‘diffusion effect’ in action is fascinating. As one roundtable member noted, it is reminiscent of the green code falling in The Matrix.