The Memo - 21/Apr/2024

Reka Core, Grok-1.5V, frontier model benchmarks GPQA + MMLU, and much more!

Apr 20, 2024

To:      US Govt, major govts, Microsoft, Apple, NVIDIA, Alphabet, Amazon, Meta, Tesla, Citi, Tencent, IBM, & 10,000+ more recipients…
From:    Dr Alan D. Thompson <LifeArchitect.ai>
Sent:    21/Apr/2024
Subject: The Memo - AI that matters, as it happens, in plain English
AGI:     72%

OpenAI CEO (he says this a lot; May/2023 & 16/Apr/2024):
I do think that when we [have AGI/ASI and] look back at the standard of living and what we tolerate for people today [in 2024], it will look even worse than when we look back at how people lived 500 or 1,000 years ago. ‘Can you imagine that people lived in poverty? Can you imagine people suffered from disease? Can you imagine that everyone didn't have a phenomenal education and were able to live their lives however they wanted?’ It's going to look barbaric.

I’ve opened up my main testing prompts by putting them behind a ‘no index’ protected page. It’s quick and easy to use, and the password is provided in the title. Feel free to try this with your favorite model. Only the latest gpt-4-turbo-2024-04-09 model achieves 5/5, and many models struggle to format a table and score even 1/5! I’ll be refreshing the prompt shortly for the second half of 2024.

ALPrompt: https://lifearchitect.ai/ALPrompt/

The BIG Stuff

Ray Kurzweil says that technological progress is moving faster than his forecasts (12/Apr/2024)

Kurzweil is now suggesting that we’ll hit AGI by 2026. That would be very close to my countdown, which shows AGI around 2025…

18m48s: I made a prediction in 1999 [2029 for AGI and 2045 for the Singularity]. It feels like we're two or three years ahead of that…
22m30s: [When we achieve artificial general intelligence, AGI, median human level in] 2026, we may not be able to understand everything going on, but we can understand it. Maybe it's like a hundred humans, but that's not beyond what we can comprehend. [Artificial superintelligence, ASI in] 2045 will be like a million humans, and we can’t begin to understand that. So approximately at that time, we borrowed this phrase from physics and called it a ‘Singularity’.

Watch the interview on YouTube (18m48s timecode).

Reka Core (15/Apr/2024)

Commercial AI lab Reka AI has unveiled Reka Core, its largest and most capable multimodal language model to date. Core is competitive with leading industry models across key evaluation metrics. Its capabilities include image/video understanding, 128K context window, superb reasoning abilities, coding and agentic workflows, and fluency in 32 languages.

The paper provides no architecture details, but I’ve estimated this model at 300B parameters trained on 10T tokens. The performance is surprisingly poor, scoring only 2/5 on my ALPrompt, probably because they don’t have whatever secret sauce OpenAI et al. are applying.

Read the announcement.

Read the paper: https://publications.reka.ai/reka-core-tech-report.pdf (PDF, 21 pages)

Try it (login): https://chat.reka.ai/chat or via Poe.com (login): https://poe.com/RekaCore

Grok-1.5 Vision Preview (12/Apr/2024)

Grok-1.5V(ision) is xAI's first-generation multimodal model that can process a wide variety of visual information, including documents, diagrams, charts, screenshots, and photographs. Grok outperforms peers in the new RealWorldQA benchmark measuring real-world spatial understanding. Grok-1.5V will be available soon to early testers and existing Grok users.

Grok-1.5V is similar to other vision models: GPT-4V, Claude 3 Opus, Gemini 1.5 Pro.

See also: The Memo - Special edition: Llama 3 (Dr Alan D. Thompson, Apr 18).

The Interesting Stuff

Exclusive: Scaling laws updates (Apr/2024)

Following the release of several new papers this month, I’ve updated my Chinchilla advisory note, expanding the literature references from the original tokens:parameters ratio of Chinchilla 20:1 to new findings from Mosaic at an average of 190:1, Tsinghua at 192:1, the Epoch AI replication at 26:1, and the latest findings from Llama 3 8B at 1,875:1.

In plain English (well, plain-ish for this nerd stuff), to match new findings from Mosaic and Tsinghua in particular, models should now be trained using about 112× more data per parameter than was used for GPT-3, and about 6.6× more data per parameter than was used for Llama 2.

In plainer English, and possibly oversimplified:

For the best performance, and at 2024 median model sizes and training budgets, if a large language model used to read 20M books during training, then it should instead read more than 190M books (or 40M books five times!).
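A rough sketch of the arithmetic behind these multiples follows. The parameter and token counts are assumptions taken from public reports rather than from this memo (GPT-3: 175B parameters on 300B tokens; Llama 2 70B: 2T tokens; Llama 3 8B: 15T tokens), and the function names are purely illustrative. With these inputs the multiples land close to the 112× and 6.6× quoted above; small differences come down to rounding of the underlying ratios.

```python
# Tokens-per-parameter ratios and the implied data multiple.
# ASSUMPTION: model sizes and token counts are publicly reported
# figures, not numbers stated in this memo.
MODELS = {
    "GPT-3": (175e9, 300e9),        # (parameters, training tokens)
    "Llama 2 70B": (70e9, 2e12),
    "Llama 3 8B": (8e9, 15e12),
}
TARGET_RATIO = 190  # Mosaic average tokens:parameters quoted above


def tokens_per_param(params, tokens):
    """Tokens seen in training per model parameter."""
    return tokens / params


def data_multiple(params, tokens, target=TARGET_RATIO):
    """How many times more tokens per parameter the target ratio implies."""
    return target / tokens_per_param(params, tokens)


for name, (params, tokens) in MODELS.items():
    ratio = tokens_per_param(params, tokens)
    multiple = data_multiple(params, tokens)
    print(f"{name}: {ratio:,.1f}:1 -> {multiple:.1f}x data to reach {TARGET_RATIO}:1")
```

A multiple below 1 (as for Llama 3 8B) means the model was already trained well past the target ratio. The same ratio logic sits behind the book analogy: moving from 20:1 to 190:1 is a 9.5× increase, or 20M books becoming 190M.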

GPQA benchmarks (12/Apr/2024)

My friends at Epoch AI have visualized the major frontier models using GPQA, a strict new ‘Google-proof’ benchmark by NYU, Anthropic, and Cohere.

[Sidenote: GPQA is designed by PhDs, and I’ve estimated that it measures IQ 115-135. The only benchmark above that is my BASIS suite, designed for IQ 170-190 (Nov/2023).]

Interestingly, current frontier models are only just breaching the 50% mark on GPQA in Q1 2024; a good sign of a rigorous test!

Claude 3 Opus GPQA=50.4%
gpt-4-turbo-2024-04-09 GPQA=46.5%

Source via Epoch AI.

MMLU benchmarks for open vs closed models (13/Apr/2024)

MMLU is an older standard test for AI, and it no longer has a high enough ceiling to test frontier models in 2024. Several models (GPT-4, Gemini Ultra, Claude 3 Opus) now score close to or above 90%, and the test suite has a very high error rate (read an analysis of error rates in MMLU from Aug/2023, and watch my video on the ceiling issue from May/2023).

However, this chart was interesting to me because it shows the gap between open and closed models narrowing. Compare this month’s Mixtral 8x22B (Apr/2024) with Claude 2 (Jul/2023). If we were to interpret it generously, in terms of model performance, it looks like the best open models are now only ~9 months behind closed models…

MMLU scores for closed-source vs open-weight models to 2024.

Source via Twitter.

[Sidenote: The MMLU scores for this month’s gpt-4-turbo-2024-04-09 are unclear, but may have improved by one percentage point, from 80.48% to 81.48% (12/Apr/2024).]

AI 50: The Top Artificial Intelligence Startups (11/Apr/2024)

FORBES: AI 50 EMPLOYEE COUNT PER AI LAB, 2023 → 2024

Forbes’ sixth annual AI 50 list shows AI startups are getting younger and leaner, with a median headcount of 89 employees, down from 150 last year. The list is dominated by 28 new entrants, many of which are selling AI infrastructure or applying AI to real-world use cases. AI is booming, with the AI 50 honorees raising a total of US$34.7B.

FORBES: THE HIGHEST VALUED COMPANIES ON THE 2024 AI 50

Among this year’s debut standouts is Mistral AI, a year-old startup which soared to a $2 billion valuation, per PitchBook, with its play to build a European rival to OpenAI. Named for a wind which blows from the French Alps into the Mediterranean that portends good weather, the Paris-based startup’s founders hail from AI research labs at Google and Meta. Mistral is just one of eight companies on the AI 50 that are headquartered in Europe…

Read summary via Forbes Australia.

Official report via Forbes.

Adobe adds Sora to Premiere Pro (16/Apr/2024)

Adobe, which released a demonstration of Sora being used to generate video in Premiere Pro, described the demonstration as an "experiment" and gave no timeline for when it would become available.

Policy

Regulation: Stanford AI Index Report 2024 (Apr/2024)

It’s not big news (you already know about this), but AI regulation has exploded in a big way. One of the key takeaways from the 2024 Stanford AI Index Report was just how onerous new AI guidelines are:

The number of AI regulations in the United States sharply increased. The number of AI-related regulations in the U.S. has risen significantly in the past year and over the last five years. In 2023, there were 25 AI-related regulations, up from just one in 2016. Last year alone, the total number of AI-related regulations grew by 56.3%.

Download the report (PDF, 502 pages)

I come back to the Cato principles as addressed in The Memo edition 13/Nov/2023:

A thorough analysis of existing applicable regulations with consideration of both regulation and deregulation: Evaluating how current regulations apply to AI and balancing regulation with deregulation.
Prevent a patchwork, preemption of state and local laws: Advocating for a unified federal framework to avoid a patchwork of various state and local AI regulations.
Education over regulation: Improved AI and media literacy: Prioritizing public education about AI and media literacy over imposing regulations.
Consider the government’s position towards AI and protection of civil liberties: Scrutinizing the government's stance on AI to ensure the protection of civil liberties.

GPT-4 put this into very simple language for me:

Check current rules for AI: Look at the rules we already have and decide if we need more or less of them for AI.
Avoid mixed rules, have one big rule: Instead of having different AI rules in different places, have one big set of rules for everyone.
Teach people about AI, don't just make rules: Help people learn about AI and how media works, instead of just making lots of rules.
Think about how the government views AI and keeps freedoms safe: Make sure the government's use of AI doesn't take away people's rights and freedoms.

The Memo by LifeArchitect.ai