FOR IMMEDIATE RELEASE: 16/Dec/2023
Welcome back to The Memo.
You’re joining full subscribers from Vanguard, VMware, Vanenburg, Verizon, Vonage, and more.
I think this is the longest edition yet (3,700+ words, or 7½ printed pages). Let’s get started…
The BIG Stuff
DeepMind: LLMs can now produce new maths discoveries and solve real-world problems (14/Dec/2023)
DeepMind’s head of AI for science, quoted in The Guardian and MIT Technology Review (14/Dec/2023): ‘this is the first time that a genuine, new scientific discovery has been made by a large language model… It’s not in the training data—it wasn’t even known.’
From DeepMind’s announcement: ‘…the first time a new discovery has been made for challenging open problems in science or mathematics using LLMs. FunSearch discovered new solutions… its solutions could potentially be slotted into a variety of real-world industrial systems to bring swift benefits… the power of these models [tested with Codey PaLM 2 340B] can be harnessed not only to produce new mathematical discoveries, but also to reveal potentially impactful solutions to important real-world problems.’
Yes, this moved the AGI needle from 61% → 64%.
Sidenote: In Feb/2007, fellow Aussie Prof Terry Tao called the cap set question his ‘favorite open question’. In Jun/2023 Terry also said that LLMs would take another three years to reach this level of progress (‘2026-level AI… will be a trustworthy co-author in mathematical research’). Read more about exponential growth (wiki).
Read the paper: https://doi.org/10.1038/s41586-023-06924-6
Gemini (6/Dec/2023)
We pushed out a special edition of The Memo to all 10,000+ readers for the release of Gemini; it was in your inbox within the first few hours of the announcement.
It usually takes researchers several years to discover the capabilities of new models, and we’re still discovering cool new things about GPT-2 (2019) and GPT-3 (2020). Google DeepMind even flagged this phenomenon in the report: ‘Gemini can enable new approaches in areas like education, everyday problem solving, multilingual communication, information summarization, extraction, and creativity. We expect that the users of these models will find all kinds of beneficial new uses that we have only scratched the surface of in our own investigations.’
Here’s why I think Gemini is so important for AI models and the world.
More multimodal.
Inputs: text, image, audio, video.
Outputs: text, image.
Multilinguality. Trained on many languages.
Impressive benchmark performance, beating GPT-4 1760B across 30+ metrics.
Model sizes including on-device options (ready for phone, assistant, and humanoid).
I finished my Gemini annotated paper very recently, and it is now available to paid subscribers.
Microsoft argued (12/Dec/2023) that Google’s prompting and ‘best of 32’ sampling are the reason the Gemini benchmark scores are so good, especially on the MMLU benchmark where Gemini outperformed GPT-4. Microsoft has now used new prompting to achieve an even higher score for GPT-4 on MMLU.
MMLU results
90.10%: GPT-4 (Microsoft’s new testing)
90.04%: Gemini Ultra (Google’s testing)
89.8%: Human expert baseline (ASI)
86.4%: GPT-4 (OpenAI’s initial testing)
…
34.5%: Human average baseline (AGI)
All of this is completely moot given that the MMLU contains a lot of errors, so arguments about this level of overprecision (thanks to GPT-4 for finding that word for me) are… misguided at best. If you’d like to read more about this—including examples of where the MMLU rubric is just plain wrong—I can recommend ‘Errors in the MMLU: The Deep Learning Benchmark is Wrong Surprisingly Often’ by Daniel Erenrich (23/Aug/2023): https://archive.md/8lMxY
Read more about Microsoft’s new testing of GPT-4 on MMLU.
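Neither lab has published its exact prompting code, but the core idea behind both approaches—sample many answers to the same question, then take the consensus—can be sketched as follows. This is a toy illustration only; the function name and example answers are mine, not Google’s or Microsoft’s:

```python
from collections import Counter

def majority_vote(samples):
    """Return the most common answer among N sampled model completions."""
    counts = Counter(samples)
    answer, _ = counts.most_common(1)[0]
    return answer

# Hypothetical example: 5 sampled answers to one multiple-choice question
samples = ["B", "B", "C", "B", "A"]
print(majority_vote(samples))  # B
```

The real pipelines are more elaborate (chain-of-thought reasoning before each answer, choice shuffling, and so on), but the voting step is why prompting strategy alone can move a benchmark score by several points.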
New models by Mistral (11/Dec/2023)
Mistral released two ground-breaking models.
Mistral Small, also known as Mixtral, also known as mixtral-8x7b-32kseqlen. Mistral says ‘Concretely, Mixtral has 45B total parameters but only uses 12B parameters per token. It, therefore, processes input and generates output at the same speed and for the same cost as a 12B model.’
Mistral Medium, a new ‘prototype model’ that I’ve estimated at 180B parameters. It outperforms ChatGPT 20B and Llama 2 70B. MMLU=75.3% (GPT-3.5-turbo 20B=70%, Llama 2 70B=68.9%).
Read more about Mistral Small: https://mistral.ai/news/mixtral-of-experts/
Read more about Mistral Medium: https://mistral.ai/news/la-plateforme/
The best place for inexpensive inference of Mistral’s models is actually via competitor Together AI: https://www.together.ai/blog/mixtral
Sidenote: I laughed at this anonymous comment on HN (11/Dec/2023) making fun of the absurd buzzwords and silly model names we’re seeing:
Cheeseface just dropped the Blippy-7B model which is almost as good as the twinamp 34B model on the SwagCube benchmark when run locally as int8 and this shows that the gains made by the skibidi-70B model will probably filter down to the baseline Eras models in the next few weeks.
I hope my reports don’t read like this!
If you’d like to deep dive into ‘mixture of experts’ models, read the new HF walkthrough (Dec/2023): https://huggingface.co/blog/moe
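The HF post covers the details; as a toy illustration of why Mixtral’s per-token cost matches a much smaller dense model, here is a minimal sparse MoE layer that routes each token to only the top k of n expert feed-forward networks. This is a sketch of the general technique, not Mixtral’s actual code; all shapes, names, and the random weights are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

n_experts, d_model, d_ff = 8, 16, 64
# Each "expert" is a tiny two-layer feed-forward network
experts = [
    (rng.standard_normal((d_model, d_ff)), rng.standard_normal((d_ff, d_model)))
    for _ in range(n_experts)
]
router = rng.standard_normal((d_model, n_experts))

def moe_layer(x, top_k=2):
    """Route a single token vector to its top-k experts and mix their outputs."""
    logits = x @ router
    top = np.argsort(logits)[-top_k:]                          # chosen expert indices
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over chosen
    out = np.zeros_like(x)
    for w, i in zip(weights, top):
        w1, w2 = experts[i]
        out += w * (np.maximum(x @ w1, 0) @ w2)  # ReLU FFN; only k of n experts run
    return out

token = rng.standard_normal(d_model)
y = moe_layer(token)
print(y.shape)  # (16,)
```

All 8 experts’ weights sit in memory (the ‘total parameters’), but each token only multiplies through 2 of them (the ‘active parameters’)—hence 45B stored but ~12B used per token.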
If you’d like to understand how Transformers and large language models work, read the Financial Times walkthrough (Sep/2023): https://ig.ft.com/generative-ai/
Google Imagen 2 - the cutting edge of AI-generated art (13/Dec/2023)
Imagen 2 is Google DeepMind’s latest text-to-image diffusion model, capable of creating photorealistic images from textual prompts, designed for use by developers and featured in Google Arts and Culture experiments.
It does text well, works in multiple languages, is watermarked with SynthID, and seems to be the bleeding edge in text-to-image right now.
Read more: https://deepmind.google/technologies/imagen-2/
Available on Vertex AI; sign up to become a ‘trusted tester’: https://cloud.google.com/blog/products/ai-machine-learning/imagen-2-on-vertex-ai-is-now-generally-available
Optimus Gen 2 (13/Dec/2023)
‘Everything in this video is real, no CGI. All real time, nothing sped up. Incredible hardware improvements from the team.’
Read the tweet: https://twitter.com/julianibarz/status/1734759309077344737
Meet Ashley, the world’s first AI-powered political campaign caller (12/Dec/2023)
I am putting this in the ‘big stuff’ pile, because it is huge. On the surface, it looks like 20+ LLMs, voice models, and other AI models stitched together. But look a little closer. This is a real-life illustration of the explosion we’ve been expecting, with very tangible effects and outcomes.
Ashley is introduced as the first artificial intelligence system designed to engage with voters for political campaigns.
…she is the first political phone banker powered by generative AI technology similar to OpenAI's ChatGPT. She is capable of having an infinite number of customized one-on-one conversations at the same time.
…Over the weekend, Ashley called thousands of Pennsylvania voters on behalf of Daniels. Like a seasoned campaign volunteer, Ashley analyzes voters' profiles to tailor conversations around their key issues. Unlike a human, Ashley always shows up for the job, has perfect recall of all of Daniels' positions, and does not feel dejected when she's hung up on.
"This is going to scale fast," said 30-year-old Ilya Mouzykantskii, the London-based CEO of Civox, the company behind Ashley. "We intend to be making tens of thousands of calls a day by the end of the year and into the six digits pretty soon. This is coming for the 2024 election and it's coming in a very big way. ... The future is now."
Mouzykantskii and his co-founder Adam Reis, former computer science students at Stanford and Columbia Universities respectively, declined to disclose the exact generative AI models they are using. They will only say they use over 20 different AI models, some proprietary and some open source.
[Alan’s guess:
LLM: Meta AI Llama 2 derivative
LLM backup: OpenAI gpt-3.5-turbo (ChatGPT)
Document search: OpenAI text-embedding-ada-002 for context and profiling
Voice out: OpenAI TTS or Azure TTS
RAG: other (web search) for context and profiling
Voice in: OpenAI Whisper
Translation: Meta AI CoVoST (if needed)
Classifier: Meta AI FastText or similar to identify call sentiment
LLM: Mistral 7B for call summary
That’s only 9 models… And somehow they found uses for at least 12 more models to get to at least 21 total. For a phone call…]
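The guessed stack above would chain together roughly like this in a single conversational turn. This is an entirely hypothetical sketch; every function below is a stand-in for one of the guessed models, and none of it is Civox’s actual code:

```python
# Hypothetical pipeline for one turn of an AI phone banker.
# Each function stands in for a model from the guessed stack above.

def transcribe(audio):                 # voice in (Whisper-class speech-to-text)
    return f"<text of {audio}>"

def retrieve_context(text, profile):   # embeddings + RAG for voter profiling
    return [f"position relevant to {profile['issue']}"]

def generate_reply(text, context, history):  # main LLM (Llama-2-class)
    return f"reply using {len(context)} retrieved facts"

def classify_sentiment(text):          # lightweight classifier (FastText-class)
    return "neutral"

def synthesize(text):                  # voice out (TTS)
    return f"<audio of {text}>"

def handle_turn(audio_in, profile, history):
    """One caller turn: listen, look up context, respond, log, speak."""
    text = transcribe(audio_in)
    context = retrieve_context(text, profile)
    reply = generate_reply(text, context, history)
    history.append((text, reply, classify_sentiment(text)))
    return synthesize(reply)

audio_out = handle_turn("caller.wav", {"issue": "education"}, [])
print(audio_out)
```

Run this loop once per caller turn, in parallel across thousands of simultaneous calls, and you get the ‘infinite number of customized one-on-one conversations’ the article describes.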
Thanks to the latest generative AI technologies, Reis was able to build the product almost entirely on his own, whereas several years ago it would have taken a team of 50 engineers several years to do so, he said.
Read more via Reuters.
The Interesting Stuff
End of year AI report (16/Dec/2023)
I’m very happy with the end of year report, the latest in ‘The sky is’ series, and a warm ‘thank you’ to our technical reviewers. We’re making the report available early to full subscribers of The Memo. I appreciate your continued support of what you’ve told me is the most complete, grounded, and optimistic view of our current AI reality.
Watch out for the video coming soon, and you can be notified about that by clicking some buttons on YouTube.
You are welcome to share this report anywhere you’d like immediately, and it will be officially launched to the public around Christmas 2023.