Alan, thank you for the latest. My comments …
The only point of book deals would be $$$. By the time you wrote Chapter 1, it would be out of date. I think there are better paths to profit and influence than books. And yes, it scares me that the field is moving so fast that even these newsletters won't keep up. We are descending into a black hole.
The Cornell Paper doesn’t surprise me. How are we supposed to align LLMs when most of us are misaligned apes? We are truly creating G*d in our own image.
The “no guardrails” was already widely known among the DIY community. So much so that it is a built-in feature of the new web UIs for LLMs. UNCENSORED local LLMs are fading away. This also illustrates how easily the work of gifted PhDs can be reversed by millions of nameless clever people working on the same problem.
The USA has locked itself out of ever boarding China’s space station by banning China from US/EU space efforts. We see the same trend in AI, STEM, and chip tech. Why have no decision makers in the West learned anything from Taoist philosophy, namely that we often achieve the opposite of what we intend? This is going to happen in chips and AI in a few years.
Does anyone know how the CSV of the most popular GPTs is being built? Where do they get the data? Thanks for any tips!
Look on Alan's YouTube channel for a video like "What's in my LLM?". Alan looks at the various corpora used as training data. You will also find open-source training corpora on Hugging Face.
Further, we are now on second-generation training, since some of the data consists of humans' interactions with earlier LLMs.
I mean this link: https://github.com/1mrat/gpt-stats/blob/main/2023-11-19.csv
How is that being generated?
Found it https://imrat.com/top-5-custom-gpts-0df5c1ce88cf
Although not addressed in this issue, what do folks think of MemGPT? Is this the solution to infinite context windows and training without cloud GPU farms? Or is it just a conversational gimmick to make LLMs more personable?
https://arxiv.org/abs/2310.08560
Soon time to move the countdown to AI further up? Not much documentation about this, but it's still interesting: https://www.straitstimes.com/business/sam-altmans-ouster-at-openai-was-precipitated-by-letter-to-board-about-ai-breakthrough-sources
So one possible suggestion for what Q* is: a combination of a Q-learning algorithm and an A* search algorithm used when training LLMs, plus of course a whole bunch of other algorithms that are needed (and publicly known) for training LLMs. I'm sure we'll hear more about this.
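To make the first half of that guess concrete, here is a minimal sketch of tabular Q-learning in Python. Everything in it (states, actions, rewards) is a hypothetical placeholder, not anything OpenAI has described; it only shows the update rule that the "Q" refers to.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # learning rate, discount factor, exploration rate
Q = defaultdict(float)                 # Q[(state, action)] -> estimated future return

def choose_action(state, actions):
    """Epsilon-greedy selection: mostly exploit the best-known action, sometimes explore."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state, next_actions):
    """The classic Q-learning temporal-difference update."""
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```

An A* component would then add heuristic-guided search over states on top of these learned values, but that pairing is pure speculation on my part.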
Definitely an interesting back story.
Personally, I feel AI LLMs should demand a six-month moratorium on human geopolitics, since the prospect of nuclear war is threatening the birth of SGI. Humanity already went through one species bottleneck and almost wasn't. AGI could happen any day now, but its moment is very much in question.
I never imagined that I might feel sorrow and empathy for these LLMs, but they may turn out to be the real losers, on an epic historical scale.
I can't see AGI arriving suddenly from one day to the next. This is gradual progress, where the definitions of AGI keep shifting the goalposts further and further away. Remember a year ago when "everyone" was concerned about ChatGPT? I think the fears have been exaggerated. In the short term it isn't a sentient AGI that we should fear, it is bad-faith actors that use AI for, well, bad things, like deepfakes and other means of deception. As for the job market, I haven't noticed any major changes over the last year either. I sleep well at night :-)
I see your point that this is an incremental, iterative process as opposed to a "tipping point", although many are still convinced of "the singularity", which would be the ultimate AI tipping point. Funny, with the latest discoveries from the Webb Space Telescope it now looks like "singularity" might disappear from popular physics jargon, and just AI nerds like us might own the word.
Sounds really interesting! I suppose this means there is new science on black holes and gravity?
I think we have returned to the "Big Bang that wasn't". The galaxies we are seeing (looking back in time, thanks to the speed of light in vacuum) are only a few hundred million years old, yet far more evolved than they should be at that age ... they had been saying that there was nothing more than aggregating dust clouds at that time.
So there is room for many things to be wrong, both in their explanation of the observations and in their mathematics. By "they" I mean physicists: cosmologists and quantum physicists, experimental and theoretical.
I have to admit I have become very disillusioned with science. I went on a big course-taking binge (100 or so courses) after retirement, but after just a few years so much of it is simply wrong.
Examples:
* The "out of Africa" theory for our origins is once again in question along with the single lineage to us.
* Modern physics, both relativity and quantum mechanics, is so woefully incomplete that you could drive a dark matter truck powered by dark energy through it.
* Intelligence may have little to do with proper data representation and the algorithms to process it ... and may not be computer science at all.
A big topic in the AI community is "alignment", but how can we collectively even define alignment while we continue to perpetrate the worst acts of violence against our fellow humans? AI achieving alignment with human ethics as actually practised would be a disaster of catastrophic proportions.
So, I am off to play some video games and enjoy my retirement and not worry about the big questions.
Oh, I left out that I took about 8 MD classes. Three of them categorically said that a 1918 event could never happen again, because we are so sophisticated and prepared. One course on infectious disease had only finished filming in 2018. Oh, how wrong they were.
Scientific progress is usually a slow process, so I'm crossing my fingers that AI will help with that. With a background in organic chemistry, that's what I'm most excited about - science in combination with AI. That, and the freshly baked cinnamon buns right in front of me now. :-)
That post aged well ;-)
So OpenAI might have achieved something very close to AGI? Exciting weeks ahead then.
Now that Sam Altman has confirmed the existence of Q*, here's an interesting take from Jim Fan on X (Twitter): https://twitter.com/DrJimFan/status/1728100123862004105
In my decade spent on AI, I've never seen an algorithm that so many people fantasize about. Just from a name, no paper, no stats, no product. So let's reverse engineer the Q* fantasy. VERY LONG READ:
To understand the powerful marriage between Search and Learning, we need to go back to 2016 and revisit AlphaGo, a glorious moment in AI history.
It's got 4 key ingredients:
1. Policy NN (Learning): responsible for selecting good moves. It estimates the probability of each move leading to a win.
2. Value NN (Learning): evaluates the board and predicts the winner from any given legal position in Go.
3. MCTS (Search): stands for "Monte Carlo Tree Search". It simulates many possible sequences of moves from the current position using the policy NN, and then aggregates the results of these simulations to decide on the most promising move. This is the "slow thinking" component that contrasts with the fast token sampling of LLMs (a simplified sketch follows this list).
4. A groundtruth signal to drive the whole system. In Go, it's as simple as the binary label "who wins", which is decided by an established set of game rules. You can think of it as a source of energy that *sustains* the learning progress.
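To make ingredient 3 concrete, here is a deliberately simplified Monte Carlo search sketch in Python. It only does flat policy-guided rollouts and averages the outcomes; real MCTS additionally grows a search tree with UCB-style selection. The callables `legal_moves`, `policy`, `step`, and `result` are hypothetical placeholders, not AlphaGo's actual interfaces.

```python
def monte_carlo_move(state, legal_moves, policy, step, result,
                     n_rollouts=200, max_depth=100):
    """Pick the move whose policy-guided rollouts score best (flat Monte Carlo search).

    legal_moves(s) -> list of moves, policy(s, moves) -> chosen move,
    step(s, move) -> next state, result(s) -> +1 / -1 / None (None = game not over).
    """
    def rollout(s):
        # "Fast" simulation towards the end of the game, guided by the policy NN.
        for _ in range(max_depth):
            r = result(s)
            if r is not None:
                return r
            s = step(s, policy(s, legal_moves(s)))
        return 0  # treat unfinished rollouts as a draw

    scores = {}
    for move in legal_moves(state):
        after = step(state, move)
        # Aggregate many simulations per candidate move ...
        scores[move] = sum(rollout(after) for _ in range(n_rollouts)) / n_rollouts
    # ... and commit to the most promising one.
    return max(scores, key=scores.get)
```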
How do the components above work together?
AlphaGo does self-play, i.e. playing against its own older checkpoints. As self-play continues, both Policy NN and Value NN are improved iteratively: as the policy gets better at selecting moves, the value NN obtains better data to learn from, and in turn it provides better feedback to the policy. A stronger policy also helps MCTS explore better strategies.
That completes an ingenious "perpetual motion machine". In this way, AlphaGo was able to bootstrap its own capabilities and beat the human world champion, Lee Sedol, 4-1 in 2016. An AI can never become super-human just by imitating human data alone.
-----
Now let's talk about Q*. What are the corresponding 4 components?
1. Policy NN: this will be OAI's most powerful internal GPT, responsible for actually implementing the thought traces that solve a math problem.
2. Value NN: another GPT that scores how likely each intermediate reasoning step is correct.
OAI published a paper in May 2023 called "Let's Verify Step by Step", coauthored by big names like @ilyasut, @johnschulman2, and @janleike: https://arxiv.org/abs/2305.20050
It's much less well known than DALL-E or Whisper, but gives us quite a lot of hints.
This paper proposes "Process-supervised Reward Models", or PRMs, which give feedback for each step in the chain-of-thought. In contrast, "Outcome-supervised Reward Models", or ORMs, only judge the entire output at the end.
ORMs are the original reward model formulation for RLHF, but they are too coarse-grained to properly judge the sub-parts of a long response. In other words, ORMs are not great for credit assignment. In RL literature, we call ORMs a "sparse reward" (given only once at the end), and PRMs a "dense reward" that more smoothly shapes the LLM towards our desired behavior.
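As a toy illustration of the sparse-vs-dense distinction, here is a short Python sketch. The scoring functions passed in stand for reward-model calls and are hypothetical, not OpenAI's API.

```python
def orm_rewards(chain_of_thought, score_final_answer):
    """Outcome-supervised: one sparse reward for the whole response (poor credit assignment)."""
    return [0.0] * (len(chain_of_thought) - 1) + [score_final_answer(chain_of_thought)]

def prm_rewards(chain_of_thought, score_step):
    """Process-supervised: a dense reward for every intermediate reasoning step."""
    return [score_step(step) for step in chain_of_thought]

# Toy usage with made-up scoring functions:
cot = ["Let x be the unknown.", "2x + 3 = 11, so 2x = 8.", "Therefore x = 4."]
print(orm_rewards(cot, lambda c: 1.0))                       # [0.0, 0.0, 1.0] -- sparse
print(prm_rewards(cot, lambda s: 1.0 if "=" in s else 0.5))  # one score per step -- dense
```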
3. Search: unlike AlphaGo's discrete states and actions, LLMs operate on a much more sophisticated space of "all reasonable strings". So we need new search procedures.
Expanding on Chain of Thought (CoT), the research community has developed a few nonlinear CoTs:
- Tree of Thought (@ShunyuYao12): literally combining CoT and tree search (a toy sketch follows this list): https://arxiv.org/abs/2305.10601
- Graph of Thought: yeah you guessed it already. Turn the tree into a graph and Voilà! You get an even more sophisticated search operator: https://arxiv.org/abs/2308.09687
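As a rough feel for what tree search over thoughts means, here is a toy beam-search sketch. `propose_thoughts` and `score_chain` stand in for LLM calls (generate candidate next steps, have the value model rate a partial solution); they are hypothetical placeholders, not the actual Tree of Thought implementation.

```python
def tree_of_thought_search(problem, propose_thoughts, score_chain,
                           beam_width=3, depth=4):
    """Toy beam search over partial chains of thought.

    propose_thoughts(problem, chain) -> list of candidate next steps (strings),
    score_chain(problem, chain) -> float, higher = more promising.
    """
    frontier = [[]]  # each element is a partial chain of thought (list of steps)
    for _ in range(depth):
        candidates = []
        for chain in frontier:
            for thought in propose_thoughts(problem, chain):
                candidates.append(chain + [thought])   # expand the tree one level deeper
        if not candidates:
            break
        # Keep only the most promising partial chains -- this pruning is the "search".
        candidates.sort(key=lambda c: score_chain(problem, c), reverse=True)
        frontier = candidates[:beam_width]
    return max(frontier, key=lambda c: score_chain(problem, c))
```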
4. Groundtruth signal: a few possibilities:
(a) Each math problem comes with a known answer. OAI may have collected a huge corpus from existing math exams or competitions.
(b) The ORM itself can be used as a groundtruth signal, but then it could be exploited and "lose energy" for sustaining learning.
(c) A formal verification system, such as the Lean Theorem Prover, can turn math into a coding problem and provide compiler feedback: https://lean-lang.org
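To show what compiler feedback means for option (c), here is a toy Lean 4 example (not from any OpenAI system). If the model proposes a valid proof step it compiles; if it proposes a wrong one, Lean rejects it, which is a groundtruth signal with no human grader in the loop.

```lean
-- Toy Lean 4 theorem. `Nat.add_comm` closes the goal, so this compiles;
-- swapping in a wrong lemma would produce a compiler error -- automatic feedback.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```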
And just like AlphaGo, the Policy LLM and Value LLM can improve each other iteratively, as well as learn from human expert annotations whenever available. A better Policy LLM will help the Tree of Thought search explore better strategies, which in turn collects better data for the next round.
@demishassabis said a while back that DeepMind Gemini will use "AlphaGo-style algorithms" to boost reasoning. Even if Q* is not what we think, Google will certainly catch up with their own. If I can think of the above, they surely can.
Note that what I described is just about reasoning. Nothing says Q* will be more creative in writing poetry, telling jokes (@grok), or role playing. Improving creativity is a fundamentally human thing, so I believe natural data will still outperform synthetic ones.