A Bicycle for the (LLM’s) Mind

The recent release of native plugins for ChatGPT is a big deal. It is another great example of OpenAI’s strategy: ship fast and let the masses of excited developers and entrepreneurs out there test the limits and accelerate the development of its services.

The current use of plugins leverages the fact that natural language - once mastered - is an extremely powerful domain to operate in. You can instruct the model to adapt to any interface required to carry out a given task. That this works is thanks to - and highlights the value of - OpenAI’s successful tuning of the base model (trained through “simple” next-token prediction) via RLHF; getting the model to do what you want is just so intuitive. ChatGPT’s general-purpose well-behavedness is instrumental to making plugins work without any real fine-tuning of the system, beyond some amount of prompt engineering. Superior UX is not just great for user growth; it is a key enabler that makes the system as a whole more capable.

That said, I expect that the next big leap for GPT-“X” and its kind will come when a foundational suite of these plugins is fully integrated into the training of the system.

Intrinsic limitations of the current systems

The not-so-controversial hypothesis I’m pursuing here is that the architecture (Transformer-based, with attention over a fixed context) and training procedure/objective (next-token prediction) are sub-optimal for a lot of tasks. Presumably, a huge amount of computation and a large fraction of parameters are “wasted” on the retrieval of simple facts. Furthermore, it is easy to argue that the “single-shot” auto-regressive flow of high-dimensional numerical representations through stacked affine transformations - no matter how clever the inductive biases - is far from optimal for carrying out general computation or solving symbolic math or logic problems. A key limitation is that the model cannot perform an adaptive amount of reasoning before outputting a distribution over next-token probabilities. Yet, all these capabilities might be needed to accurately predict the next word in a sentence.

That last part is worth pausing on for a while; to me, it is somehow the essence of why these models can become so capable. To accurately predict the next token, many of the processes we vaguely refer to as intelligent behavior are required. This at least gives the LLM approach a theoretical chance to develop some level of intelligence. And even if this is conceptually rather straightforward in hindsight, it is still mind-blowing to me just how strong the capabilities that have emerged from this simple objective are.

Intuitive/associative intelligence

What these systems are well tailored for, on the other hand, is modeling the structure of language. Obviously. Much criticism has been along the lines of “these models are advanced parrots that model the structure of language statistically but have no real understanding of what they generate”. I see no point in arguing about what this “real understanding” entails; half an hour with ChatGPT makes it obvious that there is “sufficient understanding” to display impressive capabilities.

Instead of the limiting “models the structure of language”, I would go so far as to say that they model intuitive or associative intelligence remarkably well - similar to how we recall things by association and can intuitively fill in a gap or predict what someone is going to say next.

But just as we can’t solve a complex equation, analyze competitors, or recall every single fact purely in our heads, LLMs need on-demand access to specific, detailed information - and an iterative reasoning and exploration process with access to optimized tools for numerical computation and symbolic representation - to solve more complex tasks. This is where the plugins come in: a bicycle for the LLM’s mind.

Primary context & output vs. “the scratchpad”

If you have played around with ChatGPT, you have probably seen the benefits of the “lay out your thinking step by step” strategy in prompt engineering. By asking the system to first break down the solution and then give the actual answer, we are more likely to reach the correct conclusion. This is visible e.g. in OpenAI’s open-source evaluation framework; quoting from the repo:

"cot_classify" ("chain-of-thought then classify", i.e., reason then answer) expects that the parsable portion of the response (i.e., the portion containing the choice) will be at the end of the completion. We recommend this as the default as it typically provides most accurate model-graded evaluations.

It shouldn’t be that surprising that the model can display these “reasoning” capabilities. After all, there are lots of samples of this in the training data - the most obvious example being educational material - but the important takeaway is that this process is more or less required to solve many problems.

So, to really leverage the signal from the training data, the model should be allowed to “stop and reason” before outputting the next token, even when this is not part of the training sample. We need to integrate the “reasoning scratchpad” into the main operating loop of the system but decouple it from the primary context and output of the model.

The most obvious implementation of this is to, quite literally, let the model generate text on a virtual scratchpad (explored in e.g. this Google research paper), i.e. a separate sequence of tokens, which it then attends to alongside the primary token sequence. This is also where the additional offloading to specialized systems, the plugins, would happen.
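
As a toy illustration of that decoupling, consider a generation loop where the model can write tokens to either a hidden scratchpad stream or the user-visible output stream while attending to both; the stream tags and the `next_token` interface are assumptions of this sketch, and a real system would back `next_token` with an actual LLM:

```python
from typing import Callable, List, Tuple

def generate_with_scratchpad(
    prompt: str,
    next_token: Callable[[str], Tuple[str, str]],  # context -> (stream, token)
    max_steps: int = 50,
) -> str:
    scratchpad: List[str] = []
    output: List[str] = []
    for _ in range(max_steps):
        # The model attends to everything: prompt, scratchpad, and output so far...
        context = (
            f"{prompt}\n[scratchpad] {' '.join(scratchpad)}\n[output] {' '.join(output)}"
        )
        stream, token = next_token(context)
        if token == "<eos>":
            break
        (scratchpad if stream == "scratchpad" else output).append(token)
    # ...but only the primary output stream is returned; the scratchpad is dropped.
    return " ".join(output)

# Toy run with a scripted "model": think twice on the scratchpad, then answer.
script = iter([("scratchpad", "2+2"), ("scratchpad", "=4"),
               ("output", "4"), ("output", "<eos>")])
print(generate_with_scratchpad("What is 2+2?", lambda ctx: next(script)))  # -> "4"
```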

A challenge here is that such a setup isn’t end-to-end differentiable and must therefore be trained with some form of Reinforcement Learning. The process is dynamic and expensive to run compared to the highly optimized and “embarrassingly parallel” computation required to “predict the next token and back-propagate gradients” through the LLM’s static computational training graph. It therefore benefits from separate stages of training, the same way that RLHF fine-tuning is applied on top of base models.

Reasoning = “Iterative search guided by associative intelligence”?

It seems meaningful to consider the scratchpad and plugin usage as a process of search. Connected to the previous claim that this is expensive compared to regular training and inference, the model needs to develop a policy for when to leverage the scratchpad and external tools, for deciding when it is “ready” (or “gives up”), and for balancing exploration vs. exploitation in some form of greedy tree search for solutions. It seems important to build “resource consumption” into the reward function. To this end, I expect a form of deep-learning-guided Monte Carlo tree search, as in AlphaGo/AlphaZero, to become a key component in the next generation of more capable systems.
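
A compact sketch of what value-guided search over “lines of thought” could look like, with a resource penalty baked into the score; `propose_steps` and `value` stand in for the LLM policy and the learned critic, and everything here is illustrative rather than any known system’s implementation:

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    state: str                          # the reasoning trace so far
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    total_value: float = 0.0

def ucb(node: Node, c: float = 1.4) -> float:
    # Standard UCB1: balance exploitation (mean value) and exploration.
    if node.visits == 0:
        return float("inf")
    exploit = node.total_value / node.visits
    explore = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return exploit + explore

def search(root_state, propose_steps, value, budget=100, step_cost=0.01):
    root = Node(root_state)
    for _ in range(budget):
        node = root
        # Selection: walk down by UCB until reaching a leaf.
        while node.children:
            node = max(node.children, key=ucb)
        # Expansion: ask the policy for candidate next reasoning steps.
        for step in propose_steps(node.state):
            node.children.append(Node(node.state + "\n" + step, parent=node))
        # Evaluation: score the leaf with the critic, minus a resource penalty
        # proportional to the length of the trace ("resource consumption").
        v = value(node.state) - step_cost * node.state.count("\n")
        # Backpropagation.
        while node is not None:
            node.visits += 1
            node.total_value += v
            node = node.parent
    return max(root.children, key=lambda n: n.visits).state if root.children else root.state
```

In a trained system, the policy proposing steps and the value network scoring traces would both be learned; the `step_cost` term is a crude way of pricing scratchpad and plugin usage into the reward.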

A foundational suite of plugins

So which plugins are needed? Some obvious examples, most already mentioned, are information retrieval, coding, math and logic, physics engines, etc. One hypothesis is that the execution of a general-purpose programming language could be the only “top-level” plugin needed, as the other needs could then be supported as libraries and frameworks inside it. One could argue that a general-purpose programming language is to structured computation what natural language is to intuitive and associative intelligence. It can encapsulate math and logic as well as information retrieval. For example, rather than a custom module for information retrieval, one could execute a few lines of Python code that run a web search, apply arbitrary hard filtering and preprocessing to the content, and output only the subset that is relevant to spend “expensive” attention on. There is also a lot (!) of public data for solving problems with code, and this is an area where LLMs already show very promising results.

A counter-argument to having code as an intermediate layer, especially for information retrieval, is that it limits the possibility of fully leveraging the internal representations of the LLM. It might make sense to let the generative model trigger queries against an external corpus of information directly with vector search, based on its internal representations, rather than going through natural language and/or code as an intermediate step.
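
To make the code-as-top-level-plugin idea concrete, here is a toy sketch of the retrieval example above; `web_search`, the generated snippet, and the namespace scheme are all illustrative stand-ins, not a real API:

```python
# The model emits a short Python snippet; the runtime executes it in a
# namespace that exposes a search helper, and only the snippet's filtered
# result is fed back into expensive LLM context.

def web_search(query: str) -> list:
    # Stub: a real implementation would hit an actual search backend.
    return [
        "LLM plugins announced - blog post",
        "Unrelated cooking recipe",
        "Paper on tool use in language models",
    ]

def run_plugin_code(code: str) -> str:
    # Execute model-generated code; a real system needs proper sandboxing.
    namespace = {"web_search": web_search}
    exec(code, namespace)
    return str(namespace.get("result", ""))

model_generated = """
hits = web_search("LLM tool use")
# Hard filtering in cheap Python before spending expensive attention:
result = [h for h in hits if "recipe" not in h.lower()]
"""
print(run_plugin_code(model_generated))
```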

A key aspect to consider is that efficient execution of the plugins is critical when training these systems end-to-end with RL. One probably wants startup and module-loading times several orders of magnitude faster than the naive approach to executing a chunk of Python code, for example. I wouldn’t be surprised if a lot of research is currently going into this, as well as into, for example, “simulating web search” to avoid the latencies and costs of real web search. There should be a lot of interesting things that can be done here with clever heuristics that leverage the specifics of how deep neural networks are trained in an RL setting.
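
One crude way to picture the “simulated web search” idea: serve tool calls from a frozen local snapshot so RL rollouts are fast, cheap, and reproducible. The snapshot format and class below are my own assumptions:

```python
class SimulatedSearch:
    def __init__(self, snapshot: dict):
        # {query: [results, ...]}; in practice loaded from a crawled/cached corpus.
        self.snapshot = snapshot

    def __call__(self, query: str) -> list:
        # Deterministic replay: identical rollouts see identical tool outputs,
        # which also simplifies credit assignment during training.
        return self.snapshot.get(query, [])

search = SimulatedSearch({"LLM tool use": ["cached result 1", "cached result 2"]})
print(search("LLM tool use"))
```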

A recipe for more capable general AI?

In summary, my hypothesis is that the key components for significantly more general and capable AIs than today’s plain LLMs look something like this (a rough sketch in code follows the list):

  1. A generative, acting component that outputs natural text to some context (the “scratchpad”) and can
  2. trigger actions via “plugins”, whose outputs are added to the context.
  3. Monte Carlo (like) tree search for evaluating alternative solution paths/lines of thought.
  4. A critic component (“value network”) that prioritizes branches of the reasoning search tree and decides when to stop, upon reaching high enough confidence in the solution or a dead end.
  5. RL-based training of all parts of the system end-to-end.
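
In the same hand-waving spirit, a rough sketch of the outer loop; all interfaces are hypothetical, and components 3 and 5 - the tree search and end-to-end RL training - would wrap around this inner episode loop (e.g. the search sketched earlier):

```python
def solve(task, policy, critic, plugins, max_steps=20, confidence=0.9):
    context = [task]                        # task + scratchpad + plugin outputs
    for _ in range(max_steps):
        action = policy(context)            # component 1: generate the next move
        if action["type"] == "plugin":      # component 2: trigger a tool
            output = plugins[action["name"]](action["args"])
            context.append(f"[{action['name']}] {output}")
        else:
            context.append(action["text"])  # write to the scratchpad
        score = critic(context)             # component 4: value estimate
        if score >= confidence:             # stop on high enough confidence
            return context[-1]
    return None                             # "gives up" within the step budget

# Toy run: a scripted policy that "reasons", calls a calculator plugin, then
# answers; the scripted critic becomes confident once an answer is stated.
steps = iter([
    {"type": "text", "text": "Need to compute 6*7."},
    {"type": "plugin", "name": "python", "args": "6*7"},
    {"type": "text", "text": "The answer is 42."},
])
policy = lambda ctx: next(steps)
critic = lambda ctx: 1.0 if "answer" in ctx[-1] else 0.0
plugins = {"python": lambda code: str(eval(code))}
print(solve("What is 6*7?", policy, critic, plugins))  # -> "The answer is 42."
```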

The current state-of-the-art LLMs (including fine-tuning via RLHF) are natural building blocks to use for components 1 and 4. Hopefully, the architecture of the outer loop, providing query-based access to context and execution of computation, would allow the LLM component itself to be significantly more lightweight.

A large part of this idea is already incorporated in the “Auto-GPT” movement; there is a surge of task-centric autonomous agent frameworks based on LLMs, such as BabyAGI, AgentGPT, and LangChain. A big difference, however, is that these systems do not support training the involved components end-to-end, which seems critical for reaching the next step in general capabilities.

It is worth pointing out that the overall idea is pretty obvious and straightforward in the hand-waving manner in which it is presented here. Actually making it work and perform is a tremendous engineering and research quest. I believe some key challenges are:

  • Scalable and parallelizable implementation of the full training loop, including interfaces and execution of the “plugins”.
  • Effective design of the reward function and propagation of the reward signal (credit assignment), combined with
  • collection and curation of relevant datasets/tasks that provide a high enough signal-to-noise ratio.

For the last two points, I think that the current capabilities of LLMs are already extremely helpful, and that we could be moving closer to some soft form of human-in-the-loop singularity, where the speed of progress continues to accelerate rapidly. The first point might therefore be the limiting factor for quite some time. Likely, a large number of “clever tricks”, combined with efficient implementation, will be required for the system to reach the threshold where these “next-level capabilities” start to emerge.

Speaking of singularity: even with all of this assembled in a highly optimized manner, I believe it is yet another huge step to reach true autonomous self-improvement (as in curating its own training datasets and improving, exploring, and evaluating alternative implementations of itself at a foundational level). That said, given how much I have recently had to update my guesstimates of how far we might get how soon, I can only expect to be surprised again.



Thanks for dropping by. Don't hesitate to reach out; I'd be happy to connect!