…or why Apple TV+'s hit show Severance matters for the future of AI.
With DeepSeek-R1 having recently shaken the AI community, much attention and many resources are now focused on reasoning models trained through Reinforcement Learning. (See, for example, this trend analysis by the Databricks CEO.)
Reinforcement Learning is central to achieving the impressive reasoning capabilities demonstrated by DeepSeek-R1 and OpenAI's o1. In fact, RL-based training is almost essential when creating systems that perform multiple actions sequentially—what we can call a "reasoning loop"—potentially involving external tool usage like code execution or information retrieval. (Something I explored in this post back in March 2023.)
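To make the "reasoning loop" concrete, here is a minimal sketch in Python. The `generate_step` and `run_tool` callables are hypothetical stand-ins for the LLM call and the external tools, not any particular framework's API:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Step:
    """One model output: either a tool request or a final answer."""
    text: str
    tool_name: Optional[str] = None   # None means `text` is the final answer
    tool_args: str = ""

def reasoning_loop(
    question: str,
    generate_step: Callable[[str], Step],  # wraps the LLM call (hypothetical)
    run_tool: Callable[[str, str], str],   # e.g., code execution or retrieval (hypothetical)
    max_steps: int = 10,
) -> str:
    """Run the model in a loop, feeding tool observations back into the context."""
    context = question
    for _ in range(max_steps):
        step = generate_step(context)
        if step.tool_name is None:         # the model produced its final answer
            return step.text
        observation = run_tool(step.tool_name, step.tool_args)
        context += f"\n{step.text}\nOBSERVATION: {observation}"
    return "no final answer within the step budget"
```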
Why RL? Mainly because:
- We don’t necessarily know the ideal intermediate steps to reach good answers—or, if we do, annotating those steps can be prohibitively expensive.
- These reasoning loops are generally not differentiable end-to-end, at least when external tools are involved.
RL sidesteps these challenges by simply requiring verification or grading of final answers. Hence, the growing focus on "verifiable domains" like math and coding to push LLM capabilities.
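As a toy illustration of what makes such domains "verifiable": the reward can be computed automatically from the final answer alone. This is a minimal sketch, not how DeepSeek-R1 or o1 actually shape their rewards; the `solve` entry point is an arbitrary convention assumed here.

```python
def math_reward(model_answer: str, reference_answer: str) -> float:
    """Binary reward: 1.0 if the final answer matches the reference exactly, else 0.0."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def code_reward(candidate_code: str, test_cases: list[tuple[str, str]]) -> float:
    """Fraction of test cases passed by the generated code.

    Assumes the candidate defines a function called `solve` (a convention for this
    sketch only); in practice, untrusted code should of course run in a sandbox.
    """
    passed = 0
    for test_input, expected in test_cases:
        try:
            namespace: dict = {}
            exec(candidate_code, namespace)
            if str(namespace["solve"](test_input)) == expected:
                passed += 1
        except Exception:
            pass
    return passed / max(len(test_cases), 1)
```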
At the same time, we have Thomas Wolf’s recent hot take and commentary on Dario Amodei's "Machines of Loving Grace", in which he eloquently argues that scaling up "top-percent" students will not magically give us Einsteins, and expresses skepticism about current training regimes' ability to produce genuine novelty.
Thus, we face a puzzle: RL-based reasoning thrives in clearly verifiable domains, yet true novelty by definition seems hard to verify using current (already-known) information. Is there a way around this?
In my view, there's only one realistic (if perhaps obvious) solution: Forecasting. Or, more technically speaking, strict "causal masking" of knowledge during training. There could be enormous potential in training models explicitly designed to predict genuinely new information—forcing them to rely on reasoning and extrapolation rather than retrieval and interpolation. If we can solve causal masking effectively, "future" knowledge at any cutoff point in time essentially becomes a verifiable domain.
However, causal masking is tricky. You effectively only get one shot at backpropagating learnings from a new event because once information is encoded in the model's weights, the model can retrieve it associatively rather than through genuine reasoning. Moreover, training data must be meticulously timestamped—any non-timestamped source is risky because it might leak "future" knowledge into the training process.
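A minimal sketch of the bookkeeping this implies, assuming each training document carries a trusted `timestamp` field (an illustrative schema): anything dated at or after the cutoff, or not dated at all, gets dropped.

```python
from datetime import date
from typing import Iterable, Iterator

def causally_masked(docs: Iterable[dict], cutoff: date) -> Iterator[dict]:
    """Yield only documents verifiably written before `cutoff`.

    Documents without a trusted timestamp are dropped entirely, since they might
    leak "future" knowledge into training. The `timestamp` field is illustrative.
    """
    for doc in docs:
        ts = doc.get("timestamp")          # assumed to be a datetime.date or None
        if ts is not None and ts < cutoff:
            yield doc
```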
I don't intend to outline a detailed solution here, but some potential approaches come to mind:
Pre-Train on All Data Chronologically
This is the naive starting point, but it does not make sense in practice. After the first mention of a new event, all subsequent related data becomes effectively unusable. Moreover, this approach would produce huge distribution shifts over time and high intra-batch correlation—both undesirable effects. And given that our goal is to foster reasoning through RL, chronological pre-training alone would spend significant effort on a relatively sparse signal of high-quality novelty and forecasting in the “next token”.
Chronological RL Post-Training, Possibly with a Dedicated Forecasting Module
Instead, one could perform standard pre-training but impose an intentionally early cutoff date (e.g., 2020). Subsequently, we could generate (semi-automatically, using state-of-the-art LLMs) large corpora of questions and verifiable answers that cannot simply be retrieved from the pre-training data. Again, there's the challenge that training on problems from a given time bin t inherently risks embedding information about events up to that time. Even if we carefully control what information is in the question and answer, once we back-propagate gradients through the correct vs. the many incorrect reasoning traces, we will embed information about the future into the model weights. Still, this could be mitigated by grouping problems into clearly defined time bins (e.g., one per day) and training sequentially.
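A rough sketch of such a binned schedule (the `rl_update` callable is a placeholder for whatever policy-gradient update against a verifier one would actually use; field names are illustrative):

```python
from collections import defaultdict
from datetime import date
from typing import Callable

def binned_rl_training(
    problems: list[dict],                     # each: {"question", "answer", "event_date": date}
    rl_update: Callable[[list[dict]], None],  # one RL step against a verifier (placeholder)
) -> None:
    """Group problems by the day of the event they ask about and train strictly in
    chronological order, so that while reasoning about a bin-t event the weights can
    only have absorbed gradients (and hence information) from bins before t."""
    bins: dict[date, list[dict]] = defaultdict(list)
    for problem in problems:
        bins[problem["event_date"]].append(problem)
    for day in sorted(bins):                  # one bin per day, processed in time order
        rl_update(bins[day])
```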
Two potential issues arise:
- When training on problems from time bin t, information from bin t-1 may bias reasoning towards recent events, potentially limiting the model's long-term forecasting capabilities. However, many real-world events typically have limited precursors from just days or weeks prior, likely counteracting this issue.
- We miss out on next-token pre-training on recent data, making new information less easily retrievable through the model's "regular associative memory." One potential remedy is alternating RL training with continued next-token prediction training, ensuring recent information remains accessible. If this interferes with the forecasting-focused RL, it might help to designate a subset of model parameters specifically updated only via RL phases (after the initial pre-training).
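The "RL-only parameter subset" idea could be sketched with ordinary optimizer parameter groups, e.g. in PyTorch. Here `forecast_head` is a hypothetical module name, and whether this should instead be an adapter-style module is an open design choice:

```python
import torch

def make_optimizers(model: torch.nn.Module, rl_only_prefix: str = "forecast_head"):
    """Split parameters so that RL phases update only a designated subset, while
    continued next-token training updates the rest. `forecast_head` is a hypothetical
    module name used purely for illustration."""
    rl_params = [p for n, p in model.named_parameters() if n.startswith(rl_only_prefix)]
    base_params = [p for n, p in model.named_parameters() if not n.startswith(rl_only_prefix)]
    rl_optimizer = torch.optim.AdamW(rl_params, lr=1e-5)    # step this during RL phases
    lm_optimizer = torch.optim.AdamW(base_params, lr=1e-4)  # step this during next-token phases
    return rl_optimizer, lm_optimizer
```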
Precision Amnesia
The core objective of RL forecasting training would be to compel the model to reason toward the answer rather than simply retrieve it from memory. Ideally, we would inject targeted "precision amnesia," selectively blocking episodic recall of information learned after the target event—either broadly, or at least enough to prevent direct retrieval of the answer about the event itself.
Interestingly, this is reminiscent of the premise of the show Severance, where a brain implant allows branching off a new consciousness that retains skills and semantic knowledge but lacks episodic memories (highly recommended viewing, by the way!). Perhaps techniques from mechanistic interpretability can help achieve something similar for LLMs?
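Purely as a toy illustration of the kind of intervention mechanistic interpretability might offer, one could imagine projecting a hypothesized "event direction" out of the model's hidden states at inference time, assuming such a direction could even be isolated for a specific piece of knowledge:

```python
import torch

def ablate_direction(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of the hidden states along a direction hypothesized to
    carry memory of the target event, as a crude stand-in for "precision amnesia".

    hidden: [..., d_model] activations; direction: [d_model] vector.
    """
    direction = direction / direction.norm()
    return hidden - (hidden @ direction).unsqueeze(-1) * direction
```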
Closing Thoughts
I'm not claiming this approach will take us all the way to reliable conversion of compute into scientific breakthroughs (echoing Thomas Wolf’s ultimate ambition). However, given the astounding achievements reached through simple next-token prediction, I'm excited to see what scaled-up next-event prediction might bring.