RLMs are the new reasoning models

Reasoning models were the first clear proof that language model capability can scale with test-time compute. Recursive language models (RLMs) ask what the correct abstraction for spending that compute is. The insight behind RLMs is obvious in hindsight: it is the direct marriage of two important axes of model capability, reasoning and tool use. This is more radical than it first sounds. RLMs collapse reasoning and tool use into a single inference abstraction: the model treats its own prompt as an environment it can inspect, slice, and recursively query. Context itself becomes the object of computation.

This post is my attempt to explain why RLMs matter. I define what an RLM actually is, place it in the short history of reasoning and tool use, walk through the ~6 months of empirical results that have quietly turned “RLM” from a benchmark trick into the next reasoning paradigm, flag the honest limitations, and point at a few places to start building.

What is an RLM?

A Recursive Language Model, as introduced by Zhang, Kraska, and Khattab, is an inference paradigm in which a language model treats its input prompt as an environment rather than a fixed string. The root LM is given a REPL in which the prompt is bound to a variable it can inspect, slice, and partition programmatically. When it decides a region is worth a closer look, it issues a recursive subcall, to itself or to another LM, over that slice and incorporates the result. Recursion bottoms out at the base model’s ordinary forward pass.

One consequence is that input size is no longer a hard ceiling on the computation. The paper reports RLMs processing inputs up to two orders of magnitude beyond the underlying model’s context window and outperforming vanilla frontier LLMs and common long-context scaffolds across four long-context tasks. Beyond long-context answering, recent results demonstrate that RLMs are a powerful paradigm for a wide variety of challenging tasks.
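
The slice-and-recurse pattern described above can be sketched in a few lines of Python. This is a rough illustration under heavy assumptions, not the paper's implementation: in a real RLM the root model writes its own slicing code inside the REPL, whereas here a fixed split-by-lines strategy stands in for that decision. The names `rlm`, `ask`, and `toy_lm`, the `window` and `fanout` parameters, and the use of characters instead of tokens are all hypothetical choices made for the example.

```python
import re
from typing import Callable

# Hypothetical stand-in for any text-in, text-out model call.
LM = Callable[[str], str]


def ask(lm: LM, query: str, context: str) -> str:
    """One ordinary forward pass over a context that fits the window."""
    return lm(f"Context:\n{context}\n\nQuestion: {query}")


def rlm(query: str, context: str, lm: LM, window: int = 200, fanout: int = 4) -> str:
    """Answer `query` over `context`, passing at most `window` characters
    of context to any single model call. Assumes fanout >= 2."""
    lines = context.splitlines()
    if len(context) <= window or len(lines) <= 1:
        # Base case: recursion bottoms out at a plain model call.
        return ask(lm, query, context[:window])
    # Partition the context into contiguous slices of whole lines.
    size = -(-len(lines) // fanout)  # ceiling division
    slices = ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]
    # Recursive subcalls: each slice is answered independently.
    partials = [rlm(query, s, lm, window, fanout) for s in slices]
    # Incorporate the results: answer once more over the non-empty partials.
    notes = "\n".join(p for p in partials if p)
    return ask(lm, query, notes[:window])


def toy_lm(prompt: str) -> str:
    """Toy model: reports a code if one is visible in the prompt, else nothing."""
    match = re.search(r"code is (\d+)", prompt)
    return f"The code is {match.group(1)}" if match else ""


if __name__ == "__main__":
    haystack = [f"filler line {i}" for i in range(500)]
    haystack[317] = "the code is 42"
    print(rlm("What is the code?", "\n".join(haystack), toy_lm))
```

The point of the sketch is the shape of the computation: no single call ever sees more than `window` characters of context, yet the answer depends on a line buried in an input many times that size. Truncating the combined notes to `window` is a simplification that keeps the example guaranteed to terminate; a real system would need a better way to merge partial results.
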
Reasoning & Tool Use: A Brief History

Reasoning and tool use are related, but they are not the same thing.

Reasoning is about how well a model can allocate inference-time compute to a problem: break it down, explore alternatives, verify intermediate steps, backtrack, and choose a better answer. Early reasoning gains came from methods like chain-of-thought, self-consistency, and later tree-search-style prompting. Those methods improve how the model thinks even when it never touches the outside world.

Tool use is about whether a model can decide to call an external function, search engine, calculator, browser, code runner, or UI action; pass the right arguments; interpret the result; and continue. That is partly a reasoning problem, but it is also an interface and reliability problem: schemas, argument formatting, retries, stop conditions, state tracking, and error recovery. Toolformer made this distinction especially clear by treating tool use as something a model could learn during generation.

Historically, the timeline looks roughly like this:

2022: reasoning first, mostly without tools. Chain-of-thought prompting showed that asking models to generate intermediate reasoning steps could dramatically improve multi-step reasoning. Self-consistency pushed this further by sampling multiple reasoning paths and selecting the most consistent answer. The key lesson was that a large share of “reasoning” gains could come from spending more inference-time compute on the same prompt, not just from adding more knowledge.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (https://arxiv.org/abs/2201.11903)
Self-Consistency Improves Chain of Thought Reasoning in Language Models (https://arxiv.org/abs/2203.11171)

Late 2022: the first real bridge between reasoning and acting. ReAct was the key milestone. It framed the model as alternating between reasoning traces and external actions such as retrieval or environment interaction.
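
That alternation between reasoning traces and actions can be sketched as a short loop. This is a minimal illustration, not the ReAct paper's implementation: `react_loop`, `scripted_policy`, the `lookup` tool, and the `finish` action name are hypothetical, and the policy is a hard-coded stand-in for a language model.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical interfaces: a policy maps the transcript so far to
# (thought, action name, action argument); a tool maps an argument to text.
Policy = Callable[[List[str]], Tuple[str, str, str]]
Tool = Callable[[str], str]


def react_loop(question: str, policy: Policy, tools: Dict[str, Tool],
               max_steps: int = 5) -> Optional[str]:
    """Alternate thought -> action -> observation until the policy finishes."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, arg = policy(transcript)
        transcript.append(f"Thought: {thought}")
        if action == "finish":
            return arg  # the policy's final answer
        tool = tools.get(action)
        observation = tool(arg) if tool else f"unknown tool: {action}"
        # The observation is appended so it can shape the next thought.
        transcript.append(f"Action: {action}[{arg}]")
        transcript.append(f"Observation: {observation}")
    return None  # step budget exhausted without a final answer


def scripted_policy(transcript: List[str]) -> Tuple[str, str, str]:
    """Scripted stand-in for a model: look something up once, then answer."""
    prefix = "Observation: "
    observations = [t[len(prefix):] for t in transcript if t.startswith(prefix)]
    if not observations:
        return ("I should look this up.", "lookup", "capital of France")
    return ("The observation answers the question.", "finish", observations[-1])


FACTS = {"capital of France": "Paris"}
TOOLS: Dict[str, Tool] = {"lookup": lambda key: FACTS.get(key, "not found")}
```

Calling `react_loop("What is the capital of France?", scripted_policy, TOOLS)` takes two steps: one lookup, then a finish action that returns the observation. The `max_steps` budget is one simple stop condition; it returns `None` when the policy never finishes.
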
This was the moment the field started to see tool use not as a one-off API call, but as a loop in which reasoning selects actions and tool outputs reshape the next reasoning step.

ReAct: Synergizing Reasoning and Acting in Language Models (https://arxiv.org/abs/2210.03629)

2023: tool use becomes an API discipline, not just a prompting trick. Toolformer argued that models could learn when to call tools, which tools to call, and how to incorporate the results. Around the same time, vendors began standardizing function-calling interfaces. OpenAI’s June 2023 function calling release was a major product milestone because it made structured tool invocation reliable enough for developers to build on. This improved tool-use reliability faster than it improved deep reasoning.

Toolformer: Language Models Can Teach Themselves to Use Tools (https://arxiv.org/abs/2302.04761)
OpenAI, “Function calling and other API updates” (https://openai.com/index/function-calling-and-other-api-updates/)

2023 also deepened the separation betw
