TLDR
Recursive coding agents, inspired by recursive language models (RLMs), unify tool calling and reasoning to make AI agents reliable outcome-deliverers rather than mismanaged geniuses. RLMs externalize the prompt as a variable and use a read-evaluate-print loop (REPL) to symbolically decompose problems, enabling small models like Qwen 3.59B to beat frontier models on long reasoning tasks. Applying RLM principles to coding agents—through dynamic workflows (Claude Code) or the OpenProse language—allows agents to recursively call themselves, verify sub-agent work, and capture golden sessions for repeatable reliability.
Key points
- Today's agents are mismanaged geniuses: they have sufficient intelligence but lack reliable orchestration and verification layers.
- Recursive language models (RLMs) treat the context itself as the object of computation, marrying tool calling with reasoning in a REPL loop.
- RLMs can process information many orders of magnitude larger than their context window (tens of millions of tokens) and serve as powerful memory systems.
- A small model (Qwen 3.59B) using an RLM harness beats frontier models (Opus, GPT-5.4) on long reasoning tasks (Long CoT benchmark).
- RLMs are the new paradigm of test-time compute, unifying reasoning and code execution instead of relying solely on latent-space reasoning tokens.
- Recursive coding agents apply RLM principles by having a coding agent call itself (e.g., Pi calling Pi) with configurable depth, enabled by extensions like pi-recursive.
- Claude Code's dynamic workflows (released weeks ago) make it an RLM by allowing recursive sub-agent orchestration, as confirmed by Omar Khattab.
- OpenProse is a markdown-based programming language compiled by coding agents that can explicitly declare sub-agent work, verify results, and wire dependencies (skills/tools) for reliable recursive execution.
Tools mentioned
Techniques
- Recursive language models (RLMs)
- Read-evaluate-print loop (REPL) for symbolic exploration
- Dynamic workflows for recursive sub-agent orchestration
- OpenProse markdown spec for declarative agent workflows
- Golden session capture and reuse via Prose programs
- Hardcoded map-reduce vs. agent-decided decomposition
Takeaways
- Trust equals reliability: the bottleneck is orchestration, not intelligence.
- RLMs unify reasoning and tool calling, representing a new paradigm of test-time compute.
- Coding agents can be made recursive using dynamic workflows (Claude Code) or OpenProse, enabling reliable, repeatable outcomes.
- Recursive coding agents allow small models to outperform frontier models on long reasoning tasks.
Transcript (captions)
Hello there. My name is Raymond Whitcomb and today I'm going to talk about recursive coding agents, which is this idea of applying the lessons of recursive language models, RLMs, to coding agents. This is some work that I have done both in my independent research um Raw Works uh and also more recently in my role at OpenProse. So, to motivate this a little bit, we all want outcomes. We all want agents that are working on our behalf. We want reliable co-workers that are getting things done while we're doing something fun, while we're out on a hike, while we're cold chilling, while we're doing the do. And my argument and my experience is that the bottleneck to this is not intelligence. The models are intelligent enough. They know all kinds of things. They know the entire internet. But they can't reliably deliver outcomes. And so I can't trust them. So, as a very simple example, you know, one day I get almost a fully working SaaS app from a single prompt, granted a long prompt. The next day, and I swear this actually happened, cloud code empties the entire contents of my Solana wallet. Oops. Okay. So, that doesn't really instill trust. So, uh at the bottom here we got this this progression, okay? And we all want to move towards the the one on the right where we're just sort of sitting there and meditating and and things are manifesting. And so, where does that come from? This is from the AI engineer code. It's actually from the back of the t-shirt. AI engineer code, November 2025. Man, I hope I hope you're there. If you weren't, watch it on YouTube. It was it was amazing. So, here's the thesis. The thesis is today's agents are mismanaged geniuses. The intelligence is there, and the missing layer is how do we specify and manage and reuse and verify the work. So, this uh framing this phrase, the mismanaged genius, uh comes from Alex Zeng, Z Li, and Omar Khattab at MIT. Um and Alex and Omar are uh part of the authors of the original recursive language models paper. Uh I've also talked a little bit about this recently on Turning Post. I forgot to mention that these slides are actually a website, recursivecodingagents.com. So, you can click on them uh by going to this website. So, everything I'm going to show in here is is interactive. Okay. What are recursive language models? So, I like to say that in an RLM, the context itself is the object of computation. Um and this is essentially a marriage of tool calling and reasoning. We're going to talk a lot more more about that in the next slide. But, the idea is that the full prompt is not a simple user query. The full prompt is a variable. The full prompt could be a file or many files. Um and we have this read-evaluate-print loop, REPL, um that the agent is interacting with. In the original paper, that's Python. And the RLM is instructed to operate symbolically on that prompt. So, don't just read the whole thing into your context window. Um explore it symbolically. And uh even more, you don't even directly explore symbolically, or maybe you do a little bit of poking around, but have uh other LLMs, uh and I guess other RLMs, if you allow the recursion depth to be to be greater than one. Uh have these uh other recursive um sub agents and and again we'll get uh a little bit into the the weeds of the the lingo um sub RLMs sub LLMs uh do this symbolic manipulation to pick apart the answer and then work our way back up to a final answer. So, it looks something like like this in this in this tree below. So, my take here is that RLMs are the new reasoning models and I see this as the next paradigm of test time compute, inference time compute, whatever you want to call it. And why does it seem obvious or maybe like hey, why is this even a thing? Um I think it's very elegant because it's a very elegant marriage of two things, reasoning and code execution. So, the code execution is reasoning. Um and so instead of we we had long um we had chain of thought as a prompting strategy that evolved into reasoning models that explicitly expressed the chain of thought as the reasoning tokens. We already had function calling, tool calling, parallel tool calling, um and RLMs really puts that together in a way that gets amazing results. So, three very simple examples, one from the original paper U Long, the RLMs can process information that is many orders of magnitude larger than their context window, tens of millions of tokens. Uh what I showed in my own independent work was that the default RLM harness is itself a really powerful memory system. Um so, RLM with no modifications is essentially like a top 10 memory system and like you know, up there with all the people custom making memory systems and there's probably billions of dollars going into that. Um and uh with a little bit of modification, you you can get really amazing results uh using it as memory. I was also able to show state-of-the-art results where the RLM framework and specifically the DSPy implementation of it was able to get state-of-the-art uh results on long reasoning tasks. Now, there's this new benchmark Long CoT. I won't go into the details in depth, but the idea of this benchmark was that the problems are hard specifically because they require so many um steps of reasoning uh in in the like analysis sequence or the chain of thought um that most uh reasoning models, including the top ones, can't hold the thread for long enough. Um if you allow the uh RLM to solve the problem uh using a combination of code and recursive calls to sub agents, then a very small model, Qwen 3.59B, you could run this on a laptop, uh can actually beat So, Qwen 3.59B as an RLM can beat Opus and um and GPT-5.4, all the top frontier models as LLMs on these long reasoning tasks. So, they're extremely extremely powerful. So powerful that they are arguably too hot to benchmark. So, two examples here, on the left, a very high-profile uh case where um the Symbolica team has this RLM agent harness called Agentica. Within hours of Arc AGI-3 being released, where the top scores of all the frontier models were around two or three percent, the uh Symbolica team showed 30-something percent. This is crazy. They blew it out of the water within hours using RLMs as a framework. So much so that is uh, very much upset the Arc Prize team. And so, uh, they gave them what I'm interpreting as a consolation tweet. Uh, which as far as I'm reading the situation was essentially saying, you know, la-di-da, congratulations, um, but you didn't solve the problem the right way. And we don't like RLM harnesses. And so, uh, you can have this nice tweet, but we refuse to actually do the full private part of the ArcadeGI evaluation. Uh, which to me is just insane. Uh, in my own uh, work and and on the long COT benchmark, um, my results as well as Alex uh, from MIT, the RLM first author, uh, encouraged, let's say, the uh, the uh, leaderboard maintainers to actually make like a separate open harness leaderboard. Uh, so that the the results of the RLMs could be showcased without contaminating the original intent of the leaderboard, which was basically no tool calling is allowed. So, my take on this is I don't care. I don't care whether it's latent space or reasoning tokens or code execution. I want results. And I want AI programs that get those results. Okay, so this can feel close to a lot of other things. And I have I built a little rubric. There's a companion GitHub repo for this that you can go through if you want to see. Um, and so, what do we need to be an RLM? We have an executable environment, the prompt is externalized, there's code that's actually the thing calling the model, the model is able to pick the decomposition of the problem into the sub calls or sub agents, and the state itself is staying symbolic, right? So, obviously plain LLMs and rag and things like that don't don't meet those. Coding agents and sub-agents and loops, they get close, but they're not quite there. And again, this rubric here is not to like start fights or nitpick. It's just trying to explain like what what's the essence of of RLM and recursive coding agents. Another example that's close but no cigar would be hardcoded map reduces. And I would put this this project of lambda RLM in that category, which is essentially a way of saying, "Okay, I'm going to decompose the problem using lambda calculus into a map reduce." And then there are like LLM calls in that executing the map reduce. But the But the LLM is not deciding or the RLM is not deciding how to decompose the problem. And that I see as like a key element of this that makes it very agent native, you might say. Okay, so now RLMs, what about recursive coding agents? Okay, it looks the same to me. We just swap RLMs and LLMs for agents and sub-agents. And don't we have the same thing? And yeah, you do. And I think that you could take this perspective of haha, trick question, like RLM is a coding agent and it's already recursive, so what did I And that's fine, and I don't think that argument is wrong, but it doesn't really move anything forward. And so, what I'm interested in is this question of how can we apply the principles of RLMs to coding agents and make them actually useful for coding agents. And I've been very obsessed with this problem since the October RLM blog post came out in 2025. Okay, so I'll show some of my experiments on recursive coding agents. The first one was simply wrapping Alex's RLM package as a CLI. So, the idea was I just want to give my coding agent an RLM as a tool call. So, let's say we go we need to go sift through a 100 million token corpus. Uh well, now I just use this that tool and then the RLM does the RLM thing. So, that's interesting. And it can be very useful. Uh and a few other people have built uh things like that. Uh but then I thought, "Well, what would it really mean? What would be possible or how could it be possible to make the coding agent fully recursive? So, the coding agent harness like calls itself, like the exact version of itself. Um and how might you implement that? And that is what I called Y pie. Uh Y stands for the lambda calculus uh Y combinator. And pie, in case you haven't heard of it, is really awesome coding agent. It's incredibly minimal and it's specifically designed to be extensible. So, Mario uh wants you to write extensions for pie for new features that you would like rather than trying to stuff your ideas into the the main um agent. So, it's meant to be a very minimal core that you extend however you like. And when I originally had the uh recursive coding agent idea, I wanted to do it with pie. I was not able to use pie extensions to achieve this goal. Um and so, I had to fork it instead. I'm very excited to report that in anticipation of this talk, I revisited this and now pie has evolved um and the pie extensions have evolved such that you can uh make it fully recursive with a pure extension. So, I have both the pure recursive um extension pi recursive uh package as well as the y pi wrapper that is like a convenience wrapper for this. So, this is very quite literally a recursive coding agent in the sense that pi calls pi calls pi calls pi. You can set the depth however you want. Um and now I want to show a few other notable projects in the space. So, obviously there's the original uh implementation from Alex RLM. DSPY pi.rlm is uh my go-to uh when I'm especially doing benchmarking. Um that's how I got all these amazing results uh on some of these benchmarks. AXE, I think is a very interesting one because it's incredibly agent native. So, the AXE started out as a TypeScript variation on DSPY. When RLM's came out, they obviously implemented it uh but they did it in this way that's um enables then AXE agent to like write a whole TypeScript interface to another AXE agent and go all the way down uh the the recursive rabbit hole um which I think is really cool and very interesting. Just to showcase that this like REPL could be anything, there's a an example from Dan at OpenProse who made the Unix RLM. This is pure bash and the environment is just the Linux file system. So, that's a whole another angle of thinking about uh what's possible with RLM. And then lastly, and we'll talk more about OpenProse um but uh OpenProse as a language uh actually enables you to convert any coding agent into an RLM. I'll talk a little bit more about how to do that towards the end of the talk. Uh and then the OpenProse repo also contains a harness that executes it's a coding agent harness that uh will let you use Codex SDK or Cloud Code underneath and do an RLM style execution of these prose programs. So, is Cloud Code an RLM? This is like the the question that keeps getting asked. And the original answer on the day release day of the blog post was no, no, no, it's not. Uh but it was literally the first question that was asked or like asked on the very same day uh to the launch tweet. Um it's basically saying, "Hey, this is Cloud Code subagents, right?" Uh go back to my rubric if you want to like dig into some of the the nitty-ditty nitty-gritty details. Um but arguably now it is. So, uh over here on the right we see Omar saying, "Hey, congratulations Anthropic, uh Cloud Code is finally an RLM now that you have dynamic workflows." So, so what changed? And I think this is an interesting way of explaining um what's powerful about recursive coding agents and what RLM's even are um by using this example. So, dynamic workflows were released just a few weeks ago um and they make Cloud Code recursive or capable of doing these recursive uh workflows. Uh I would highly encourage you to read this blog post called A Harness for Every Task. It shows six different workflow patterns that are very powerful. Obviously, there's many more that you can achieve. And just to to show it, I I wrote two uh workflows for Cloud Code. Uh one that is explicitly not an RLM, so that's like a hard-coded map reduce workflow, and one that I'm arguing is, that is uh you could kind of think of it as like deep research over a file system. So, you know, pick pick a handle, um assign that to an agent, have it go do some analysis, bring back what it finds, etc. Uh again, these are in the companion repo. Uh and then now to open prose. So, so dynamic workflows are cool. They're only in Claude code. They're also not the only way to do this. So, what if you don't like Claude code? Or what if you do like Claude code, but you don't want to use dynamic workflows? You want something else. So, this is what open prose is all about. So, open prose is technically a programming language, but it is not compiled by your computer. It's compiled by your coding agent. Um, it's a markdown spec. It's logical English. You don't need to learn any kind of crazy syntax. And there's actually a command prose write that will get Claude code or Codex or your favorite or Pi or your favorite coding agent to write a dot prose dot md file for you. Um, so in that way it's similar to the the um, Claude code um, ultra code command where it like decide and write the workflows for you. And the the prose has the ability to turn any agent that's got a file system and sub agents into an RLM. So, this is open source repo. You can check it out. Um, and I've also written uh, a little bit more in depth about this for Turing Post in this article. And the key thing that I want to bring up with regards to the RLMs is that open prose can explicitly declare the sub agent work. So, again, I've made two kind of demo prose dot md files that are in this companion repo if you want to dig into the code. I'm not going to do that here in the slides. Uh, where you can break a problem up into smaller pieces that are assigned to sub agents, verify the work of those sub agents um, in the in the parent agent session. Uh, and what's even cooler is you can actually in open prose to the features that I added to the language or that you can add skills and tools as explicit dependencies. So, you can imagine a workflow where certain sub agent needs a very specific skill to do its role in the workflow or must have access to a certain CLI tool, for example, or it can't run and do its job. And so, there's a way in Pros to actually wire those in as dependencies to ensure that um not only that the way the work is done um is what you want, but actually that the sub agents are specifically configured with the tools and skills that they need to successfully do the work that you are declaring in the Pros contract. Okay, super cool. What can you actually do? So, I've got two examples from Cloud Dynamic Workflows, two examples from Open Pros. Uh repo scale migrations, this was kind of the launch post example, refactor like a huge thing uh all with a big swarm in parallel and then merge the whole thing together. That's super cool. Um this idea of like go after a directory and then deep research or deep analyze uh or deep process in some way recursively uh inside that is is another example uh I have here. You can do audits, uh bug sweeps, you can do adversarial things such as having a skeptical agent or um you know, a red team uh set of agents uh that are going to uh try to uh improve the system adversarially or in parallel. And then uh one really cool thing that you can do with Open Pros that I just added recently is kind of goes back to my very first slide, right? So, like one day I get this amazing result, the next day they trade away all my all my uh crypto currency, which is very small, thankfully. Um, but like, how do we get these things to be more reliable? And how do we get them, you know, we we have like a golden session and we have a great day. And now we want to capture that and use it over and over again. So, I built a system where you can take a golden session for Cloud Code, Codex Pi, whatever you want. And it's a Prose program that will actually have the agent deconstruct that session and turn it into a reusable Prose workflow, um, that again can involve this idea of recursive coding agents, um, to get you to a reliable, uh, way of getting to that golden state of performance over and over and over again. Recursion Recursive coding agents for the win. I really think this is very powerful. RLMs just kind of blew my mind when they first came out. And again, as you can probably see through this talk, I've been absolutely obsessed with applying the ideas of RLMs to coding agents. The three things that I hope you'll take away from this is one, like, trust is reliability. Like, how can we how can we trust something that isn't reliable? And again, my argument and this idea of the mismanaged genius is um, that the next step is not, um, more raw intelligence. It's actually, uh, behavioral. It's actually orchestration. I personally believe that, um, RLMs represent this new paradigm of test time compute, inference time compute, uh, where tool calling and reasoning are unified. And we reason through tool calling and we can, uh, recursively iterate. And one of those tools is to call another agent to go do it on some other specific task or subset of the problem. Uh, and then also, I hope we settle a little bit of this drama around like wait, like are are LLMs actually new? Aren't they just coding agents? Yes, coding agents can be LLMs. They aren't automatically LLMs. And so I've showed a couple of different Claude code dynamic workflows that can turn Claude code into an LLM as well as some ways of doing this with open pros that you can use with any coding agent. So, I see this as an incredibly powerful way of working with coding agents. I hope you will dig in more to recursivecodingagents.com. But, with great power comes great responsibility. So, until next time, please recurse responsibly. Thank you very much.
Jobs for this video
| Stage | Status | Attempts | Last error | Updated |
|---|---|---|---|---|
| summarize | done | 0 | — | 2026-06-26 01:18:16.967430+00:00 |
| summarize | dead | 0 | handler returned DEAD | 2026-06-25 22:04:14.491610+00:00 |
| transcript | done | 0 | — | 2026-06-25 22:03:44.707878+00:00 |
| metadata | done | 0 | — | 2026-06-25 22:03:30.578502+00:00 |