Anthropic Taught Claude to 'Dream', and It Made the Model 6x Better at Real Work
Claude Managed Agents now review past sessions, identify patterns, and refine their own performance over time. Early users report dramatically higher task completion rates without any prompt engineering.

Anthropic shipped a feature last week called "Dreaming" for Claude Managed Agents. The name sounds like marketing, but the mechanism is genuinely novel: after each session, the agent reviews what happened, finds patterns in what worked and what did not, and updates its own internal instructions for next time.
The result, according to Anthropic, is a dramatic improvement in task completion rates. Harvey, the legal AI platform built on Claude, reported roughly 6x higher completion rates on complex legal research tasks after enabling Dreaming. That is not a marginal gain. That is the difference between a tool you try once and a tool you rely on.
What "dreaming" actually means, technically
Let me be precise about what is happening here, because "dreaming" is a metaphor, and metaphors can mislead.
In human beings, dreaming occurs during REM sleep and involves the reactivation and recombination of memory traces. The brain replays experiences, strengthens some neural connections, prunes others, and sometimes produces novel combinations that feel creative upon waking. The function is not fully understood, but the leading theories involve memory consolidation, emotional regulation, and cognitive housekeeping.
Claude's "Dreaming" borrows the name but not the mechanism. The model does not enter a separate state. It does not sleep. What happens is this: at the end of a session, Claude generates a structured post-mortem of its own performance. The post-mortem records the original task, the approach the agent chose, where that approach succeeded, where it failed, and a set of self-generated instructions for handling similar situations better next time. That post-mortem gets stored in a persistent memory layer that future sessions reference.
The key technical distinction is that this is not training. The model weights stay frozen. Nothing about Claude's underlying capabilities changes. What changes is the prompt context the agent operates within. Over dozens of sessions, the agent accumulates a library of self-written guidance: "When the user asks for legal research, start with primary sources before secondary analysis." "When debugging Python code, check the imports first before diving into logic." "This particular user prefers bullet-point summaries over prose paragraphs." Each entry is a small piece of context that nudges the model's behavior in the right direction.
The reason this works as well as it does, and the reason I find it genuinely interesting, is that the agent is better at writing prompts for itself than most humans are at writing prompts for the agent. A human prompt engineer writes general rules: "be thorough," "double-check your work." The agent, reviewing its own specific failure, writes specific rules: "When parsing legal citations in the Bluebook format, verify the reporter abbreviation against the official list before returning results, because on three occasions I confused F.3d with F.4th." That specificity compounds. After fifty sessions, the agent has fifty specific rules, each targeting a concrete failure mode it personally experienced. No human prompt engineer would have written those rules because no human prompt engineer watched all fifty sessions.
What this tells us about LLM internals
There is a deeper observation here about how large language models actually work.
One of the consistent findings of AI interpretability research, the field that studies what is happening inside these models, is that LLMs are surprisingly bad at self-correction without external feedback. Ask a model to check its own work, and it often doubles down on the same error. The model's internal representations do not include a reliable "confidence" signal. It cannot introspect and think "I am probably wrong about this."
Dreaming sidesteps this limitation with a clever architectural hack. Instead of asking the model to correct itself in real time, it asks the model to review its own output after the fact and extract lessons. The difference matters. During a task, the model is using its full context window to process the user's request. After the task, the model can dedicate its entire context window to analyzing what just happened. The signal is cleaner because the analysis is the only thing the model is doing. It is the difference between trying to fix your golf swing mid-swing and watching a video of your swing afterward.
This pattern, using the model in a separate pass for analysis and improvement, is not new. Anthropic's own Constitutional AI approach uses a similar two-pass structure: one pass to generate output, a second pass to critique and refine it against safety principles. Dreaming extends that pattern from safety alignment into general task performance. The same technique that keeps Claude from producing harmful content also helps it get better at legal research.
Why Anthropic is the right lab for this
I want to note something about the institutional context here. Anthropic was founded in 2021 by former OpenAI employees who left specifically because they believed OpenAI was moving too fast on deployment and not fast enough on safety research. The company's entire identity is built around the idea that AI systems need careful engineering to be safe and reliable.
Dreaming fits squarely into that tradition. It is a safety mechanism as much as a performance feature. An agent that reviews its own mistakes and updates its behavior is less likely to repeat errors that caused problems. An agent that builds a memory of what worked and what did not is more predictable over time. These are safety properties, not just convenience features.
The contrast with other approaches is instructive. OpenAI's approach to agent memory has focused on persistent chat history and user-defined custom instructions. Google's approach emphasizes retrieval-augmented generation and tool integration. Anthropic's approach is qualitatively different: it treats the agent's internal learning process as a first-class feature. The agent does not just remember what you said. It remembers what it tried, what failed, and what to do differently.
How it works
Dreaming is not training or fine-tuning. The model weights do not change. Instead, after a session ends, the agent generates a structured summary: what the user asked for, what approach it took, what went wrong, what went right, and what it should do differently next time. That summary gets appended to the agent's persistent memory and referenced in future sessions.
Think of it as a running post-mortem that compounds. Session by session, the agent builds a playbook specific to the user's preferences, the task domain, and the failure modes it has personally encountered. No prompt engineering required. The agent writes its own prompts based on its own mistakes.
Anthropic's engineering lead described it as "giving the model a memory of its own cognition." That is a meaningful distinction from simple chat history. The agent is not just remembering what was said. It is remembering what worked, what did not, and why.
Why it matters
Most AI agent demos work beautifully the first time, because someone spent three days writing the perfect system prompt for that specific demo. Deploy the same agent in a real work environment with messy, varied tasks, and the completion rate plummets. The gap between demo and production is where most agent projects die.
Dreaming attacks that gap directly. Instead of the human debugging why the agent failed and rewriting the system prompt, the agent debugs itself. Over time, it gets better at the specific work a specific user actually does.
The catch is that Dreaming only works within Claude's Managed Agents platform, which requires Anthropic's API. This is not something you can enable on the consumer Claude chat interface. It is an enterprise feature, priced accordingly, and it locks you deeper into Anthropic's platform with every session, since the accumulated memory lives in Anthropic's infrastructure, not yours.
Still, the concept is likely to spread. If self-improving agent memory delivers 6x improvements in real-world task completion, every major AI provider will build something similar within a year. The question is whether anyone else's version works as well, or whether Anthropic has found an edge that compounds. Given the company's head start on the underlying interpretability research, I would not bet against them.