Memory is Not a Database – It’s the Substrate of Thought
“Memory is probably my favorite recent feature,” Sam Altman mused earlier this year, reflecting on how an AI that remembers can “get to know you over your life, and become extremely useful and personalized”[1]. He was pointing to something deeper than a product update. Memory, in Altman’s view, is what makes an AI feel less like a tool and more like a partner. It transforms the experience from a one-off transaction to an ongoing relationship[2]. I think he’s right, but I think he might also be conflating it with what we as humans understand as memory. In humans, memory constitutes identity – it’s the thread that ties our past to who we are now – and it’s the basis of any long-term relationship. Without memory, there is no continuity of self or understanding of others. So when we talk about building agentic AI systems (the kind that carry out goals autonomously over time), we’re really talking about building memory. And not just any memory, but a living system of memory that can serve as the substrate for cognition, adaptation, and identity in these agents.
I have been wrestling with this idea as I watch today’s AI agents struggle with extended tasks. The dirty secret of current LLM-based agents is that their so-called “memory” is brittle and shallow. They can’t reliably carry information or intentions across multiple sessions or complex goal sequences. You see it when a conversational agent forgets your name or repeats itself after a few turns. You see it when AutoGPT spins in circles, losing track of what it’s already done. In fact, developers analyzing AutoGPT found that its tendency to get stuck in loops or go off the rails stems from a finite context window and lack of long-term memory[3]. With no true memory of past actions, it can’t focus on its objectives and keeps trying the same things over and over. This is a fundamental architectural gap. Modern AI systems are memory silos, each chat or session isolated from the next[4]. Start a new conversation, and it’s a blank slate every time – no accumulated learning, no sense of “self” or history carried forward. As researchers recently put it, today’s LLMs rely only on static model weights and short-lived context, which limits any long-horizon coherence or continual learning[5][6]. It’s like dealing with an amnesiac savant: brilliant at single-turn tasks, hopeless at sustained, evolving ones.
The Illusion of “Memory” in Today’s AI
It’s worth pausing to define what we mean by “memory” in AI systems. Right now, when an AI practitioner says an agent has memory, they usually mean one of a few things:
- Expanded context windows: Simply cram more past text into the prompt. (GPT-4 can take 32k tokens, so just include the conversation history or docs up to that limit.)
- Vector databases + RAG: Store chunks of text embeddings and do a similarity search to retrieve relevant snippets when needed (Retrieval-Augmented Generation)[7].
- Summarization of history: Keep a rolling summary of past interactions so the model can be “reminded” without full logs.
- External file or DB logging: For agents like AutoGPT/BabyAGI, write important info to files or a database so it can be read in future steps[8][9].
These techniques do help. Vector search means an agent can look up facts it saw hours or weeks ago. Summaries prevent total forgetfulness. Yet, none of these truly solve the problem of robust, long-horizon reasoning and behavior. They treat memory as storage and retrieval – a passive database to query – rather than as an active, structural part of cognition.
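To make this concrete, here is roughly what the summarization approach looks like in code. It's a minimal sketch, not any particular framework's API: `complete(prompt)` stands in for whatever LLM call you use, and the class name is my own invention.

```python
from typing import Callable, List

class RollingSummaryMemory:
    """Keep a running summary of the dialogue instead of the full log.
    `complete` is assumed to be any function that sends a prompt to an LLM
    and returns its text completion (a thin wrapper around your API client)."""

    def __init__(self, complete: Callable[[str], str], max_recent_turns: int = 4):
        self.complete = complete
        self.summary = ""              # compressed long-term view of the conversation
        self.recent: List[str] = []    # verbatim buffer of the last few turns
        self.max_recent_turns = max_recent_turns

    def add_turn(self, role: str, text: str) -> None:
        self.recent.append(f"{role}: {text}")
        if len(self.recent) > self.max_recent_turns:
            # Fold the overflow into the summary instead of silently dropping it.
            overflow = self.recent[:-self.max_recent_turns]
            self.recent = self.recent[-self.max_recent_turns:]
            self.summary = self.complete(
                "Update this running summary of a conversation.\n"
                f"Current summary:\n{self.summary or '(empty)'}\n"
                "New turns to fold in:\n" + "\n".join(overflow) +
                "\nReturn only the updated summary."
            )

    def build_prompt(self, user_message: str) -> str:
        # The model is "reminded" via the summary plus a few raw turns,
        # not by replaying the entire history.
        return (
            f"Conversation summary so far:\n{self.summary or '(none)'}\n\n"
            "Recent turns:\n" + "\n".join(self.recent) +
            f"\n\nuser: {user_message}\nassistant:"
        )
```

Even this small helper makes the trade-off visible: every compression step discards detail the agent can never get back.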
Consider the RAG approach: it’s basically a smarter autocomplete with footnotes. The model doesn’t actually learn from new information; it just fetches it when prompted. The “memory” lives outside the model’s core, and there is no built-in mechanism to update the model’s understanding over time. This is why you can’t interrogate a typical LLM about something it discussed last month unless you explicitly feed those details back into the prompt. Each session, each prompt, is stateless in the fundamental sense[7]. We’re faking a working memory by stuffing context in front of the model repeatedly. And as context windows grow, we hit diminishing returns – even degrading returns. Empirically, models suffer context rot: performance drops as more tokens are packed into the prompt, especially if much of it is irrelevant detail[10][11]. In other words, simply remembering “more” via brute-force context can actually confuse these systems. The memory is there, but the cognition isn’t.
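A stripped-down RAG loop makes the point visible. Again, this is a sketch under assumptions: `embed(text)` and `complete(prompt)` are hypothetical wrappers around your embedding model and LLM. Notice that nothing in the model changes between calls; the only thing that persists is the store.

```python
import math
from typing import Callable, List, Tuple

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class NaiveRAG:
    """Retrieval-augmented generation in its simplest form: embed, search, stuff.
    `embed` and `complete` are assumed wrappers around an embedding model and an LLM."""

    def __init__(self, embed: Callable[[str], List[float]], complete: Callable[[str], str]):
        self.embed = embed
        self.complete = complete
        self.store: List[Tuple[List[float], str]] = []   # (embedding, chunk) pairs

    def add(self, chunk: str) -> None:
        self.store.append((self.embed(chunk), chunk))

    def answer(self, question: str, k: int = 3) -> str:
        qv = self.embed(question)
        # Pure similarity ranking: no judgment about whether or why to recall.
        top = sorted(self.store, key=lambda item: cosine(qv, item[0]), reverse=True)[:k]
        context = "\n".join(chunk for _, chunk in top)
        # The "memory" lives entirely in this prompt; the model itself stays stateless.
        return self.complete(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```

The store can grow forever, but the model's understanding never does; retrieval quality is the whole game.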
Then there’s the issue of orchestration. Human-like memory – what I suspect most of us picture when the term is used – really isn’t just recalling facts. Rather, it includes when to recall, what to recall, and how to use what’s recalled. Our brains constantly sift, prioritize, and even distort memories in service of the task at hand (usually without us realizing). Current AI agents don’t have an analog of this metacognitive control. An AutoGPT agent doesn’t decide what to forget or which past result to focus on – it just dumps whatever it has into the next prompt until it hits the token limit. Developers end up manually crafting prompt engineering rules to manage this, essentially playing the role of a “memory scheduler” for the AI. It’s ad-hoc and brittle. As one LangChain article quipped, imagine a coworker with no memory – you’d have to repeat yourself endlessly[12] (like a parent or something, weird 😁). That’s the UX today: frustrating, unless we hand-hold the agent’s memory every step of the way.
Memory as the Engine of Understanding (Not Just Recall)
If we step back from the implementation details, there’s a broader thesis emerging: Memory is the substrate upon which cognition is built. In a real sense, memory is the architecture of the mind. It’s not a database; it’s a process. My main man Bernard Lonergan, whom you all likely know by now if you’re reading this Substack, wrote about cognition as layered operations – experiencing, understanding, judging, and deciding – each building on the last. While I won’t dive into Lonergan’s texts here, that perspective inspires a useful way to think about memory analogically in our AI agents (a small code sketch follows the three points below):
Memory is layered. Not all “memories” are equal; there are different kinds serving different roles. We see parallels to human memory: procedural memory (knowing how), semantic memory (knowing facts and concepts), episodic memory (remembering specific events)[13][14]. Current AI implementations mimic these in rudimentary ways. For instance, an LLM’s weights plus its code represent a kind of procedural memory – the skills and biases baked into it. Meanwhile, a vector store of facts extracted from past chats acts as semantic memory (a repository of learned knowledge)[14][15]. And a log of past actions or chain-of-thoughts is akin to episodic memory (records of what happened)[16][17]. But in today’s agents, these layers barely interact or update. They are static or siloed. A true memory system will need to unify these layers, so that what an agent experiences (episodes) can update what it knows (semantic) and even how it operates (procedural). This is exactly the idea behind new research like MemOS, which introduces MemCubes that encapsulate various memory types – from plain text knowledge to activation states and model weight updates – under one framework[18]. In other words, memory spanning from the transient “working memory” of activations to the long-term parameters can be managed together.
Memory is selective and interpretive. Humans forget most of what hits our senses – and thank goodness, or we’d be flooded with useless data (just ask one of the few people suffering from hyperthymesia). What we do remember, we compress and weave into our existing worldview. Our memories are stories and schemas, full of omissions and embellishments. Why? Because memory serves understanding, not the other way around. An AI agent will similarly need to choose what to remember and at what level of detail. It might record that “Task X was completed and yielded Result Y” without keeping every log line of how it got there. It might abstract a user’s preference as “user likes direct answers” instead of storing every question they asked. This selection is guided by an orientation toward intelligibility – a drive to capture the essence of experience in a useful form. Today’s agents treat all data points a bit too evenly. A smarter agent would, say, attribute meaning and context to memories: this piece of information was wrong, that feedback was positive, these facts relate to topic Z. By tagging and structuring memories, the agent can later retrieve not just a raw snippet, but the right kind of memory for the situation (e.g. “I recall you prefer a casual tone in emails” vs. a random past email).
Memory is dynamic and fallible. The strength of human memory is not in never forgetting – it’s in the constant reworking of what we know. We revise memories in light of new evidence. We generalize from specifics and, sometimes, we mercifully drop details that no longer matter. A rigid memory is a brittle one. AI systems to date have been mostly static: once the model is trained, its “knowledge” is frozen except for whatever trickles in through a prompt. We’re starting to see this change. With tools like fine-tuning or online learning (carefully controlled), an agent could update its semantic memory when it encounters new facts. Even without weight updates, it could maintain a mutable knowledge base and a sense of its own trajectory (what it’s done and why). Crucially, it should also learn to forget. Yes, forgetting is a feature! Effective cognitive systems must discard or de-prioritize outdated and irrelevant information to make room for the new. If our AI assistant learned about a user’s old job and now the user switched careers, the assistant should gradually attenuate the importance of the old info. Without a mechanism for graceful forgetting, the “memory” becomes a junk heap that might clutter the agent’s decisions.
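Here is the small sketch promised above: a single memory record type that carries its kind (episodic, semantic, or procedural), its provenance, tags, and a salience score that decays unless the memory is used. The field names are my own invention, not any existing library's schema.

```python
import time
from dataclasses import dataclass, field
from typing import List

@dataclass
class MemoryRecord:
    kind: str                      # "episodic" | "semantic" | "procedural"
    content: str                   # compressed gist, not a raw transcript
    tags: List[str] = field(default_factory=list)     # e.g. ["user_pref", "tone"]
    source: str = ""               # where this came from (session id, tool name, ...)
    created_at: float = field(default_factory=time.time)
    confidence: float = 1.0        # revised downward when contradicted
    salience: float = 1.0          # decays unless retrieval refreshes it

    def decayed_salience(self, half_life_days: float = 30.0) -> float:
        # Exponential decay: memories fade unless use keeps them alive.
        age_days = (time.time() - self.created_at) / 86400.0
        return self.salience * 0.5 ** (age_days / half_life_days)

# Example: abstracting a preference instead of storing every raw message.
pref = MemoryRecord(
    kind="semantic",
    content="User prefers direct answers and a casual tone in emails",
    tags=["user_pref", "tone", "email"],
    source="session-2025-08-14",
)
```

The point isn't the particular fields; it's that each memory carries enough structure for the agent to judge later whether, and how, to use it.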
In short, building memory for an AI agent is less about databases and more about building a mind. It means instilling the kind of layered, purposeful recollection that underpins human thought. When memory works, it feels like understanding. The agent is using past experience to shape present reasoning.
Why Today’s Agents Fall Short (Examples from the Frontier)
It’s illuminating to examine how current agent frameworks attempt to deal with memory, and where they hit a wall. Let’s walk through a few:
AutoGPT & BabyAGI: These autonomous agent prototypes burst onto the scene with promises of multi-step reasoning. Under the hood, their “memory” was often a combo of a short-term context (the prompt that grows with each step) and an external file or vector store for long-term info. In practice, they quickly exposed the fragility of this setup. AutoGPT, for example, would frequently lose the thread, endlessly looping on a subtask because it couldn’t truly recall that it had tried that path already[3]. Developers noted that the lack of long-term memory and the finite context window were the primary culprits[19]. BabyAGI, which managed a task list and a vector DB of results, could juggle simple routines but struggled as tasks became more open-ended – the agent had no deeper narrative of what it was doing, just a bunch of past notes. These projects deserve credit for highlighting the need for memory: they often integrate functions to save important info to a vector database and retrieve it later[20]. Yet, without more sophisticated filtering and understanding of those past results, retrieval alone wasn’t enough. The critique here is that memorization != comprehension. Storing the outcome of every tool use or API call doesn’t yield a system that knows why it did things or when to change strategy. The lesson from AutoGPT and BabyAGI is clear: long-term autonomy demands an integrated memory architecture, not a bolted-on log.
LangChain (and similar frameworks): LangChain introduced the community to the idea of pluggable memory modules. You can give your agent a conversation buffer memory, or a summary memory, or even a vector-store-backed memory for longer recall[21][22]. This is immensely useful for builders – it’s a toolkit approach. For example, a customer support bot can be equipped to remember the user’s name and last issue, because the developer adds a short-term memory buffer. Or a research assistant agent can have a semantic memory: after each interaction it extracts key facts and stores them, so later it can retrieve “facts the user has provided” to avoid asking again[23]. LangChain’s blog even maps these ideas to psychology: distinguishing semantic vs episodic memory, etc., and notes that each application might need to remember different things[24]. A coding agent might recall user preferences about code style, while a travel planner agent remembers the traveler’s past destinations[24]. This specialization hints at a verticalization of memory – which I’ll return to in the conclusion. The limitation, however, is that LangChain leaves the cognitive orchestration to you. It doesn’t provide a unified brain; it provides brain pieces. If you assemble them well, you get a better agent. If you don’t, you get an agent that either forgets or overloads itself. There’s a reason many experienced devs say building a good LLM agent is more about managing prompts and memory than about the model. We’re essentially designing crude, manual memory management policies. This is akin to programming before automatic memory management – powerful, but error-prone.
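To picture the "brain pieces" approach, here is a generic sketch of a pluggable memory interface. To be clear, this is not LangChain's actual API; it's a hypothetical protocol that captures the pattern: each module contributes context before a model call and absorbs the exchange afterwards, and the developer decides how to wire the modules together.

```python
from typing import Dict, List, Protocol

class Memory(Protocol):
    """Anything that can contribute context before a model call
    and absorb the exchange afterwards."""
    def load(self, inputs: Dict[str, str]) -> str: ...
    def save(self, inputs: Dict[str, str], outputs: Dict[str, str]) -> None: ...

class BufferMemory:
    """Short-term memory: replay the last few turns verbatim."""
    def __init__(self, max_turns: int = 5):
        self.turns: List[str] = []
        self.max_turns = max_turns

    def load(self, inputs: Dict[str, str]) -> str:
        return "\n".join(self.turns[-self.max_turns:])

    def save(self, inputs: Dict[str, str], outputs: Dict[str, str]) -> None:
        self.turns.append(f"user: {inputs['input']}\nassistant: {outputs['output']}")

class FactMemory:
    """Long-term semantic memory: keep extracted facts about the user."""
    def __init__(self):
        self.facts: List[str] = []

    def load(self, inputs: Dict[str, str]) -> str:
        return ("Known facts: " + "; ".join(self.facts)) if self.facts else ""

    def save(self, inputs: Dict[str, str], outputs: Dict[str, str]) -> None:
        pass  # in practice, an LLM pass would extract facts worth keeping here

# The orchestration burden stays with the developer: which memories to use,
# in what order, with what token budget. The framework supplies pieces, not a policy.
memories: List[Memory] = [FactMemory(), BufferMemory()]
```

Everything interesting (what each module saves, when it loads, how the pieces are ordered) is still policy the developer writes by hand.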
Vector Databases & External Knowledge: These deserve special mention because vector DBs are often hyped as the solution for AI memory at scale. Indeed, tools like Chroma, Pinecone, or Weaviate have made it straightforward to spin up a massive associative memory: embed everything and let similarity search retrieve relevant chunks on demand. It works great for question-answering (think GPT answering by pulling relevant documentation). But as a cognitive memory for an agent, it’s incomplete. The vector DB doesn’t tell the agent when to search, what to store, or how to use what it gets back. Those are decisions external to the database. Without careful orchestration, an agent either fails to retrieve what it needs or retrieves too much irrelevance (leading to the dreaded context rot). Put simply, a vector search is a librarian, not a psychologist. It can hand you information; it won’t integrate that information into the agent’s sense of self or mission. Furthermore, pure semantic similarity can miss the point – if an agent is debugging code, the fact that it talked about a similar bug two weeks ago is only useful if the agent recognizes the situation is analogous. That kind of analogical link is beyond a naive vector similarity. It requires a higher-level understanding that “this problem is like that past problem,” a leap that current semantic memory setups don’t achieve.
MemOS and Emerging “Memory OS” Research: On the horizon, efforts like MemOS are trying to redesign the AI stack around memory as a first-class citizen[25][26]. MemOS treats memory not as a single data store, but as a manageable resource akin to CPU or disk in an operating system[25][26]. It introduces scheduling algorithms to decide which memories to keep in “RAM” (active context) and which to page out to long-term store, and mechanisms to transform transient experiences into durable knowledge[27][28]. Notably, the MemOS architecture acknowledges the heterogeneity of memory: transient activation states, intermediate computation results, and permanent knowledge updates are all handled under one roof[18][26]. Early results show dramatic gains in tasks that require connecting information across many steps or time intervals[29][30]. This suggests that the bottleneck was indeed how we’ve been handling memory. When you give an agent a structured way to learn from experience (not just recall facts), its effective intelligence leaps. One paper reported a 159% improvement in temporal reasoning tasks by using a memory-centric approach[31][29]. Numbers aside, the philosophy here is crucial: memory isn’t an add-on, it is the system. We’re essentially building an internal knowledge base that the AI can grow and refine through use, blurring the line between “pre-trained model” and “experience-trained agent.”
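To give a flavor of what memory scheduling might look like, here is a toy working-set scheduler that keeps the highest-value memories inside an active-context token budget and pages the rest out to an archive. This is my own illustration of the idea, not MemOS's actual algorithm or data structures.

```python
import time
from dataclasses import dataclass, field
from typing import List

@dataclass
class MemItem:
    content: str
    tokens: int                                # rough size it would occupy in the prompt
    last_used: float = field(default_factory=time.time)
    uses: int = 0

    def score(self) -> float:
        # Favor recently and frequently used memories (a crude LRU/LFU blend).
        age_hours = (time.time() - self.last_used) / 3600.0
        return (1 + self.uses) / (1 + age_hours)

class MemoryScheduler:
    """Decides which memories stay in the active context (the 'RAM')
    and which get paged out to the long-term archive (the 'disk')."""

    def __init__(self, context_budget_tokens: int = 2000):
        self.budget = context_budget_tokens
        self.active: List[MemItem] = []
        self.archive: List[MemItem] = []

    def admit(self, item: MemItem) -> None:
        self.active.append(item)
        self._rebalance()

    def touch(self, item: MemItem) -> None:
        item.uses += 1
        item.last_used = time.time()

    def _rebalance(self) -> None:
        # Keep the best-scoring items within budget; page the rest out.
        self.active.sort(key=lambda m: m.score(), reverse=True)
        used = 0
        keep: List[MemItem] = []
        for item in self.active:
            if used + item.tokens <= self.budget:
                keep.append(item)
                used += item.tokens
            else:
                self.archive.append(item)
        self.active = keep
```

The real systems are far more sophisticated, but the framing is the same: memory becomes a resource to be budgeted and scheduled, not a bucket to be filled.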
Each of these examples teaches us something. The failures of AutoGPT teach us that naïve memory leads to chaotic behavior. The experience with LangChain shows that memory must be tailored and orchestrated for the use-case. The vector DB’s strengths and weaknesses demonstrate that raw retrieval needs guidance. And the new research hints that integrating memory deeply can unlock new levels of performance.
Designing Memory as a Cognitive Control System
Where does this leave us, the builders and practitioners? It’s time to move beyond thinking of memory as a static database and start designing it as a cognitive control system. In practical terms, that means building agents that don’t just have memory, but actively use memory the way a mind does: to perceive, to decide what matters, to learn, and to adapt.
Here’s a short agenda I propose – a loop of operations for any agent that aims to have a human-like memory architecture (a minimal code sketch of the whole loop follows the list):
Perception: First, capture the raw experience. Log the conversation turn, the tool output, the user feedback, the sensor reading – whatever the agent is dealing with. This is the transient trace of “what just happened.” Without perception, there’s nothing to remember.
Selection: Immediately filter and highlight what’s important about this experience. Did the user express a preference or correct the agent’s mistake? Was there an unexpected outcome from an action? The agent should extract salient details (possibly via an LLM prompt asking “what are key facts or results here?”). Not everything goes forward; noise must be left behind.
Encoding: Translate the selected information into an internal representation that can be stored and queried efficiently. This could mean embedding it as a vector, structuring it as JSON with fields (e.g. {"user_pref": "casual_tone"}), or compressing it into a summary. Encoding is about preparing the memory for future use.
Attribution: Link the memory with context: source metadata, timestamps, causal tags, relevance scores. For example, note when and why this memory was formed. “User said they disliked the last recommendation [during travel planning in Aug 2025].” Attribution enriches the memory so that later the agent can retrieve not just a fact but the story around that fact (to avoid misusing it out of context).
Consolidation: Integrate the new memory into the agent’s long-term store. This might involve merging it with existing knowledge (e.g., update the profile of the user’s preferences), or storing it in an organized memory index. It could also trigger a model update in systems that allow learning on the fly. Consolidation is where ephemeral observation becomes lasting knowledge. It might be done during “idle” times or as a background process, analogous to how our brains consolidate memories during sleep[32][33].
Retrieval: When reasoning, have a mechanism to pull relevant memories at the right time. This is classical information retrieval augmented with the agent’s own judgment. The agent might ask itself: “What past experiences are like this situation?” and do a search of its memory (using metadata and embeddings). Retrieved items then have to be integrated into the agent’s working context or chain-of-thought. This step is where classic RAG comes in, but now it’s one part of a larger loop.
Revision: With new feedback or outcomes, go back and update existing memories. If the agent had a stored belief “Solution X works for problem Y” and it later fails, the agent should mark that memory as no longer reliable, or append the condition “(except in cases Z).” This is continuous learning in micro. Over time, an agent’s memory should evolve – refining concepts, correcting errors, re-framing narratives. Revision also covers re-encoding: maybe the initial summary was too coarse and the agent decides to store more details after realizing their importance.
Forgetting: Finally, implement graceful forgetting or deactivation. Not all memories need to live forever. Some can be compressed further, archived, or deleted. The agent might maintain a sliding window or a relevance decay – for instance, memories that haven’t been accessed in a long time or relate to now-irrelevant contexts get phased out. Forgetting is crucial for efficiency and for avoiding the clutter of contradictory or outdated info. It’s also a form of regularization: preventing the agent from overfitting to the past when the situation has changed.
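Pulling the loop together, here is the skeletal implementation promised above. Everything in it is an assumption-laden stand-in: `complete(prompt)` represents an LLM call, the store is an in-memory list, and selection, retrieval, and revision are reduced to the simplest versions that still show the shape of the cycle.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class MemoryTrace:
    content: str                                          # encoded gist of an experience
    meta: Dict[str, str] = field(default_factory=dict)    # attribution: source, task, why
    created_at: float = field(default_factory=time.time)
    reliable: bool = True                                 # flipped by revision when contradicted
    salience: float = 1.0                                 # decays; forgetting threshold below

class MemoryLoop:
    def __init__(self, complete: Callable[[str], str]):
        self.complete = complete                          # any LLM text-completion wrapper
        self.store: List[MemoryTrace] = []

    # 1-4. Perception -> Selection -> Encoding -> Attribution
    def observe(self, raw_event: str, source: str, task: str) -> None:
        gist = self.complete(                             # selection + encoding in one LLM pass
            "Extract the single most important fact, preference, or outcome "
            f"from this event, in one sentence (or reply SKIP):\n{raw_event}"
        )
        if gist.strip().upper() != "SKIP":                # noise is left behind
            self.store.append(MemoryTrace(content=gist.strip(),
                                          meta={"source": source, "task": task}))

    # 5. Consolidation: merge near-duplicates, e.g. as a background/idle job
    def consolidate(self) -> None:
        seen: Dict[str, MemoryTrace] = {}
        for m in self.store:
            key = m.content.lower()
            if key in seen:
                seen[key].salience += m.salience          # repeated observations reinforce
            else:
                seen[key] = m
        self.store = list(seen.values())

    # 6. Retrieval: crude keyword overlap stands in for embeddings plus judgment
    def recall(self, situation: str, k: int = 3) -> List[MemoryTrace]:
        words = set(situation.lower().split())
        ranked = sorted(
            (m for m in self.store if m.reliable),
            key=lambda m: len(words & set(m.content.lower().split())) * m.salience,
            reverse=True,
        )
        for m in ranked[:k]:
            m.salience += 0.5                             # use refreshes salience
        return ranked[:k]

    # 7. Revision: outcomes feed back into stored beliefs
    def revise(self, memory: MemoryTrace, contradicted: bool) -> None:
        if contradicted:
            memory.reliable = False
            memory.meta["revised"] = "contradicted by later outcome"

    # 8. Forgetting: decay salience and drop what no longer earns its keep
    def forget(self, half_life_days: float = 30.0, floor: float = 0.1) -> None:
        now = time.time()
        for m in self.store:
            age_days = (now - m.created_at) / 86400.0
            m.salience *= 0.5 ** (age_days / half_life_days)
            m.created_at = now                            # decay applied; reset the clock
        self.store = [m for m in self.store if m.salience >= floor]
```

In a real agent, the keyword-overlap recall would be embeddings plus metadata filters, consolidation would run as a background job, and revision would be driven by explicit outcome signals; the loop structure is what matters.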
This cycle turns memory into an active faculty. It’s essentially a control system: perceiving signals, updating internal state, and feeding back into decisions. Not every application will need the full loop, but I suspect any agent that aspires to long-term autonomy will touch all these stages in some form. And importantly, this is where vertical applications can shine. When you focus on a specific domain – say a medical diagnosis assistant, or a coding co-pilot, or a customer service bot – you can craft these memory operations with domain knowledge. You know what to pay attention to (selection), how to represent it (encoding), and when to recall it (retrieval) in that context. A vertical AI app can thus implement a robust memory tailored to its use, and in doing so, teach us general lessons. For example, a medical agent might develop a specialized forgetting strategy for outdated research findings (medical knowledge moves fast), which could inform how any AI handles time-sensitive knowledge. A coding assistant might learn to attribute memories by codebase and project, offering insight into context-scoped memory management that generalizes beyond coding.
In building these systems, we should remember Altman’s provocation: the endgame is an AI that “knows you, your whole life.” But to get there, memory can’t just be bigger context windows or bolt-on databases. It must be the core of the agent’s cognitive architecture. The journey from here to that ideal will be messy and will require rethinking a lot of assumptions in AI design. Yet, I’m optimistic. We are, in a sense, rediscovering in machines what nature evolved in us – that memory is the foundation of intelligence. An AI that can form, organize, and use its memories fluidly is an AI that can learn, adapt, and maybe even have something like a personality or identity over time. For those of us building the future of these agents, the mandate is clear: design for memory first. Treat it not as an afterthought but as the very substrate upon which your agent’s mind is built. Everything else – coherence, reasoning, usefulness, safety – will flow from that foundation. This is hard, deep work, but it’s the kind of work that makes the difference between yet another demo and a truly transformative AI partner. And as we iterate in vertical slices and share what we learn, we’ll be co-authoring a new general playbook for machine memory, one that could unlock the next era of agentic intelligence.
Sources:[1][2][3][4][5][6][7][10][11][12][13][14][15][16][17][18][26][34][28][20][22][24]
[1] OpenAI's Memory Trap & Its Implications for Consumer Freedom
https://www.inrupt.com/blog/openais-memory-trap
[2] Opt-In To OpenAI's Memory Feature? 5 Crucial Things To Know
[3] AutoGPT - Wikipedia
https://en.wikipedia.org/wiki/AutoGPT
[4] [5] [6] [7] [18] [25] [26] [27] [28] [29] [30] [31] [34] Chinese researchers unveil MemOS, the first 'memory operating system' that gives AI human-like recall
[8] Decoding AutoGPT: Understanding the Magic Behind the Code
https://medium.com/@evyborov/decoding-autogpt-understanding-the-magic-behind-the-code-991177b62583
[9] AutoGPT Explained: How to Build Self-Managing AI Agents | Built In
https://builtin.com/artificial-intelligence/autogpt
[10] [11] Context Rot: How Increasing Input Tokens Impacts LLM Performance | Chroma Research
https://research.trychroma.com/context-rot
[12] [13] [14] [15] [16] [17] [23] [24] [32] [33] Memory for agents
https://blog.langchain.com/memory-for-agents/
[20] How do tools like AutoGPT get around the size limit? - Reddit