Memory Layer: How Agents Store and Retrieve Memory

A layer that stores and returns relevant memory across steps and sessions, under limits, quality rules, and privacy controls.
On this page
  1. Idea in 30 Seconds
  2. Problem
  3. Solution
  4. How Memory Layer Works
  5. In Code It Looks Like This
  6. What It Looks Like During Execution
  7. When It Fits - and When It Does Not
  8. Fits
  9. Does Not Fit
  10. Typical Problems and Failures
  11. How It Combines with Other Patterns
  12. How This Differs from Agent Runtime
  13. In Short
  14. FAQ
  15. What Next

Idea in 30 Seconds

Memory Layer is not just a fact store: it is a controlled layer for selecting, writing, and returning memory.

The agent should not send the full history with every request. It should read from memory only what truly helps with the next step. Memory Layer should not blindly accumulate everything; its job is to return a small, precise set of facts.

When it is needed: when the agent works not in a single step but across a series of steps or sessions where consistency and personalization matter.

An LLM can only see what is in the current context. The Memory Layer decides which parts of the past should be brought back into it.


Problem

Without a dedicated memory layer, the agent almost always works "from scratch".

This creates typical problems:

  • the agent asks again what it already knows;
  • answers become inconsistent across sessions;
  • too much irrelevant history gets into context;
  • important facts are lost in noise;
  • duplicates or conflicting versions of the same fact accumulate in memory;
  • the agent personalizes answers based on stale or weakly supported data.

As a result, cost, latency, and answer errors increase.

Solution

Add Memory Layer as a separate layer for memory operations: what to store, what to return to context, and what to delete.

It separates "how to remember" logic from "how to think" logic, so the agent runs more stably.

Analogy: like a manager's notes about a client.

A manager does not keep the whole message archive in mind. They pull short relevant notes: what matters for this specific conversation.

Memory Layer does the same by returning only needed memory at the needed moment.

How Memory Layer Works

Memory Layer is a controlled layer between Agent Runtime and memory store that decides what to read, what to write, and what to clean up.

Diagram (full flow): Retrieve → Rank → Inject → Write → Compact

Retrieve
Runtime asks Memory Layer for facts for the current step.

Rank
Layer selects the most relevant records by topic, recency, and importance.

Inject
Selected memory is added to context before calling LLM.

Write
After the agent step, the layer decides whether a new fact should be stored, considering usefulness, stability, sensitivity, source, and TTL.

Compact
Old or duplicated records are compressed, updated, or removed by TTL/limit rules.

This cycle repeats at every step and helps the agent keep consistency across steps and sessions.
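The Rank step above can be sketched as a weighted score over each record. This is a minimal illustration only: the weights, the 30-day recency decay, and the record fields (`similarity`, `created_at`, `importance`) are assumptions, not part of the pattern.

```python
import math
import time

def rank_memories(items, now=None, top_k=4):
    """Order memory records by topic similarity, recency, and importance.

    Assumed record fields: similarity (0..1), created_at (unix seconds),
    importance (0..1). The weights below are illustrative only.
    """
    now = now or time.time()

    def score(item):
        age_days = (now - item["created_at"]) / 86400
        recency = math.exp(-age_days / 30)  # older records decay toward zero
        return 0.6 * item["similarity"] + 0.2 * recency + 0.2 * item["importance"]

    return sorted(items, key=score, reverse=True)[:top_k]
```

A real implementation would usually get `similarity` from a vector search and tune the weights per product, but the shape stays the same: one scalar score, sorted, cut to top-k.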

In Code It Looks Like This

PYTHON
class MemoryLayer:
    def __init__(self, store, max_items_per_user=200):
        self.store = store
        self.max_items_per_user = max_items_per_user

    def retrieve(self, user_id: str, query: str, top_k: int = 4):
        # Return only relevant memory, not full history.
        items = self.store.search(
            user_id=user_id,
            query=query,
            limit=top_k,
            min_score=0.7,
            exclude_expired=True,
        )
        return [item["text"] for item in items]

    def write(
        self,
        user_id: str,
        observation: str,
        tags: list[str],
        source: str = "user",
        sensitivity: str = "low",
        ttl_days: int = 30,
    ):
        if not self._worth_storing(
            observation=observation,
            tags=tags,
            source=source,
            sensitivity=sensitivity,
            ttl_days=ttl_days,
        ):
            return

        self.store.insert(
            user_id=user_id,
            text=observation,
            tags=tags,
            source=source,
            sensitivity=sensitivity,
            ttl_days=ttl_days,
        )
        self.store.enforce_limit(user_id=user_id, max_items=self.max_items_per_user)

    def _worth_storing(
        self,
        observation: str,
        tags: list[str],
        source: str,
        sensitivity: str,
        ttl_days: int,
    ) -> bool:
        text = observation.strip()
        if len(text) < 20:
            return False

        # Do not store short service phrases.
        if text.lower() in {"ok", "thanks", "done", "ready"}:
            return False

        # Do not write sensitive data into standard memory.
        if sensitivity == "high":
            return False

        # Trust only predefined sources.
        if source not in {"user", "tool", "policy"}:
            return False

        # Memory must be stable and useful, not random noise.
        stable_tags = {"preference", "constraint", "profile", "goal"}
        if not any(tag in stable_tags for tag in tags):
            return False

        if ttl_days < 1 or ttl_days > 365:
            return False

        return True
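The Compact step from the flow above is not shown in the class. A standalone sketch of what it might do, assuming records are dicts with `text`, `created_at` (unix seconds), and `ttl_days` fields (the rules and fields are illustrative, not a fixed API):

```python
import time

def compact(records, max_items=200, now=None):
    """Expire, deduplicate, and trim memory records (illustrative rules only)."""
    now = now or time.time()
    # 1. Drop records whose TTL has elapsed.
    alive = [r for r in records if now - r["created_at"] < r["ttl_days"] * 86400]
    # 2. Deduplicate: keep only the newest copy of each normalized text.
    alive.sort(key=lambda r: r["created_at"], reverse=True)
    seen, unique = set(), []
    for r in alive:
        key = r["text"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(r)
    # 3. Enforce the per-user limit, newest records first.
    return unique[:max_items]
```

In production, step 2 is often semantic (merging near-duplicate facts via embeddings or an LLM summary) rather than an exact-text match.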

What It Looks Like During Execution

TEXT
Request: "Prepare a weekly meal plan for me"

Step 1
Agent Runtime: calls Memory Layer.retrieve(...)
Memory Layer: returns relevant facts -> ["peanut allergy", "vegetarian diet"]
Agent Runtime: adds these facts to Context
Agent Runtime: calls LLM.decide(...)

Step 2
LLM: returns -> final_answer (plan without peanuts and meat)
Agent Runtime: passes new observation to Memory Layer.write(...)
Memory Layer: stores fact -> "user wants budget up to $80/week"

Memory Layer helps the agent not forget important facts and not overload context with extra details.
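The trace above can be wired together in a few lines. A sketch of a single runtime step; `memory`, `llm`, and the prompt layout are assumed interfaces matching this page's examples, not a fixed API:

```python
def agent_step(memory, llm, user_id: str, user_request: str) -> str:
    """One runtime step: Retrieve -> Inject -> call LLM -> Write."""
    # Retrieve: only top-k relevant facts, not the full history.
    facts = memory.retrieve(user_id=user_id, query=user_request)
    # Inject: add the selected facts to context before calling the LLM.
    prompt = "\n".join(["Known facts:", *facts, "Request:", user_request])
    answer = llm(prompt)
    # Write: the memory layer, not the runtime, decides whether to store this.
    memory.write(user_id=user_id, observation=user_request, tags=["goal"])
    return answer
```

Note that the runtime only decides *when* to call `retrieve` and `write`; all selection and write-worthiness logic stays inside the memory layer.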

When It Fits - and When It Does Not

Memory Layer is useful when the agent must remember facts across steps or sessions. For one-time requests it is often unnecessary.

Fits

| Situation | Why Memory Layer Fits |
| --- | --- |
| ✅ Agent works with user across multiple sessions | Memory stores important facts and removes repeated clarifications. |
| ✅ Response personalization is needed | Layer returns user preferences and constraints before the LLM step. |
| ✅ Agent runs a long multi-step workflow within one run | Layer keeps intermediate conclusions and important facts without constantly inflating context. |
| ✅ Context grows fast and has limits | Instead of full history, the agent gets only top-k relevant facts. |

Does Not Fit

| Situation | Why Memory Layer Does Not Fit |
| --- | --- |
| ❌ One-shot request without follow-up dialogue | Separate memory layer adds complexity without visible benefit. |
| ❌ Need facts that change fast: prices, statuses, availability, live data | Here fresh retrieval or a tool call is better than relying on memory. |
| ❌ Product policy forbids storing data across sessions | Long-term memory will violate privacy and compliance requirements. |

In such cases, one model call is often enough:

PYTHON
response = llm(prompt)

Typical Problems and Failures

| Problem | What Happens | How to Prevent |
| --- | --- | --- |
| Stale memory | Agent uses an old fact and gives a wrong answer | TTL, record versions, and periodic refresh |
| Incorrect personalization | Agent overconfidently personalizes a response based on weak or stale memory | Fact freshness checks, confidence threshold, and user confirmation before personalization |
| Noise in memory | Context receives many low-utility records | Write rules, ranking, and top_k limits |
| Cross-user leak | Agent reads memory of another user or tenant | Isolation by user_id/tenant_id and access checks |
| Poisoned memory | Dangerous or false instruction enters memory | Sanitization, trusted sources, manual review of critical records |
| Context limit overflow | Memory volume in context becomes too large for the LLM | Compression, deduplication, and short summaries instead of raw history |

Most Memory Layer issues are solved via clear write rules, strong ranking, and access control.
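For the cross-user-leak row above, one common mitigation is a thin wrapper that pins every store call to a single tenant. A minimal sketch (`ScopedStore` is a hypothetical name, not a library class):

```python
class ScopedStore:
    """Wraps a raw memory store so every call is forced onto one user_id."""

    def __init__(self, store, user_id: str):
        self._store = store
        self._user_id = user_id

    def search(self, user_id: str, **kwargs):
        # Refuse any query that targets a different user or tenant.
        if user_id != self._user_id:
            raise PermissionError("cross-user memory access blocked")
        return self._store.search(user_id=user_id, **kwargs)
```

Constructing the scoped store once per request (from the authenticated identity) means a bug higher up the stack cannot silently query another user's memory.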

How It Combines with Other Patterns

Memory Layer does not control the whole agent. It is responsible only for high-quality memory operations at each step.

  • Agent Runtime β€” Runtime decides when to access memory, and Memory Layer decides what exactly to read, write, or delete.
  • Tool Execution Layer β€” tool calls can read or update memory through the controlled execution layer.
  • Memory-Augmented Agent β€” this pattern directly relies on Memory Layer.
  • RAG Agent β€” RAG retrieves external knowledge, while Memory Layer keeps internal experience of specific agent/user.

In other words:

  • Agent Runtime defines when the agent accesses memory
  • Memory Layer defines what exactly is stored and returned back

How This Differs from Agent Runtime

| | Agent Runtime | Memory Layer |
| --- | --- | --- |
| What it controls | Whole agent loop | Memory writing, retrieval, and quality |
| When it works | At each execution-loop step | During memory read/write |
| What it returns | Next state or final answer | Relevant facts for context |
| Main risk | Wrong loop and limit control | Stale, noisy, or unsafe records |

Agent Runtime is the "conductor" of the whole process.

Memory Layer is the "system memory" that keeps response consistency.

In Short

Quick take

Memory Layer:

  • stores important facts across steps and sessions
  • returns only relevant records to context
  • compresses or deletes stale memory by rules
  • protects data through access isolation and write rules

FAQ

Q: Is it enough to pass full history into prompt?
A: For short scenarios, sometimes yes. But in long dialogues it is expensive, slow, and noisy. Memory Layer supplies short, relevant memory instead of the full log.

Q: How is short-term memory different from long-term memory?
A: Short-term is needed for current run or session. Long-term stores important facts across sessions and is reused later.

Q: Can we write every agent step into memory?
A: Technically yes, but it is poor practice. Better store only useful facts by write rules; otherwise memory quickly turns into noise.

Q: Does Memory Layer replace RAG or tool calls?
A: No. Memory Layer stores internal facts and agent/user experience. For fresh external data, retrieval or tool calls are usually required.

What Next

Memory is useful only when it stays controlled. Next, see where memory connects to execution and policy, starting with Agent Runtime and the Tool Execution Layer.

⏱️ 9 min read • Updated March 7, 2026 • Difficulty: ★★★
Author

This documentation is curated and maintained by engineers who ship AI agents in production.

The content is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Patterns and recommendations are grounded in post-mortems, failure modes, and operational incidents in deployed systems, including during the development and operation of governance infrastructure for agents at OnceOnly.