Training MEM1
We train MEM1 via a reinforcement learning framework that integrates a specialized attention masking scheme tailored for dynamic memory, a custom multi-turn rollout mechanism for iterative context compression and memory consolidation, and outcome-based reward assignment.
- RL Pipeline (top): MEM1 interacts with an environment over multiple turns. At each turn, the agent issues a Query, updates its Internal State (IS), and receives new Info. Rewards are computed from the final answer and used to train both the Actor (policy) and Critic (value) models with RL objectives (see the rollout sketch after this list).
- Memory Consolidation (bottom left): MEM1 keeps context size constant by pruning old states after each turn. All prior knowledge is compressed into the latest Internal State, so only that state, the current query, and the newly returned info remain in context. This design forces the agent to continuously integrate, update, and compress knowledge at every step (see the pruning sketch after this list).
- Masked Policy Optimization (bottom right): During training, a 2D attention mask ensures that each new internal state can attend only to the question and the most recent context; earlier turns are masked out. An Info Mask further restricts gradient updates to tokens generated by the model, excluding environment feedback (see the mask sketch after this list).
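The per-turn loop and outcome-based reward can be sketched as follows. This is a minimal illustration under stated assumptions, not the MEM1 implementation: `Turn`, `policy_generate`, `env_step`, and `grade` are hypothetical stand-ins for the LLM call, the task environment, and the answer verifier.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Turn:
    internal_state: str  # consolidated memory (IS) produced this turn
    query: str           # action issued to the environment
    info: str            # environment feedback returned for the query


def policy_generate(context: str) -> Tuple[str, str]:
    """Placeholder for the LLM call: returns (new internal state, next query)."""
    return f"consolidated: {context[:40]}", "next search query"


def env_step(query: str) -> str:
    """Placeholder environment: returns retrieved info for a query."""
    return f"retrieved info for '{query}'"


def grade(answer: str) -> bool:
    """Placeholder verifier for the final answer (e.g., exact match in QA)."""
    return "consolidated" in answer


def rollout(question: str, max_turns: int = 4) -> Tuple[List[Turn], float]:
    """Run one multi-turn episode and return the trajectory and its reward."""
    turns: List[Turn] = []
    prev: Optional[Turn] = None
    for _ in range(max_turns):
        # Bounded context: the question plus only the most recent turn.
        context = question if prev is None else "\n".join(
            [question, prev.internal_state, prev.query, prev.info]
        )
        internal_state, query = policy_generate(context)
        info = env_step(query)
        prev = Turn(internal_state, query, info)
        turns.append(prev)
    # Outcome-based reward: score only the final answer, then use that scalar
    # to update the actor and critic over the whole trajectory.
    reward = 1.0 if grade(turns[-1].internal_state) else 0.0
    return turns, reward
```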
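A minimal sketch of the pruning rule, contrasted with naive history accumulation. The `pruned_context` and `accumulated_context` helpers and the dictionary-based turn format are illustrative assumptions, not the paper's prompt format.

```python
from typing import Dict, List


def pruned_context(question: str, history: List[Dict[str, str]]) -> str:
    """MEM1-style context: question + only the latest IS, query, and info."""
    if not history:
        return question
    last = history[-1]
    return "\n".join([question, last["is"], last["query"], last["info"]])


def accumulated_context(question: str, history: List[Dict[str, str]]) -> str:
    """Naive baseline: question + every turn ever taken (grows linearly)."""
    parts = [question]
    for t in history:
        parts += [t["is"], t["query"], t["info"]]
    return "\n".join(parts)


if __name__ == "__main__":
    history: List[Dict[str, str]] = []
    for k in range(1, 6):
        history.append({"is": f"IS_{k}" * 10, "query": f"q_{k}", "info": f"info_{k}" * 10})
        print(k, len(pruned_context("Q?", history)), len(accumulated_context("Q?", history)))
    # The pruned length stays roughly flat across turns, while the
    # accumulated length grows with every additional turn.
```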
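The two masks can be sketched as boolean arrays over token positions. The segment layout, turn indexing, and NumPy construction below are assumptions made for illustration; MEM1's actual mask construction may differ in detail.

```python
import numpy as np

# (segment name, turn index, length in tokens, generated by the model?)
segments = [
    ("question", 0, 8, False),
    ("is_1", 1, 6, True), ("query_1", 1, 4, True), ("info_1", 1, 10, False),
    ("is_2", 2, 6, True), ("query_2", 2, 4, True), ("info_2", 2, 10, False),
]

# Per-token bookkeeping: which turn each token belongs to and whether the
# model generated it.
turn_of, model_tok = [], []
for _name, turn, length, generated in segments:
    turn_of += [turn] * length
    model_tok += [generated] * length
turn_of = np.array(turn_of)
model_tok = np.array(model_tok)
n = len(turn_of)

# 2D attention mask: token i may attend to token j iff j is causal (j <= i)
# and j belongs to the question, the same turn as i, or the immediately
# preceding turn; everything older is masked out.
i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
causal = j <= i
recent = (turn_of[j] == 0) | (turn_of[j] == turn_of[i]) | (turn_of[j] == turn_of[i] - 1)
attention_mask = causal & recent          # shape (n, n), True = visible

# Info mask: compute the policy-gradient loss only on tokens the model
# generated (internal states and queries), never on environment feedback.
loss_mask = model_tok.astype(np.float32)  # shape (n,), 1.0 = trainable token
```

Zeroing the loss on environment tokens keeps the policy gradient focused on the agent's own consolidation and querying behavior rather than on text it did not produce.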
Key innovation: MEM1 learns to treat “what to remember” as part of its reasoning. This allows strong multi-turn performance with constant memory, enabling efficient long-horizon reasoning.