Training MEM1
We train MEM1 via a reinforcement learning framework that integrates a specialized attention masking scheme tailored for dynamic memory, a custom multi-turn rollout mechanism for iterative context compression and memory consolidation, and outcome-based reward assignment.
- RL Pipeline (top): MEM1 interacts with an environment over multiple turns. At each turn, the agent issues a Query, updates its Internal State (IS), and receives new Info. Rewards are computed from the final answer and used to train both the Actor (policy) and Critic (value) models with RL objectives (see the rollout sketch after this list).
- Memory Consolidation (bottom left): MEM1 keeps context size constant by pruning old states after each turn. All prior knowledge is compressed into the latest Internal State, so only that state, the current query, and the newly returned info remain in context. This design forces the agent to continuously integrate, update, and compress knowledge at every step (see the pruning sketch after this list).
- Masked Policy Optimization (bottom right): During training, a 2D attention mask ensures that each new internal state can attend only to the question and the most recent context; earlier turns are masked out. An Info Mask further restricts gradient updates to tokens generated by the model, excluding environment feedback (see the mask sketch after this list).
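The per-turn loop and outcome-based reward can be sketched as follows. This is a minimal illustration under stated assumptions, not the MEM1 implementation: `Turn`, `policy_generate`, `env_step`, and `grade` are hypothetical stand-ins for the LLM call, the task environment, and the answer verifier.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Turn:
    internal_state: str  # consolidated memory (IS) produced this turn
    query: str           # action issued to the environment
    info: str            # environment feedback returned for the query


def policy_generate(context: str) -> Tuple[str, str]:
    """Placeholder for the LLM call: returns (new internal state, next query)."""
    return f"consolidated: {context[:40]}", "next search query"


def env_step(query: str) -> str:
    """Placeholder environment: returns retrieved info for a query."""
    return f"retrieved info for '{query}'"


def grade(answer: str) -> bool:
    """Placeholder verifier for the final answer (e.g., exact match in QA)."""
    return "consolidated" in answer


def rollout(question: str, max_turns: int = 4) -> Tuple[List[Turn], float]:
    """Run one multi-turn episode and return the trajectory and its reward."""
    turns: List[Turn] = []
    prev: Optional[Turn] = None
    for _ in range(max_turns):
        # Bounded context: the question plus only the most recent turn.
        context = question if prev is None else "\n".join(
            [question, prev.internal_state, prev.query, prev.info]
        )
        internal_state, query = policy_generate(context)
        info = env_step(query)
        prev = Turn(internal_state, query, info)
        turns.append(prev)
    # Outcome-based reward: score only the final answer, then use that scalar
    # to update the actor and critic over the whole trajectory.
    reward = 1.0 if grade(turns[-1].internal_state) else 0.0
    return turns, reward
```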
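A minimal sketch of the pruning rule, contrasted with naive history accumulation. The `pruned_context` and `accumulated_context` helpers and the dictionary-based turn format are illustrative assumptions, not the paper's prompt format.

```python
from typing import Dict, List


def pruned_context(question: str, history: List[Dict[str, str]]) -> str:
    """MEM1-style context: question + only the latest IS, query, and info."""
    if not history:
        return question
    last = history[-1]
    return "\n".join([question, last["is"], last["query"], last["info"]])


def accumulated_context(question: str, history: List[Dict[str, str]]) -> str:
    """Naive baseline: question + every turn ever taken (grows linearly)."""
    parts = [question]
    for t in history:
        parts += [t["is"], t["query"], t["info"]]
    return "\n".join(parts)


if __name__ == "__main__":
    history: List[Dict[str, str]] = []
    for k in range(1, 6):
        history.append({"is": f"IS_{k}" * 10, "query": f"q_{k}", "info": f"info_{k}" * 10})
        print(k, len(pruned_context("Q?", history)), len(accumulated_context("Q?", history)))
    # The pruned length stays roughly flat across turns, while the
    # accumulated length grows with every additional turn.
```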
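The two masks can be sketched as boolean arrays over token positions. The segment layout, turn indexing, and NumPy construction below are assumptions made for illustration; MEM1's actual mask construction may differ in detail.

```python
import numpy as np

# (segment name, turn index, length in tokens, generated by the model?)
segments = [
    ("question", 0, 8, False),
    ("is_1", 1, 6, True), ("query_1", 1, 4, True), ("info_1", 1, 10, False),
    ("is_2", 2, 6, True), ("query_2", 2, 4, True), ("info_2", 2, 10, False),
]

# Per-token bookkeeping: which turn each token belongs to and whether the
# model generated it.
turn_of, model_tok = [], []
for _name, turn, length, generated in segments:
    turn_of += [turn] * length
    model_tok += [generated] * length
turn_of = np.array(turn_of)
model_tok = np.array(model_tok)
n = len(turn_of)

# 2D attention mask: token i may attend to token j iff j is causal (j <= i)
# and j belongs to the question, the same turn as i, or the immediately
# preceding turn; everything older is masked out.
i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
causal = j <= i
recent = (turn_of[j] == 0) | (turn_of[j] == turn_of[i]) | (turn_of[j] == turn_of[i] - 1)
attention_mask = causal & recent          # shape (n, n), True = visible

# Info mask: compute the policy-gradient loss only on tokens the model
# generated (internal states and queries), never on environment feedback.
loss_mask = model_tok.astype(np.float32)  # shape (n,), 1.0 = trainable token
```

Zeroing the loss on environment tokens keeps the policy gradient focused on the agent's own consolidation and querying behavior rather than on text it did not produce.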
Key innovation: MEM1 learns to treat “what to remember” as part of its reasoning. This allows strong multi-turn performance with constant memory, enabling efficient long-horizon reasoning.