Adversarial eval

Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, Yanlin Wang · 2024 · Proceedings of the AAAI Conference on Artificial Intelligence · DOI 10.1609/aaai.v38i17.29946

30 Pith papers cite this work, alongside 124 external citations (Crossref). Polarity classification is still indexing.

citation-role summary

background 4

citation-polarity summary

background 3 · support 1

representative citing papers

Self-GC: Self-Governing Context for Long-Horizon LLM Agents

cs.AI · 2026-07-01 · unverdicted · novelty 7.0

Self-GC governs agent context as indexed objects with planner-proposed actions, achieving 84.85% no-impact on future continuations on a hard set versus 54-70% for baselines.

MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs

cs.AI · 2026-05-08 · unverdicted · novelty 7.0 · 3 refs

MemQ improves LLM agent performance by using eligibility traces over provenance DAGs to assign credit to dependent memories, achieving top success rates on six benchmarks with largest gains on complex multi-step tasks.

The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.

OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory

cs.CL · 2026-04-29 · unverdicted · novelty 7.0

OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.

Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation

cs.CL · 2026-02-02 · unverdicted · novelty 7.0

xMemory builds revisable hierarchical agent memory by segmenting histories, decoupling into components, and aggregating via sparsity-semantic objective, yielding better answer quality and lower token use than flat RAG on LoCoMo and PerLTQA.

ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory

cs.AI · 2025-09-29 · conditional · novelty 7.0

ReasoningBank distills generalizable reasoning strategies from agent successes and failures to enable self-evolution, with memory-aware test-time scaling amplifying gains over raw-trajectory or success-only memory on web and software benchmarks.

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

cs.CL · 2024-10-14 · unverdicted · novelty 7.0

LongMemEval benchmarks long-term memory in chat assistants, revealing 30% accuracy drops across sustained interactions and proposing indexing-retrieval-reading optimizations that boost performance.

WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents

cs.AI · 2026-06-17 · unverdicted · novelty 6.0

The paper creates the WorldLines benchmark for long-horizon embodied household tasks and proposes ObsMem as an observer-grounded memory architecture that maintains visibility-aware state trails.

Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents

cs.AI · 2026-06-04 · unverdicted · novelty 6.0

MRAgent combines a Cue-Tag-Content associative graph with active reconstruction to enable dynamic memory access in LLM agents, reporting up to 23% gains on long-memory benchmarks with lower token costs.

AURA: Intent-Directed Probing for Implicit-Need Surfacing in Situated LLM Agents

cs.CL · 2026-06-04 · unverdicted · novelty 6.0

AURA improves implicit-need coverage by 0.07 over ReAct baselines on a 100-query benchmark by inserting an intent inference step controlled by a gap score, while cutting probes 82% on factual tasks.

Memory Shot for Long-Term Dialogue

cs.IR · 2026-05-30 · unverdicted · novelty 6.0

MemShot renders local dialogue spans as structured visual memory units to improve long-term dialogue modeling in LLMs, achieving competitive benchmark performance with 70x faster memory construction.

Eywa: Provenance-Grounded Long-Term Memory for AI Agents

cs.CL · 2026-05-29 · unverdicted · novelty 6.0

Eywa introduces a provenance-grounded memory system for persistent AI agents featuring evidence-first storage, typed validation, and deterministic multi-route retrieval, reporting 90.19% accuracy on LoCoMo and 88.2% on LongMemEval-S.

DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA

cs.CL · 2026-05-21 · unverdicted · novelty 6.0

DeferMem decouples memory QA into high-recall retrieval and RL-based query-conditioned evidence distillation, outperforming baselines on LoCoMo and LongMemEval-S with highest accuracy, fastest runtime, and zero API token cost.

H-Mem: A Novel Memory Mechanism for Evolving and Retrieving Agent Memory via a Hybrid Structure

cs.CL · 2026-05-15 · unverdicted · novelty 6.0

H-Mem introduces a hybrid tree-plus-graph memory mechanism that evolves short-term agent memories into long-term summaries and enables efficient retrieval, reporting state-of-the-art QA results on three benchmarks.

Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems

cs.AI · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

Goal-Mem decomposes user goals into subgoals for targeted memory retrieval using Natural Language Logic, improving performance on multi-hop reasoning tasks in conversational agents.

PREPING: Building Agent Memory without Tasks

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

Preping builds agent memory via proposer-guided synthetic practice and selective validation, matching offline/online methods at 2-3x lower deployment cost.

Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture

cs.SE · 2026-05-02 · unverdicted · novelty 6.0

RL Developer Memory is a feedback-normalized, safety-gated memory architecture for RL coding agents that logs contextual decisions and applies conservative off-policy gates to maintain 80% decision accuracy and full hard-negative suppression on a 200-case benchmark.

MemRouter: Memory-as-Embedding Routing for Long-Term Conversational Agents

cs.CL · 2026-05-01 · unverdicted · novelty 6.0

A lightweight supervised router using frozen-LLM embeddings for memory admission decisions outperforms LLM-based memory managers in both F1 score and latency on the LoCoMo benchmark.

CL-bench Life: Can Language Models Learn from Real-Life Context?

cs.CL · 2026-04-29 · unverdicted · novelty 6.0

CL-bench Life shows frontier language models achieve only 13.8% average success on real-life context tasks, with the best model at 19.3%.

EpiCache: Episodic KV Cache Management for Long-Term Conversation on Resource-Constrained Environments

cs.CL · 2025-09-22 · unverdicted · novelty 6.0

EpiCache clusters long conversation history into coherent episodes for per-episode KV cache eviction, delivering up to 30% accuracy gains and 3.7x peak memory reduction on LongConvQA tasks under fixed budgets.

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

cs.AI · 2025-09-02 · accept · novelty 6.0

Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.

CoreMem: Riemannian Retrieval and Fisher-Guided Distillation for Long-Term Memory in Dialogue Agents

cs.CL · 2026-06-16 · unverdicted · novelty 5.0

CoreMem replaces cosine retrieval with Fisher-Rao Riemannian matching and introduces Fisher-guided discrete token distillation for syntax-aware compression, reporting +4.51 pp open-domain and +4.17 pp temporal gains on LOCOMO and LongMemEval-S while staying inside an 8 GB VRAM budget.

G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents

cs.CL · 2026-06-11 · unverdicted · novelty 5.0

G-Long uses graph-enhanced triplet memory and attention-aware scoring from a T5 summarizer to achieve up to 9.8% better response quality on MSC and 40.8% better retrieval recall on LME with lower overhead.

ActiveMem: Distributed Active Memory for Long-Horizon LLM Reasoning

cs.AI · 2026-06-09 · unverdicted · novelty 5.0

ActiveMem proposes a heterogeneous distributed memory framework for LLM agents that separates planning from active memory management, reporting SOTA accuracy with lower overhead on BrowseComp-Plus and GAIA.

citing papers explorer

Showing 30 of 30 citing papers.

Self-GC: Self-Governing Context for Long-Horizon LLM Agents

cs.AI · 2026-07-01 · unverdicted · none · ref 10

Self-GC governs agent context as indexed objects with planner-proposed actions, achieving 84.85% no-impact on future continuations on a hard set versus 54-70% for baselines.

MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs

cs.AI · 2026-05-08 · unverdicted · none · ref 10 · 3 links

MemQ improves LLM agent performance by using eligibility traces over provenance DAGs to assign credit to dependent memories, achieving top success rates on six benchmarks with largest gains on complex multi-step tasks.

The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment

cs.CL · 2026-05-08 · unverdicted · none · ref 141

An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.

OCR-Memory: Optical Context Retrieval for Long-Horizon Agent Memory

cs.CL · 2026-04-29 · unverdicted · none · ref 30

OCR-Memory encodes agent trajectories as images with visual anchors and retrieves verbatim text via locate-and-transcribe, yielding gains on long-horizon benchmarks under strict context limits.

Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation

cs.CL · 2026-02-02 · unverdicted · none · ref 4

xMemory builds revisable hierarchical agent memory by segmenting histories, decoupling into components, and aggregating via sparsity-semantic objective, yielding better answer quality and lower token use than flat RAG on LoCoMo and PerLTQA.

ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory

cs.AI · 2025-09-29 · conditional · none · ref 29

ReasoningBank distills generalizable reasoning strategies from agent successes and failures to enable self-evolution, with memory-aware test-time scaling amplifying gains over raw-trajectory or success-only memory on web and software benchmarks.

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

cs.CL · 2024-10-14 · unverdicted · none · ref 108

LongMemEval benchmarks long-term memory in chat assistants, revealing 30% accuracy drops across sustained interactions and proposing indexing-retrieval-reading optimizations that boost performance.

WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents

cs.AI · 2026-06-17 · unverdicted · none · ref 12

The paper creates the WorldLines benchmark for long-horizon embodied household tasks and proposes ObsMem as an observer-grounded memory architecture that maintains visibility-aware state trails.

Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents

cs.AI · 2026-06-04 · unverdicted · none · ref 63

MRAgent combines a Cue-Tag-Content associative graph with active reconstruction to enable dynamic memory access in LLM agents, reporting up to 23% gains on long-memory benchmarks with lower token costs.

AURA: Intent-Directed Probing for Implicit-Need Surfacing in Situated LLM Agents

cs.CL · 2026-06-04 · unverdicted · none · ref 27

AURA improves implicit-need coverage by 0.07 over ReAct baselines on a 100-query benchmark by inserting an intent inference step controlled by a gap score, while cutting probes 82% on factual tasks.

Memory Shot for Long-Term Dialogue

cs.IR · 2026-05-30 · unverdicted · none · ref 48

MemShot renders local dialogue spans as structured visual memory units to improve long-term dialogue modeling in LLMs, achieving competitive benchmark performance with 70x faster memory construction.

Eywa: Provenance-Grounded Long-Term Memory for AI Agents

cs.CL · 2026-05-29 · unverdicted · none · ref 25

Eywa introduces a provenance-grounded memory system for persistent AI agents featuring evidence-first storage, typed validation, and deterministic multi-route retrieval, reporting 90.19% accuracy on LoCoMo and 88.2% on LongMemEval-S.

DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA

cs.CL · 2026-05-21 · unverdicted · none · ref 59

DeferMem decouples memory QA into high-recall retrieval and RL-based query-conditioned evidence distillation, outperforming baselines on LoCoMo and LongMemEval-S with highest accuracy, fastest runtime, and zero API token cost.

H-Mem: A Novel Memory Mechanism for Evolving and Retrieving Agent Memory via a Hybrid Structure

cs.CL · 2026-05-15 · unverdicted · none · ref 12

H-Mem introduces a hybrid tree-plus-graph memory mechanism that evolves short-term agent memories into long-term summaries and enables efficient retrieval, reporting state-of-the-art QA results on three benchmarks.

Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems

cs.AI · 2026-05-12 · unverdicted · none · ref 42 · 2 links

Goal-Mem decomposes user goals into subgoals for targeted memory retrieval using Natural Language Logic, improving performance on multi-hop reasoning tasks in conversational agents.

PREPING: Building Agent Memory without Tasks

cs.AI · 2026-05-11 · unverdicted · none · ref 39

Preping builds agent memory via proposer-guided synthetic practice and selective validation, matching offline/online methods at 2-3x lower deployment cost.

Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture

cs.SE · 2026-05-02 · unverdicted · none · ref 39

RL Developer Memory is a feedback-normalized, safety-gated memory architecture for RL coding agents that logs contextual decisions and applies conservative off-policy gates to maintain 80% decision accuracy and full hard-negative suppression on a 200-case benchmark.

MemRouter: Memory-as-Embedding Routing for Long-Term Conversational Agents

cs.CL · 2026-05-01 · unverdicted · none · ref 26

A lightweight supervised router using frozen-LLM embeddings for memory admission decisions outperforms LLM-based memory managers in both F1 score and latency on the LoCoMo benchmark.

CL-bench Life: Can Language Models Learn from Real-Life Context?

cs.CL · 2026-04-29 · unverdicted · none · ref 80

CL-bench Life shows frontier language models achieve only 13.8% average success on real-life context tasks, with the best model at 19.3%.

EpiCache: Episodic KV Cache Management for Long-Term Conversation on Resource-Constrained Environments

cs.CL · 2025-09-22 · unverdicted · none · ref 43

EpiCache clusters long conversation history into coherent episodes for per-episode KV cache eviction, delivering up to 30% accuracy gains and 3.7x peak memory reduction on LongConvQA tasks under fixed budgets.

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

cs.AI · 2025-09-02 · accept · none · ref 130

Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.

CoreMem: Riemannian Retrieval and Fisher-Guided Distillation for Long-Term Memory in Dialogue Agents

cs.CL · 2026-06-16 · unverdicted · none · ref 32

CoreMem replaces cosine retrieval with Fisher-Rao Riemannian matching and introduces Fisher-guided discrete token distillation for syntax-aware compression, reporting +4.51 pp open-domain and +4.17 pp temporal gains on LOCOMO and LongMemEval-S while staying inside an 8 GB VRAM budget.

G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents

cs.CL · 2026-06-11 · unverdicted · none · ref 2

G-Long uses graph-enhanced triplet memory and attention-aware scoring from a T5 summarizer to achieve up to 9.8% better response quality on MSC and 40.8% better retrieval recall on LME with lower overhead.

ActiveMem: Distributed Active Memory for Long-Horizon LLM Reasoning

cs.AI · 2026-06-09 · unverdicted · none · ref 34

ActiveMem proposes a heterogeneous distributed memory framework for LLM agents that separates planning from active memory management, reporting SOTA accuracy with lower overhead on BrowseComp-Plus and GAIA.

How LoRA Remembers? A Parametric Memory Law for LLM Finetuning

cs.CL · 2026-05-28 · unverdicted · none · ref 45

Introduces Parametric Memory Law as power law for LoRA memory capacity and MemFT threshold-guided optimization for better memory fidelity.

LLM-Oriented Information Retrieval: A Denoising-First Perspective

cs.IR · 2026-05-01 · unverdicted · none · ref 233 · 2 links

Argues for a denoising-first paradigm in LLM-oriented information retrieval, framing challenges via a four-stage progression and providing a taxonomy of signal-to-noise optimization techniques across the pipeline.

Would You Marry Superintelligence?

cs.CY · 2026-06-30 · unverdicted · none · ref 5

Granting marital status to superintelligent AI leads to unjust outcomes; targeted legal protections for human-AI relationships are preferable.

Position: Hippocampal Explicit Memory Is the Cornerstone for AGI

cs.AI · 2026-06-05 · unverdicted · none · ref 132

Explicit memory modeled on the hippocampus is the cornerstone needed to advance LLMs to AGI because their implicit statistical learning cannot produce higher cognitive functions.

Rethinking Agentic Reinforcement Learning In Large Language Models

cs.AI · 2026-04-30 · unverdicted · none · ref 129 · 3 links

The paper reviews conceptual foundations, methodological innovations, effective designs, critical challenges, and future directions for LLM-based Agentic Reinforcement Learning.

Towards trustworthy agentic AI: a comprehensive survey of safety, robustness, privacy, and system security

cs.AI · 2026-05-17 · unverdicted · none · ref 200

A survey that maps risks along the agent workflow and consolidates metrics and benchmarks for safety, robustness, privacy, and security in agentic AI.
