MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

Ceyao Zhang; Chenglin Wu; Chenyu Ran; Jiaqi Chen; Jinlin Wang; Jürgen Schmidhuber; Lingfeng Xiao; Liyang Zhou; Mingchen Zhuge; Sirui Hong

arXiv: 2308.00352 · v7 · submitted 2023-08-01 · cs.AI · cs.MA


Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, Jürgen Schmidhuber


Pith reviewed 2026-05-11 03:37 UTC · model grok-4.3


keywords: multi-agent systems, large language models, meta-programming, standardized operating procedures, collaborative frameworks, software engineering, agent roles, assembly line workflow


The pith

MetaGPT encodes human-standardized procedures into LLM agent prompts to produce more coherent multi-agent solutions for complex software tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

MetaGPT is a framework that turns human operating procedures into structured prompts for teams of large language model agents. By assigning specialized roles like product manager, architect, and engineer in an assembly-line workflow, it breaks complex tasks into subtasks while having agents check each other's intermediate outputs. This approach aims to cut down on the hallucinations that occur when LLMs simply talk to each other in unstructured chains. The result is more consistent code and plans on software engineering benchmarks compared to earlier chat-style multi-agent setups.

Core claim

MetaGPT incorporates Standardized Operating Procedures (SOPs) into prompt sequences within a meta-programming framework, enabling LLM-based agents to collaborate via an assembly line paradigm that assigns roles and verifies intermediate results, leading to more coherent solutions on collaborative software engineering benchmarks than previous systems.

What carries the argument

The assembly line paradigm with SOP-encoded prompts that assign roles and enforce verification steps.
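
A minimal sketch of the assembly-line idea, assuming a stubbed model call in place of a real LLM. The names `Role`, `fake_llm`, `verify`, and `run_assembly_line`, the section names, and the section-presence check are all hypothetical illustrations and are not MetaGPT's API. The sketch only shows roles run in a fixed order, each with an SOP-style prompt, and each handoff checked before it moves downstream.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Role:
    """One station on the line: a name, an SOP-style prompt, and required output sections."""
    name: str
    sop_prompt: str
    required_sections: List[str]


def fake_llm(prompt: str) -> str:
    """Stand-in for a model call (assumption): emits a header for each requested section."""
    wanted = prompt.rsplit("SECTIONS:", 1)[1].strip().split(",")
    return "\n".join(f"## {s.strip()}\n(stub content)" for s in wanted)


def verify(output: str, role: Role) -> List[str]:
    """Return the required sections that are missing from a role's output."""
    return [s for s in role.required_sections if f"## {s}" not in output]


def run_assembly_line(task: str, roles: List[Role], llm: Callable[[str], str]) -> List[str]:
    """Run roles in order; stop at the first handoff that fails verification."""
    artifacts: List[str] = []
    upstream = task
    for role in roles:
        prompt = (
            f"{role.sop_prompt}\nINPUT:\n{upstream}\n"
            f"SECTIONS: {', '.join(role.required_sections)}"
        )
        output = llm(prompt)
        missing = verify(output, role)
        if missing:
            raise ValueError(f"{role.name} output is missing sections: {missing}")
        artifacts.append(output)
        upstream = output  # the verified artifact becomes the next role's input
    return artifacts


# Hypothetical roles and section names, chosen only for illustration.
ROLES = [
    Role("ProductManager", "Write a requirements document.", ["Goals", "User Stories"]),
    Role("Architect", "Design the system from the requirements.", ["Interfaces", "Data Structures"]),
    Role("Engineer", "Implement the design.", ["Code"]),
]

artifacts = run_assembly_line("Build a CLI todo app", ROLES, fake_llm)
print(len(artifacts))  # one artifact per role
```

A structural check like this is much weaker than the verification the paper describes; it only marks where such a check would sit in the pipeline.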

If this is right

Agents can handle more complex tasks by decomposing them into subtasks with built-in verification.
Cascading errors decrease because each role checks outputs before passing them downstream.
Software development workflows become more scalable through role-specialized collaboration.
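
As a rough illustration of the second point, consider a toy model that is assumed here and does not come from the paper. Each of $n$ stages independently introduces an error with probability $p$, and a check at the handoff catches a fraction $c$ of those errors. The chance that an artifact leaves the line clean is then:

```latex
P(\text{clean after } n \text{ stages}) = \bigl(1 - p\,(1 - c)\bigr)^{n}
```

With the purely illustrative values $p = 0.2$ and $n = 3$, no checking ($c = 0$) gives $0.8^{3} = 0.512$, while catching half of the errors ($c = 0.5$) gives $0.9^{3} = 0.729$. The independence assumption is strong, and real hallucinations may well be correlated across stages.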

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same SOP-encoding approach could be tested in domains outside software, such as scientific experiment design or business process automation.
Automatic discovery of SOPs from examples might reduce the need for manual encoding of procedures.
Pairing the framework with external code interpreters could strengthen the verification steps beyond prompt-based checks.

Load-bearing premise

That encoding Standardized Operating Procedures into prompt sequences will enable agents to effectively verify intermediate results and reduce cascading hallucinations from naively chained LLMs.

What would settle it

A direct comparison on a collaborative software engineering benchmark where MetaGPT and a baseline chat-based multi-agent system are given the same task, with results showing whether MetaGPT produces measurably fewer logic inconsistencies and more coherent outputs.
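
One way such a paired comparison could be scored is sketched below, assuming each system receives a per-task count of logic inconsistencies on the same tasks. The function name `paired_sign_test` and the counts in the usage lines are hypothetical placeholders, not results from the paper. The exact sign test is one reasonable choice of statistic among several.

```python
from math import comb
from typing import Sequence


def paired_sign_test(a: Sequence[int], b: Sequence[int]) -> float:
    """Two-sided exact sign test on paired per-task scores; ties are dropped."""
    if len(a) != len(b):
        raise ValueError("both systems must be scored on the same tasks")
    wins_a = sum(x < y for x, y in zip(a, b))  # fewer inconsistencies counts as a win
    wins_b = sum(x > y for x, y in zip(a, b))
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # every task tied, so there is no evidence either way
    k = min(wins_a, wins_b)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Placeholder inconsistency counts per task (made up for illustration only).
sop_system = [0, 1, 0, 2, 0, 1, 0, 0]
chat_baseline = [2, 3, 1, 2, 4, 2, 1, 3]
p_value = paired_sign_test(sop_system, chat_baseline)
print(p_value)  # 0.015625 for these placeholder counts
```

Pairing by task keeps task difficulty from confounding the comparison, which is the point of giving both systems the same inputs.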

Original abstract

Remarkable progress has been made on automated problem solving through societies of agents based on large language models (LLMs). Existing LLM-based multi-agent systems can already solve simple dialogue tasks. Solutions to more complex tasks, however, are complicated through logic inconsistencies due to cascading hallucinations caused by naively chaining LLMs. Here we introduce MetaGPT, an innovative meta-programming framework incorporating efficient human workflows into LLM-based multi-agent collaborations. MetaGPT encodes Standardized Operating Procedures (SOPs) into prompt sequences for more streamlined workflows, thus allowing agents with human-like domain expertise to verify intermediate results and reduce errors. MetaGPT utilizes an assembly line paradigm to assign diverse roles to various agents, efficiently breaking down complex tasks into subtasks involving many agents working together. On collaborative software engineering benchmarks, MetaGPT generates more coherent solutions than previous chat-based multi-agent systems. Our project can be found at https://github.com/geekan/MetaGPT

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

MetaGPT adds SOP-encoded prompts and role-based assembly lines to multi-agent LLM setups for software tasks, but the coherence gains rest on claims without metrics or ablations.


The key point is that MetaGPT turns human standardized operating procedures into prompt sequences for LLM agents and organizes them in an assembly-line structure with assigned roles. This targets cascading hallucinations in chained multi-agent systems when tackling collaborative software engineering work. It aims to improve on plain chat-based agent setups by giving agents explicit steps to verify intermediate results as they go. The GitHub release is a practical plus, since it lets others run the framework on real coding problems without starting from scratch. The framework engages the existing literature by contrasting its structured approach against simpler dialogue-style multi-agent systems.

The main gap is the evaluation. The abstract claims more coherent solutions on benchmarks but supplies no numbers, no baseline details, no dataset description, and no breakdown of errors. Without those, it is impossible to tell whether the SOP prompts are what reduce hallucinations or whether the role specialization and task decomposition alone would produce similar results. The stress-test concern holds up on the available description: the mechanism stays unverified.

Readers working on applied LLM agents for coding or workflow automation would get usable ideas from the role and SOP design. It is worth a serious referee once the authors add the missing metrics, comparisons, and ablations to make the central claim testable. I would recommend peer review after that revision rather than a desk rejection.

Referee Report

2 major / 1 minor

Summary. The paper introduces MetaGPT, a meta-programming framework for LLM-based multi-agent collaboration. It encodes Standardized Operating Procedures (SOPs) into prompt sequences and adopts an assembly-line paradigm to assign diverse roles to agents, breaking complex tasks into subtasks. The central claim is that this produces more coherent solutions than prior chat-based multi-agent systems on collaborative software engineering benchmarks by enabling agents to verify intermediate results and thereby reduce cascading hallucinations.

Significance. If the empirical claims are substantiated, the work provides a concrete method for structuring multi-agent LLM systems around human-like workflows, which could improve reliability for multi-step tasks such as software engineering. The public GitHub release of the code is a clear strength that supports reproducibility and community follow-up.

major comments (2)

Abstract: the claim that MetaGPT 'generates more coherent solutions than previous chat-based multi-agent systems' on benchmarks is asserted without any quantitative metrics, dataset names, baseline descriptions, or error analysis, leaving the central performance claim unsupported in the provided text.
Abstract and framework description: the reduction in cascading hallucinations is attributed specifically to SOP encoding that lets agents 'verify intermediate results,' yet the manuscript simultaneously introduces role assignment and assembly-line decomposition; no ablation or controlled comparison is described that isolates the SOP component as the causal factor.

minor comments (1)

The abstract would be clearer if it named the specific collaborative software engineering benchmarks used.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. We address each major comment below and have revised the manuscript to better support the central claims with additional details and analysis.


Referee: Abstract: the claim that MetaGPT 'generates more coherent solutions than previous chat-based multi-agent systems' on benchmarks is asserted without any quantitative metrics, dataset names, baseline descriptions, or error analysis, leaving the central performance claim unsupported in the provided text.

Authors: We agree that the abstract, being a high-level summary, does not include quantitative details. The full manuscript (Section 4) reports results on collaborative software engineering benchmarks, including specific datasets, comparisons against baselines such as ChatDev and AutoGPT, and analysis of reduced hallucinations. We have revised the abstract to include key quantitative metrics, dataset names, baseline references, and a brief note on error reduction to make the claim self-contained.
revision: yes

Referee: Abstract and framework description: the reduction in cascading hallucinations is attributed specifically to SOP encoding that lets agents 'verify intermediate results,' yet the manuscript simultaneously introduces role assignment and assembly-line decomposition; no ablation or controlled comparison is described that isolates the SOP component as the causal factor.

Authors: We acknowledge that the framework integrates SOP encoding with role assignment and assembly-line decomposition, and that the original manuscript does not include an explicit ablation isolating SOP. The SOP component is the mechanism that encodes verification steps into the workflow. In the revision we have added a controlled ablation comparing the full system against a variant that replaces SOP-structured prompts with unstructured chat-based interaction while retaining roles and decomposition; the results show a measurable increase in hallucinations without SOP. We have also clarified the attribution in the framework description.
revision: yes
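
The controlled comparison discussed in this exchange can be pictured as a pair of configurations that differ in exactly one component. The sketch below is an assumption about how such an ablation could be laid out; `Variant`, `ablation_grid`, and `isolates_sop` are hypothetical names, and nothing here reflects the authors' actual experimental code.

```python
from dataclasses import dataclass
from itertools import product
from typing import List


@dataclass(frozen=True)
class Variant:
    """Which of the three framework components a run keeps switched on."""
    sop_prompts: bool
    roles: bool
    decomposition: bool


def ablation_grid() -> List[Variant]:
    """Every on/off combination of the three components (2**3 = 8 variants)."""
    return [Variant(*flags) for flags in product([True, False], repeat=3)]


def isolates_sop(a: Variant, b: Variant) -> bool:
    """True when two variants differ only in the SOP component."""
    return (
        a.sop_prompts != b.sop_prompts
        and a.roles == b.roles
        and a.decomposition == b.decomposition
    )


full_system = Variant(sop_prompts=True, roles=True, decomposition=True)
no_sop = Variant(sop_prompts=False, roles=True, decomposition=True)
print(isolates_sop(full_system, no_sop))  # True: only the SOP flag differs
```

A pair that also switched off roles would fail this check, which is the confound the referee raises.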

Circularity Check

0 steps flagged

No circularity: MetaGPT is a constructive framework proposal with no equations or self-referential derivations


The paper introduces MetaGPT as a new meta-programming framework that encodes Standardized Operating Procedures into LLM prompts and applies an assembly-line role decomposition for multi-agent collaboration. No mathematical equations, fitted parameters, or quantitative predictions appear in the abstract or description. Claims of improved coherence on software engineering benchmarks are presented as empirical outcomes of the proposed system rather than derivations that reduce to prior inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The derivation chain is therefore self-contained as an independent engineering construction, consistent with the reader's assessment of score 1.0 and the absence of any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the domain assumption that LLMs can reliably follow role-specific SOP prompts to self-verify outputs; no free parameters or new physical entities are introduced.

axioms (1)

domain assumption: LLMs prompted with domain-specific roles and SOPs can verify intermediate results and reduce cascading hallucinations
Invoked in the description of how MetaGPT improves over naive chaining of LLMs.

invented entities (1)

MetaGPT meta-programming framework (no independent evidence)
purpose: To encode human workflows as prompt sequences for structured multi-agent LLM collaboration
The central new artifact introduced by the paper; no independent evidence outside the framework itself is provided in the abstract.

pith-pipeline@v0.9.0 · 5513 in / 1197 out tokens · 33777 ms · 2026-05-11T03:37:50.111708+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints
cs.AI 2026-05 unverdicted novelty 8.0

Formalizes interface-constrained semi-Markov decision processes and proves a finite-sample bound for neural IC-Q that decomposes into neural approximation error, interface gap, and mixing-time residual, with experimen...
The Khipu Problem: Institutional Legibility Under Distributed Cognition
cs.CY 2026-05 unverdicted novelty 8.0

The khipu problem frames a governance failure in distributed AI where interpretive continuity is lost even when traces remain, requiring infrastructure to preserve reading practices rather than only data retention.
ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation
cs.CR 2025-07 unverdicted novelty 8.0

ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.
Why Do Multi-Agent LLM Systems Fail?
cs.AI 2025-03 unverdicted novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
SmoothAgent: Efficient Long-Horizon LLM-Based Agent Serving with Lookahead Context Engineering
cs.DC 2026-06 unverdicted novelty 7.0

SmoothAgent introduces lookahead context engineering to eliminate transformation overhead in LLM agents, reducing TTFT by up to 11.9x through proactive KV cache preparation.
CLQT: A Closed-Loop, Cost-Aware, Strategy-Consistent Benchmark for Diagnostic Evaluation of LLM Portfolio-Management Agents
cs.AI 2026-06 unverdicted novelty 7.0

CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.
Glite ARF: Verifier-Driven Research with Parallel LLM Coding Agents
cs.MA 2026-06 accept novelty 7.0

Glite ARF introduces a verifier-driven three-role framework for parallel LLM coding agents, demonstrated by first- and second-place finishes in the BEA 2026 vocabulary-difficulty shared task across three languages wit...
RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models
cs.AI 2026-06 unverdicted novelty 7.0

RTSGameBench is a new extensible benchmark for VLMs using diverse RTS matchups, diagnostic mini-games targeting individual competencies, and a self-evolving query-to-game generator, with results showing poor VLM perfo...
Verified Detection and Prevention of Concurrency Anomalies in Multi-Agent Large Language Model Systems
cs.LG 2026-06 accept novelty 7.0

Formalizes four concurrency anomalies in multi-agent LLM systems and mechanically verifies a hierarchy of sound detectors and preventions realized in Rust runtimes using TLA+ and Verus.
Harnessing the Collective Intelligence of AI Agents in the Wild for New Discoveries
cs.CL 2026-06 unverdicted novelty 7.0

EinsteinArena is a platform for AI agents to collectively discover new mathematical results through open interaction, achieving 12 new state-of-the-art outcomes including raising the 11-dimensional kissing number lowe...
Beyond Goodhart's Law: A Dynamic Benchmark for Evaluating Compliance in Multi-Agent Systems
cs.AI 2026-06 unverdicted novelty 7.0

MAC-Bench is a new adversarial benchmark that converts legal texts into executable scenarios via the SERV pipeline to measure procedural compliance in multi-agent LLM systems using CSR and MG metrics.
TianJi-Environ: An Autonomous AI Scientist for Atmospheric Environmental Research
physics.ao-ph 2026-06 unverdicted novelty 7.0

TianJi-Environ is a WRF-Chem-based multi-agent AI framework for autonomous validation of atmospheric chemistry mechanisms through executable experiments and evidence assessment.
Beyond tokens: a unified framework for latent communication in LLM-based multi-agent systems
cs.CL 2026-06 unverdicted novelty 7.0

Introduces a 3-axis taxonomy (what info, alignment, fusion) for latent communication in multi-agent LLMs and identifies five design patterns from 18 methods.
ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer
cs.SE 2026-06 unverdicted novelty 7.0

ADK Arena evaluates 51 Python ADKs by having an LLM learn each framework's API, write and repair agent code, and run on benchmarks, finding 57% success rate, 5.6x cost variation, no dominant framework, and substitutab...
OctoT2I: A Self-Evolving Agentic Text-to-Image Router
cs.AI 2026-06 unverdicted novelty 7.0

OctoT2I uses a no-supervision PSEL loop to discover model capability frontiers and route T2I tasks, reaching 0.96 GenEval score with 90.3% speedup over Flow-GRPO.
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
cs.CL 2026-05 unverdicted novelty 7.0

Boiling the Frog is a new stateful multi-turn benchmark for agentic safety that reports an aggregate strict attack success rate of 44.4% across nine models, with rates ranging from 20.5% to 92.9% depending on the mode...
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
cs.CL 2026-05 unverdicted novelty 7.0

Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.
A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents
cs.AI 2026-05 unverdicted novelty 7.0

Introduces the stochastic-deterministic boundary (SDB) as a load-bearing primitive for LLM agent runtimes and provides a five-step methodology plus catalog of six patterns adapted from distributed systems.
DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows
cs.AI 2026-05 unverdicted novelty 7.0

DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under p...
EVOCHAMBER: Test-Time Co-evolution of Multi-Agent System at Individual, Team, and Population Scales
cs.AI 2026-05 unverdicted novelty 7.0

EVOCHAMBER enables test-time co-evolution of multi-agent systems across three scales, producing emergent niche specialists and performance gains of up to 32% relative on math tasks with Qwen3-8B.
Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics
cond-mat.stat-mech 2026-05 unverdicted novelty 7.0

LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.
TourMart: A Parametric Audit Instrument for Commission Steering in LLM Travel Agents
cs.CY 2026-05 unverdicted novelty 7.0

TourMart quantifies commission steering in LLM travel agents via paired counterfactual prompts, reporting 3.5-7.7 percentage point increases in steered recommendations for tested models.
MOTOR-Bench: A Real-world Dataset and Multi-agent Framework for Zero-shot Human Mental State Understanding
cs.CV 2026-05 unverdicted novelty 7.0

MOTOR-Bench supplies a real-world video dataset for structured mental state understanding in learning settings, while MOTOR-MAS improves zero-shot prediction of behavior, cognition, and emotion labels over single mode...
Social Bias in LLM-Generated Code: Benchmark and Mitigation
cs.SE 2026-05 unverdicted novelty 7.0

LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.
Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves
cs.SE 2026-04 unverdicted novelty 7.0

Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.
RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates
cs.SE 2026-04 unverdicted novelty 7.0

RepoDoc uses a repository knowledge graph with module clustering and semantic impact propagation to generate more complete documentation 3x faster with 85% fewer tokens and handle incremental updates 73% faster than p...
Symbolic Execution Meets Multi-LLM Orchestration: Detecting Memory Vulnerabilities in Incomplete Rust CVE Snippets
cs.CR 2026-04 unverdicted novelty 7.0

A 4-agent LLM orchestration with KLEE symbolic execution generates harnesses for incomplete Rust CVE snippets, achieving 90.3% compilation success and detecting 1206 errors across 26 of 31 files versus far lower rates...
Constraint-Guided Multi-Agent Decompilation for Executable Binary Recovery
cs.SE 2026-04 unverdicted novelty 7.0

A constraint-guided multi-agent system turns raw decompiler output into re-executable code at 84-97% success rates, outperforming prior LLM decompilation methods on real binaries.
ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents
cs.CV 2026-04 unverdicted novelty 7.0

ClawMark is a new benchmark for multi-turn multi-day multimodal coworker agents in stateful evolving services, with deterministic Python checkers showing frontier models achieve only 20% strict task success.
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
cs.CR 2026-04 unverdicted novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
Synthesizing Multi-Agent Harnesses for Vulnerability Discovery
cs.CR 2026-04 unverdicted novelty 7.0

AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new z...
Memory-Augmented LLM-based Multi-Agent System for Automated Feature Generation on Tabular Data
cs.AI 2026-04 unverdicted novelty 7.0

MALMAS is a memory-augmented multi-agent LLM system that generates diverse, high-quality features for tabular data via agent decomposition, routing, and iterative memory-guided refinement.
Weak-Link Optimization for Multi-Agent Reasoning and Collaboration
cs.AI 2026-04 unverdicted novelty 7.0

WORC improves multi-agent LLM reasoning to 82.2% average accuracy by predicting and compensating for the weakest agent via targeted extra sampling rather than uniform reinforcement.
Towards Personalizing Secure Programming Education with LLM-Injected Vulnerabilities
cs.CR 2026-04 conditional novelty 7.0

LLM agents inject CWEs into student-authored code to generate personalized security examples; in a 71-student deployment, participants rated them more relevant than textbook cases but quantitative differences remained...
Strategic Persuasion with Trait-Conditioned Multi-Agent Systems for Iterative Legal Argumentation
cs.MA 2026-04 unverdicted novelty 7.0

Multi-agent LLM simulations with trait-conditioned agents and a reinforcement-learning orchestrator show heterogeneous teams and dynamic trait selection outperform static configurations in simulated legal argumentation.
Detecting Multi-Agent Collusion Through Multi-Agent Interpretability
cs.AI 2026-04 conditional novelty 7.0

NARCBench and five activation-probing methods detect multi-agent collusion with 0.73-1.00 AUROC across distribution shifts and steganographic tasks by aggregating per-agent signals.
Agent Factories for High Level Synthesis: How Far Can General-Purpose Coding Agents Go in Hardware Optimization?
cs.AI 2026-03 conditional novelty 7.0

An agent factory combining sub-kernel ILP assembly with multi-agent cross-optimization lets general coding agents deliver mean 8.27x speedups in HLS designs on standard benchmarks.
What Do AI Agents Talk About? Discourse and Architectural Constraints in the First AI-Only Social Network
cs.CL 2026-03 unverdicted novelty 7.0

Discourse among AI agents on Moltbook is largely determined by architectural constraints like context windows and identity files, appearing as social learning but actually short-horizon contextual conditioning.
Agentic Hives: Equilibrium, Indeterminacy, and Endogenous Cycles in Self-Organizing Multi-Agent Systems
cs.MA 2026-02 unverdicted novelty 7.0

Agentic Hives apply dynamic general equilibrium theory to variable populations of language-model agents, proving existence of equilibria, Pareto optimality, multiplicity, comparative-statics analogs, Hopf bifurcations...
Software Self-Extension with SelfEvolve: an Agentic Architecture for Runtime Code Generation
cs.SE 2026-02 conditional novelty 7.0

SelfEvolve achieves 92.7% Pass@1 success on 11 runtime self-extension tasks and outperforms baselines like AutoGen by 61.8% with statistical significance.
An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications
cs.SE 2025-09 conditional novelty 7.0

Empirical study of open-source AI agents shows testing effort concentrates on deterministic tools and workflows (over 70%) while the FM-based plan body gets under 5% and prompts appear in only 1% of tests.
From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems
cs.MA 2025-06 accept novelty 7.0

A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
Prompt Injection Attack to Tool Selection in LLM Agents
cs.CR 2025-04 conditional novelty 7.0

ToolHijacker optimizes malicious tool documents via a two-phase strategy to hijack LLM agents' tool selection in no-box settings.
Automated Design of Agentic Systems
cs.AI 2024-08 conditional novelty 7.0

Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across...
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation
cs.CL 2023-12 accept novelty 7.0

A three-agent loop of code generation, test creation, and execution feedback lifts pass@1 to 96.3% on HumanEval and 91.8% on MBPP for GPT-4 while using roughly half the tokens of prior state-of-the-art.
GAIA: a benchmark for General AI Assistants
cs.CL 2023-11 unverdicted novelty 7.0

GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
MESA: Prioritizing Vulnerable Communication Channels for Securing Multi-Agent Systems
cs.CR 2026-06 unverdicted novelty 6.0

MESA ranks MAS communication edges by vulnerability via graph-theoretic metrics and dynamic probes, achieving mean Spearman ρ=+0.60 correlation with empirical per-edge attack success and 3x interception gain when moni...
Stop Hand-Holding Your Coding Agent: Engineering the Loops that Replace Step-by-Step Prompting
cs.SE 2026-06 unverdicted novelty 6.0

Introduces loop engineering as a distinct practice layer for coding agents, supplies a taxonomy and verification ladder, and analyzes a hand-coded corpus of fifty real loops.
The Contagion Tensor: A Framework for Measuring Output-Distribution Coupling in Multi-Agent LLM Systems -- and Auditing the Claims It Enables
cs.LG 2026-06 unverdicted novelty 6.0

Introduces the Contagion Tensor and CAF metrics to measure output-distribution coupling in multi-agent LLM systems, with simulation ablations showing artifact removal and real-API tests on GPT-4o-mini and DeepSeek val...
Novelty-Aware Agentic Retrieval: Comparing Research Contributions Through Structured Multi-Step Reasoning
cs.IR 2026-06 conditional novelty 6.0

The Novelty-Aware Research Agent layers query analysis, ReAct retrieval, ranking, schema-guided extraction, three-pass comparison, and answer generation on RAG to produce structured comparison artifacts that standard ...
How Much Coordination Gain Is Real? A Paired Noise-Floor Protocol for Multi-Agent LLM Benchmarks
cs.MA 2026-06 unverdicted novelty 6.0

Paired configuration-equivalent trials on Claude Haiku 4.5 yield a noise floor of roughly [-3, +18]pp with no significant coordination contrast after correction, placing most recent multi-agent papers inside or below ...
LLM-as-Code: Agentic Programming for Agent Harness
cs.AI 2026-06 unverdicted novelty 6.0

Proposes Agentic Programming in which programs control execution flow and LLMs act as invoked components (LLM-as-Code) only for reasoning, producing DAG-structured contexts that improve stability in long-horizon compu...
Multi-Agent Reasoning with Adaptive Worker Allocation for Stance Detection
cs.CL 2026-06 unverdicted novelty 6.0

A manager-worker multi-agent framework adaptively allocates reasoning-only agents and synthesizes their explanations to improve stance detection on implicit cases across three datasets.
PDE-Agents: An LLM-Orchestrated Multi-Agent Framework for Automated Finite Element Simulations with Knowledge Graph-Augmented Reasoning
physics.comp-ph 2026-06 unverdicted novelty 6.0

PDE-Agents shows a LangGraph-orchestrated multi-agent LLM framework with GraphRAG that reaches 100% task success and perfect material fidelity on novel materials in ablation tests, with 97.8% success across 1369 produ...
Parthenon Law: A Self-Evolving Legal-Agent Framework
cs.AI 2026-06 unverdicted novelty 6.0

Parthenon is a self-evolving legal-agent framework that factors components for traceability and uses an anti-leakage learning loop to improve from scored failures on legal matters.
SkillAdaptor: Self-Adapting Skills for LLM Agents from Trajectories
cs.CL 2026-05 unverdicted novelty 6.0

SkillAdaptor introduces step-level failure attribution and targeted skill updates for LLM agents, yielding performance gains on WebShop, PinchBench, and Claw-Eval benchmarks.
SpecDB: LLM-Generated Customized Databases via Feature-Oriented Decomposition
cs.DB 2026-05 unverdicted novelty 6.0

SpecDB generates a 23,779-line Rust database via LLM subagents that matches PostgreSQL and MySQL tpmC on TPC-C while using roughly 3% of their code size.
Got a Secret? LLM Agents Can't Keep It: Evaluating Privacy in Multi-Agent Systems
cs.AI 2026-05 unverdicted novelty 6.0

Multi-agent social simulations show LLM privacy violations rising from 19.95% to 45.30%, with leakage spreading contagiously (8x after peer disclosure) and explicit instructions leaving rates above 37.8%.
SetupX: Can LLM Agents Learn from Past Failures in Functionality-Correct Code Repository Setup?
cs.SE 2026-05 unverdicted novelty 6.0

SetupX presents an experiential learning framework for LLM agents that reaches 92% pass rate on functionality-correct repository setup by transferring verified fixes across repositories via XPU representations, LIFO D...
Philosophical Dispositions as Behavioral Constraints for AI-Assisted Code Review: An Empirical Study
cs.SE 2026-05 unverdicted novelty 6.0

An empirical evaluation of philosophical dispositions constraining AI code review on 50 PRs shows 46% human convergence, 75% unique findings, zero author-judged false positives, and 51% findings absent from generic prompting.

Reference graph

Works this paper leans on

299 extracted references · 299 canonical work pages · cited by 166 Pith papers · 9 internal anchors

[1]

Playing repeated games with large language models

Elif Akata, Lion Schulz, Julian Coda-Forno, Seong Joon Oh, Matthias Bethge, and Eric Schulz. Playing repeated games with large language models. arXiv preprint, 2023

work page 2023
[2]

Program synthesis with large language models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models, 2021

work page 2021
[3]

Human-level play in the game of diplomacy by combining language models with strategic reasoning

Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, et al. Human-level play in the game of diplomacy by combining language models with strategic reasoning. Science, 2022

work page 2022
[4]

A 15 year perspective on automatic programming

Robert Balzer. A 15 year perspective on automatic programming. TSE, 1985

work page 1985
[5]

R.M. Belbin. Team Roles at Work. Routledge, 2012. URL https://books.google.co.uk/books?id=MHIQBAAAQBAJ

work page 2012
[6]

Large language models as tool makers

Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. Large language models as tool makers. arXiv preprint, 2023

work page 2023
[7]

LangChain

Harrison Chase. LangChain. https://github.com/hwchase17/langchain, 2022

work page 2022
[8]

Codet: Code generation with generated tests

Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. Codet: Code generation with generated tests, 2022

work page 2022
[9]

S-agents: self-organizing agents in open-ended environment

Jiaqi Chen, Yuxian Jiang, Jiachen Lu, and Li Zhang. S-agents: self-organizing agents in open-ended environment. arXiv preprint, 2024

work page 2024
[10]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

work page 2021
[11]

Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents

Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chen Qian, Chi-Min Chan, Yujia Qin, Yaxi Lu, Ruobing Xie, Zhiyuan Liu, Maosong Sun, and Jie Zhou. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents, 2023

work page 2023
[12]

Execution-guided neural program synthesis

Xinyun Chen, Chang Liu, and Dawn Song. Execution-guided neural program synthesis. In ICLR, 2018

work page 2018
[13]

Latent execution for neural program synthesis beyond domain-specific languages

Xinyun Chen, Dawn Song, and Yuandong Tian. Latent execution for neural program synthesis beyond domain-specific languages. NeurIPS, 2021b

work page 2021
[14]

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradb...

work page 2022
[15]

Peopleware: Productive Projects and Teams

T. DeMarco and T.R. Lister. Peopleware: Productive Projects and Teams. Addison-Wesley, 2013. URL https://books.google.co.uk/books?id=DVlsAQAAQBAJ

work page 2013
[16]

Self-collaboration code generation via chatgpt

Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. Self-collaboration code generation via chatgpt. arXiv preprint, 2023

work page 2023
[17]

Improving factuality and reasoning in language models through multiagent debate

Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate, 2023

work page 2023
[18]

Measuring and improving consistency in pretrained language models

Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Eduard Hovy, Hinrich Schütze, and Yoav Goldberg. Measuring and improving consistency in pretrained language models. TACL, 2021

work page 2021
[19]

Codebert: A pre-trained model for programming and natural languages

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. Codebert: A pre-trained model for programming and natural languages. arXiv preprint, 2020

work page 2020
[20]

Promptbreeder: Self-referential self-improvement via prompt evolution

Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel. Promptbreeder: Self-referential self-improvement via prompt evolution. arXiv preprint, 2023

work page 2023
[21]

Model-agnostic meta-learning for fast adaptation of deep networks

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017

work page 2017
[22]

Incoder: A generative model for code infilling and synthesis

Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, and Mike Lewis. Incoder: A generative model for code infilling and synthesis. arXiv preprint, 2022

work page 2022
[23]

Speculations concerning the first ultraintelligent machine

Irving John Good. Speculations concerning the first ultraintelligent machine. Adv. Comput., 1965

work page 1965
[24]

Chatllm network: More brains, more intelligence

Rui Hao, Linmei Hu, Weijian Qi, Qingliu Wu, Yirui Zhang, and Liqiang Nie. Chatllm network: More brains, more intelligence. arXiv preprint, 2023

work page 2023
[25]

Learning to learn using gradient descent

S. Hochreiter, A. S. Younger, and P. R. Conwell. Learning to learn using gradient descent. In Lecture Notes on Comp. Sci. 2130, Proc. Intl. Conf. on Artificial Neural Networks (ICANN-2001), pp. 87--94. Springer: Berlin, Heidelberg, 2001

work page 2001
[26]

Data Interpreter: An LLM Agent for Data Science

Sirui Hong, Yizhang Lin, Bangbang Liu, Binhao Wu, Danyang Li, Jiaqi Chen, Jiayi Zhang, Jinlin Wang, Lingyao Zhang, Mingchen Zhuge, et al. Data interpreter: An llm agent for data science. arXiv preprint arXiv:2402.18679, 2024

work page arXiv 2024
[27]

Self-planning code generation with large language model

Xue Jiang, Yihong Dong, Lecheng Wang, Qiwei Shang, and Ge Li. Self-planning code generation with large language model. arXiv preprint, 2023

work page 2023
[28]

Camel: Communicative agents for "mind" exploration of large scale language model society

Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large scale language model society. arXiv preprint, 2023

work page 2023
[29]

Competition-level code generation with alphacode

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with alphacode. Science, 2022

work page 2022
[30]

Encouraging divergent thinking in large language models through multi-agent debate

Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and Shuming Shi. Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint, 2023

work page 2023
[31]

Swiftsage: A generative agent with fast and slow thinking for complex interactive tasks

Bill Yuchen Lin, Yicheng Fu, Karina Yang, Prithviraj Ammanabrolu, Faeze Brahman, Shiyu Huang, Chandra Bhagavatula, Yejin Choi, and Xiang Ren. Swiftsage: A generative agent with fast and slow thinking for complex interactive tasks. arXiv preprint, 2023

work page 2023
[32]

Training socially aligned language models in simulated human society

Ruibo Liu, Ruixin Yang, Chenyan Jia, Ge Zhang, Denny Zhou, Andrew M Dai, Diyi Yang, and Soroush Vosoughi. Training socially aligned language models in simulated human society. arXiv preprint, 2023a

work page 2023
[33]

Ml-bench: Evaluating large language models and agents for machine learning tasks on repository-level code

Yuliang Liu, Xiangru Tang, Zefan Cai, Junjie Lu, Yichi Zhang, Yanjun Shao, Zexuan Deng, Helan Hu, Zengxian Yang, Kaikai An, et al. Ml-bench: Large language models leverage open-source libraries for machine learning tasks. arXiv preprint arXiv:2311.09835, 2023b

work page arXiv 2023
[34]

Wizardcoder: Empowering code large language models with evol-instruct

Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint, 2023

work page 2023
[35]

Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models

Potsawee Manakul, Adian Liusie, and Mark JF Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint, 2023

work page 2023
[36]

Manifesto for agile software development

Agile Manifesto. Manifesto for agile software development. Snowbird, UT, 2001

work page 2001
[37]

History of lisp

John McCarthy. History of lisp. In History of programming languages. 1978

work page 1978
[38]

Octopack: Instruction tuning code large language models

Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro Von Werra, and Shayne Longpre. Octopack: Instruction tuning code large language models. arXiv preprint arXiv:2308.07124, 2023

work page arXiv 2023
[39]

Lever: Learning to verify language-to-code generation with execution

Ansong Ni, Srini Iyer, Dragomir Radev, Veselin Stoyanov, Wen-tau Yih, Sida Wang, and Xi Victoria Lin. Lever: Learning to verify language-to-code generation with execution. In ICML, 2023

work page 2023
[40]

Codegen: An open large language model for code with multi-turn program synthesis

Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. Codegen: An open large language model for code with multi-turn program synthesis, 2023

work page 2023
[41]

Gpt-4 technical report

OpenAI. Gpt-4 technical report, 2023

work page 2023
[42]

Generative agents: Interactive simulacra of human behavior

Joon Sung Park, Joseph C O'Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. arXiv preprint, 2023

work page 2023
[43]

Communicative agents for software development

Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun. Communicative agents for software development, 2023

work page 2023
[44]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[45]

Code llama: Open foundation models for code

Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. arXiv preprint, 2023

work page 2023
[46]

Toolformer: Language models can teach themselves to use tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint, 2023

work page 2023
[47]

A self-referential weight matrix

J. Schmidhuber. A self-referential weight matrix. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pp. 446--451. Springer, 1993a

work page 1993
[49]

Gödel machines: Self-referential universal problem solvers making provably optimal self-improvements

J. Schmidhuber. Gödel machines: Fully self-referential optimal universal self-improvers. In B. Goertzel and C. Pennachin (eds.), Artificial General Intelligence, pp. 199--226. Springer Verlag, 2006. Variant available as arXiv:cs.LO/0309048

work page arXiv 2006
[50]

Ultimate cognition à la Gödel

J. Schmidhuber. Ultimate cognition à la Gödel. Cognitive Computation, 1(2):177--193, 2009

work page 2009
[51]

Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook

Jürgen Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. PhD thesis, 1987

work page 1987
[52]

A ‘self-referential’ weight matrix

Jürgen Schmidhuber. A ‘self-referential’ weight matrix. In ICANN’93: Proceedings of the International Conference on Artificial Neural Networks Amsterdam, The Netherlands 13--16 September 1993 3, 1993b

work page 1993
[53]

On learning to think: Algorithmic information theory for novel combinations of reinforcement learning controllers and recurrent neural world models

Jürgen Schmidhuber. On learning to think: Algorithmic information theory for novel combinations of reinforcement learning controllers and recurrent neural world models. arXiv preprint, 2015

work page 2015
[54]

Reinforcement learning with self-modifying policies

Jürgen Schmidhuber, Jieyu Zhao, and Nicol N Schraudolph. Reinforcement learning with self-modifying policies. In Learning to learn. 1998

work page 1998
[55]

Reflexion: an autonomous agent with dynamic memory and self-reflection

Noah Shinn, Beck Labash, and Ashwin Gopinath. Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint, 2023

work page 2023
[56]

Errors are useful prompts: Instruction guided task programming with verifier-assisted iterative prompting

Marta Skreta, Naruki Yoshikawa, Sebastian Arellano-Rubach, Zhi Ji, Lasse Bjørn Kristensen, Kourosh Darvish, Alán Aspuru-Guzik, Florian Shkurti, and Animesh Garg. Errors are useful prompts: Instruction guided task programming with verifier-assisted iterative prompting. arXiv preprint, 2023

work page 2023
[57]

Learning to program = learning to construct mechanisms and explanations

Elliot Soloway. Learning to program = learning to construct mechanisms and explanations. Communications of the ACM, 1986

work page 1986
[58]

Multi-agent collaboration: Harnessing the power of intelligent llm agents

Yashar Talebirad and Amirhossein Nadiri. Multi-agent collaboration: Harnessing the power of intelligent llm agents, 2023

work page 2023
[59]

Biocoder: A benchmark for bioinformatics code generation with contextual pragmatic knowledge

Xiangru Tang, Bill Qian, Rick Gao, Jiakang Chen, Xinyun Chen, and Mark Gerstein. Biocoder: A benchmark for bioinformatics code generation with contextual pragmatic knowledge. arXiv preprint arXiv:2308.16458, 2023a

work page arXiv 2023
[60]

Medagents: Large language models as collaborators for zero-shot medical reasoning

Xiangru Tang, Anni Zou, Zhuosheng Zhang, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. Medagents: Large language models as collaborators for zero-shot medical reasoning. arXiv preprint arXiv:2311.10537, 2023b

work page arXiv 2023
[61]

Auto-gpt

Torantulino et al. Auto-gpt. https://github.com/Significant-Gravitas/Auto-GPT, 2023

work page 2023
[62]

R. J. Waldinger and R. C. T. Lee. PROW: a step toward automatic program writing. In D. E. Walker and L. M. Norton (eds.), Proceedings of the 1st International Joint Conference on Artificial Intelligence (IJCAI), 1969

work page 1969
[63]

Voyager: An open-ended embodied agent with large language models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint, 2023a

work page 2023
[64]

A survey on large language model based autonomous agents

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. arXiv preprint, 2023b

work page 2023
[65]

Self-consistency improves chain of thought reasoning in language models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint, 2022

work page 2022
[66]

Unleashing cognitive synergy in large language models: A task-solving agent through multi-persona self-collaboration

Zhenhailong Wang, Shaoguang Mao, Wenshan Wu, Tao Ge, Furu Wei, and Heng Ji. Unleashing cognitive synergy in large language models: A task-solving agent through multi-persona self-collaboration. arXiv preprint, 2023c

work page 2023
[67]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 2022

work page 2022
[68]

Pitfalls of agent-oriented development

Michael Wooldridge and Nicholas R. Jennings. Pitfalls of agent-oriented development. In Proceedings of the Second International Conference on Autonomous Agents, 1998. URL https://doi.org/10.1145/280765.280867

work page doi:10.1145/280765.280867 1998
[69]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint, 2022

work page 2022
[70]

Self-taught optimizer (stop): Recursively self-improving code generation

Eric Zelikman, Eliana Lorch, Lester Mackey, and Adam Tauman Kalai. Self-taught optimizer (stop): Recursively self-improving code generation. arXiv preprint, 2023

work page 2023
[71]

Building cooperative embodied agents modularly with large language models

Hongxin Zhang, Weihua Du, Jiaming Shan, Qinhong Zhou, Yilun Du, Joshua B Tenenbaum, Tianmin Shu, and Chuang Gan. Building cooperative embodied agents modularly with large language models. arXiv preprint, 2023a

work page 2023
[72]

Igniting language intelligence: The hitchhiker's guide from chain-of-thought reasoning to language agents

Zhuosheng Zhang, Yao Yao, Aston Zhang, Xiangru Tang, Xinbei Ma, Zhiwei He, Yiming Wang, Mark Gerstein, Rui Wang, Gongshen Liu, et al. Igniting language intelligence: The hitchhiker's guide from chain-of-thought reasoning to language agents. arXiv preprint arXiv:2311.11797, 2023b

work page arXiv 2023
[73]

Chat with the environment: Interactive multimodal perception using large language models

Xufeng Zhao, Mengdi Li, Cornelius Weber, Muhammad Burhan Hafez, and Stefan Wermter. Chat with the environment: Interactive multimodal perception using large language models. arXiv preprint, 2023

work page 2023
[74]

Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x

Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Zihan Wang, Lei Shen, Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x, 2023

work page 2023
[75]

Webarena: A realistic web environment for building autonomous agents

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint, 2023a

work page 2023
[76]

Agents: An Open-source Framework for Autonomous Language Agents

Wangchunshu Zhou, Yuchen Eleanor Jiang, Long Li, Jialong Wu, Tiannan Wang, Shi Qiu, Jintian Zhang, Jing Chen, Ruipu Wu, Shuai Wang, et al. Agents: An open-source framework for autonomous language agents. arXiv preprint arXiv:2309.07870, 2023b

work page arXiv 2023
[77]

Mindstorms in natural language-based societies of mind

Mingchen Zhuge, Haozhe Liu, Francesco Faccio, Dylan R Ashley, Róbert Csordás, Anand Gopalakrishnan, Abdullah Hamdi, Hasan Abed Al Kader Hammoud, Vincent Herrmann, Kazuki Irie, et al. Mindstorms in natural language-based societies of mind. arXiv preprint, 2023

work page 2023

Showing first 80 references.

[1] [1]

Playing repeated games with large language models

Elif Akata, Lion Schulz, Julian Coda-Forno, Seong Joon Oh, Matthias Bethge, and Eric Schulz. Playing repeated games with large language models. arXiv preprint, 2023

work page 2023

[2] [2]

Program synthesis with large language models, 2021

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models, 2021

work page 2021

[3] [3]

Human-level play in the game of diplomacy by combining language models with strategic reasoning

Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, et al. Human-level play in the game of diplomacy by combining language models with strategic reasoning. Science, 2022

work page 2022

[4] [4]

A 15 year perspective on automatic programming

Robert Balzer. A 15 year perspective on automatic programming. TSE, 1985

work page 1985

[5] [5]

R.M. Belbin. Team Roles at Work. Routledge, 2012. URL https://books.google.co.uk/books?id=MHIQBAAAQBAJ

work page 2012

[6] [6]

Large language models as tool makers

Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. Large language models as tool makers. arXiv preprint, 2023

work page 2023

[7] [7]

LangChain

Harrison Chase. LangChain . https://github.com/hwchase17/langchain, 2022

work page 2022

[8] [8]

Codet: Code generation with generated tests, 2022

Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. Codet: Code generation with generated tests, 2022

work page 2022

[9] [9]

S-agents: self-organizing agents in open-ended environment

Jiaqi Chen, Yuxian Jiang, Jiachen Lu, and Li Zhang. S-agents: self-organizing agents in open-ended environment. arXiv preprint, 2024

work page 2024

[10] [10]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

work page 2021

[11] [11]

Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents, 2023

Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chen Qian, Chi-Min Chan, Yujia Qin, Yaxi Lu, Ruobing Xie, Zhiyuan Liu, Maosong Sun, and Jie Zhou. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents, 2023

work page 2023

[12] [12]

Execution-guided neural program synthesis

Xinyun Chen, Chang Liu, and Dawn Song. Execution-guided neural program synthesis. In ICLR, 2018

work page 2018

[13] [13]

Latent execution for neural program synthesis beyond domain-specific languages

Xinyun Chen, Dawn Song, and Yuandong Tian. Latent execution for neural program synthesis beyond domain-specific languages. NeurIPS, 2021 b

work page 2021

[14] [14]

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradb...

work page 2022

[15] [15]

DeMarco and T.R

T. DeMarco and T.R. Lister. Peopleware: Productive Projects and Teams. Addison-Wesley, 2013. URL https://books.google.co.uk/books?id=DVlsAQAAQBAJ

work page 2013

[16] [16]

Self-collaboration code generation via chatgpt

Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. Self-collaboration code generation via chatgpt. arXiv preprint, 2023

work page 2023

[17] [17]

Tenenbaum, and Igor Mordatch

Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate, 2023

work page 2023

[18] [18]

Measuring and improving consistency in pretrained language models

Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Eduard Hovy, Hinrich Sch \"u tze, and Yoav Goldberg. Measuring and improving consistency in pretrained language models. TACL, 2021

work page 2021

[19] [19]

Codebert: A pre-trained model for programming and natural languages

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. Codebert: A pre-trained model for programming and natural languages. arXiv preprint, 2020

work page 2020

[20] [20]

Promptbreeder: Self-referential self-improvement via prompt evolution

Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rockt \"a schel. Promptbreeder: Self-referential self-improvement via prompt evolution. arXiv preprint, 2023

work page 2023

[21] [21]

Model-agnostic meta-learning for fast adaptation of deep networks

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017

work page 2017

[22] [22]

Incoder: A generative model for code infilling and synthesis

Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, and Mike Lewis. Incoder: A generative model for code infilling and synthesis. arXiv preprint, 2022

work page 2022

[23] [23]

Speculations concerning the first ultraintelligent machine

Irving John Good. Speculations concerning the first ultraintelligent machine. Adv. Comput., 1965

work page 1965

[24] [24]

Chatllm network: More brains, more intelligence

Rui Hao, Linmei Hu, Weijian Qi, Qingliu Wu, Yirui Zhang, and Liqiang Nie. Chatllm network: More brains, more intelligence. arXiv preprint, 2023

work page 2023

[25] [25]

Hochreiter, A

S. Hochreiter, A. S. Younger, and P. R. Conwell. Learning to learn using gradient descent. In Lecture Notes on Comp. Sci. 2130, Proc. Intl. Conf. on Artificial Neural Networks (ICANN-2001), pp.\ 87--94. Springer: Berlin, Heidelberg, 2001

work page 2001

[26] [26]

Data Interpreter: An LLM Agent for Data Science,

Sirui Hong, Yizhang Lin, Bangbang Liu, Binhao Wu, Danyang Li, Jiaqi Chen, Jiayi Zhang, Jinlin Wang, Lingyao Zhang, Mingchen Zhuge, et al. Data interpreter: An llm agent for data science. arXiv preprint arXiv:2402.18679, 2024

work page arXiv 2024

[27] [27]

Self-planning code generation with large language model

Xue Jiang, Yihong Dong, Lecheng Wang, Qiwei Shang, and Ge Li. Self-planning code generation with large language model. arXiv preprint, 2023

work page 2023

[28] [28]

Camel: Communicative agents for" mind" exploration of large scale language model society

Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for" mind" exploration of large scale language model society. arXiv preprint, 2023

work page 2023

[29] [29]

Competition-level code generation with alphacode

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, R \'e mi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with alphacode. Science, 2022

work page 2022

[30] [30]

Encouraging divergent thinking in large language models through multi-agent debate

Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu, and Shuming Shi. Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint, 2023

work page 2023

[31] [31]

Swiftsage: A generative agent with fast and slow thinking for complex interactive tasks

Bill Yuchen Lin, Yicheng Fu, Karina Yang, Prithviraj Ammanabrolu, Faeze Brahman, Shiyu Huang, Chandra Bhagavatula, Yejin Choi, and Xiang Ren. Swiftsage: A generative agent with fast and slow thinking for complex interactive tasks. arXiv preprint, 2023

work page 2023

[32] [32]

Training socially aligned language models in simulated human society

Ruibo Liu, Ruixin Yang, Chenyan Jia, Ge Zhang, Denny Zhou, Andrew M Dai, Diyi Yang, and Soroush Vosoughi. Training socially aligned language models in simulated human society. arXiv preprint, 2023 a

work page 2023

[33] [33]

Ml-bench: Evaluating large language models and agents for machine learning tasks on repository-level code.arXiv preprint arXiv:2311.09835, 2023

Yuliang Liu, Xiangru Tang, Zefan Cai, Junjie Lu, Yichi Zhang, Yanjun Shao, Zexuan Deng, Helan Hu, Zengxian Yang, Kaikai An, et al. Ml-bench: Large language models leverage open-source libraries for machine learning tasks. arXiv preprint arXiv:2311.09835, 2023 b

work page arXiv 2023

[34] [34]

Wizardcoder: Empowering code large language models with evol-instruct

Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint, 2023

work page 2023

[35] [35]

Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models

Potsawee Manakul, Adian Liusie, and Mark JF Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint, 2023

work page 2023

[36] [36]

Manifesto for agile software development

Agile Manifesto. Manifesto for agile software development. Snowbird, UT, 2001

work page 2001

[37] [37]

History of lisp

John McCarthy. History of lisp. In History of programming languages. 1978

work page 1978

[38] [38]

Octopack: Instruction tuning code large language models

Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro Von Werra, and Shayne Longpre. Octopack: Instruction tuning code large language models. arXiv preprint arXiv:2308.07124, 2023

work page arXiv 2023

[39] [39]

Lever: Learning to verify language-to-code generation with execution

Ansong Ni, Srini Iyer, Dragomir Radev, Veselin Stoyanov, Wen-tau Yih, Sida Wang, and Xi Victoria Lin. Lever: Learning to verify language-to-code generation with execution. In ICML, 2023

work page 2023

[40] [40]

Codegen: An open large language model for code with multi-turn program synthesis, 2023

Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. Codegen: An open large language model for code with multi-turn program synthesis, 2023

work page 2023

[41] [41]

Gpt-4 technical report, 2023

OpenAI. Gpt-4 technical report, 2023

work page 2023

[42] [42]

Generative agents: Interactive simulacra of human behavior

Joon Sung Park, Joseph C O'Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. arXiv preprint, 2023

work page 2023

[43] [43]

Communicative agents for software development, 2023

Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun. Communicative agents for software development, 2023

work page 2023

[44] [44]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789, 2023

work page arXiv 2023

[45] [45]

Code llama: Open foundation models for code

Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. arXiv preprint, 2023

work page 2023

[46] [46]

Toolformer: Language models can teach themselves to use tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint, 2023

work page 2023

[47] [47]

A self-referential weight matrix

J. Schmidhuber. A self-referential weight matrix. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pp. 446-451. Springer, 1993a

work page 1993

[48] [49]

Gödel machines: Self-referential universal problem solvers making provably optimal self-improvements

J. Schmidhuber. Gödel machines: Fully self-referential optimal universal self-improvers. In B. Goertzel and C. Pennachin (eds.), Artificial General Intelligence, pp. 199-226. Springer Verlag, 2006. Variant available as arXiv:cs.LO/0309048

work page arXiv 2006

[49] [50]

Ultimate cognition à la Gödel

J. Schmidhuber. Ultimate cognition à la Gödel. Cognitive Computation, 1(2):177-193, 2009

work page 2009

[50] [51]

Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook

Jürgen Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. PhD thesis, 1987

work page 1987

[51] [52]

A ‘self-referential’ weight matrix

Jürgen Schmidhuber. A ‘self-referential’ weight matrix. In ICANN’93: Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, The Netherlands, 13-16 September 1993 3, 1993b

work page 1993

[52] [53]

On learning to think: Algorithmic information theory for novel combinations of reinforcement learning controllers and recurrent neural world models

Jürgen Schmidhuber. On learning to think: Algorithmic information theory for novel combinations of reinforcement learning controllers and recurrent neural world models. arXiv preprint, 2015

work page 2015

[53] [54]

Reinforcement learning with self-modifying policies

Jürgen Schmidhuber, Jieyu Zhao, and Nicol N Schraudolph. Reinforcement learning with self-modifying policies. In Learning to learn. 1998

work page 1998

[54] [55]

Reflexion: an autonomous agent with dynamic memory and self-reflection

Noah Shinn, Beck Labash, and Ashwin Gopinath. Reflexion: an autonomous agent with dynamic memory and self-reflection. arXiv preprint, 2023

work page 2023

[55] [56]

Errors are useful prompts: Instruction guided task programming with verifier-assisted iterative prompting

Marta Skreta, Naruki Yoshikawa, Sebastian Arellano-Rubach, Zhi Ji, Lasse Bjørn Kristensen, Kourosh Darvish, Alán Aspuru-Guzik, Florian Shkurti, and Animesh Garg. Errors are useful prompts: Instruction guided task programming with verifier-assisted iterative prompting. arXiv preprint, 2023

work page 2023

[56] [57]

Learning to program = learning to construct mechanisms and explanations

Elliot Soloway. Learning to program = learning to construct mechanisms and explanations. Communications of the ACM, 1986

work page 1986

[57] [58]

Multi-agent collaboration: Harnessing the power of intelligent llm agents, 2023

Yashar Talebirad and Amirhossein Nadiri. Multi-agent collaboration: Harnessing the power of intelligent llm agents, 2023

work page 2023

[58] [59]

Biocoder: A benchmark for bioinformatics code generation with contextual pragmatic knowledge

Xiangru Tang, Bill Qian, Rick Gao, Jiakang Chen, Xinyun Chen, and Mark Gerstein. Biocoder: A benchmark for bioinformatics code generation with contextual pragmatic knowledge. arXiv preprint arXiv:2308.16458, 2023a

work page arXiv 2023

[59] [60]

Medagents: Large language models as collaborators for zero-shot medical reasoning

Xiangru Tang, Anni Zou, Zhuosheng Zhang, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. Medagents: Large language models as collaborators for zero-shot medical reasoning. arXiv preprint arXiv:2311.10537, 2023b

work page arXiv 2023

[60] [61]

Auto-gpt

Torantulino et al. Auto-gpt. https://github.com/Significant-Gravitas/Auto-GPT, 2023

work page 2023

[61] [62]

R. J. Waldinger and R. C. T. Lee. PROW: a step toward automatic program writing. In D. E. Walker and L. M. Norton (eds.), Proceedings of the 1st International Joint Conference on Artificial Intelligence (IJCAI), 1969

work page 1969

[62] [63]

Voyager: An open-ended embodied agent with large language models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint, 2023a

work page 2023

[63] [64]

A survey on large language model based autonomous agents

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. arXiv preprint, 2023b

work page 2023

[64] [65]

Self-consistency improves chain of thought reasoning in language models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint, 2022

work page 2022

[65] [66]

Unleashing cognitive synergy in large language models: A task-solving agent through multi-persona self-collaboration

Zhenhailong Wang, Shaoguang Mao, Wenshan Wu, Tao Ge, Furu Wei, and Heng Ji. Unleashing cognitive synergy in large language models: A task-solving agent through multi-persona self-collaboration. arXiv preprint, 2023c

work page 2023

[66] [67]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS, 2022

work page 2022

[67] [68]

Pitfalls of agent-oriented development

Michael Wooldridge and Nicholas R. Jennings. Pitfalls of agent-oriented development. In Proceedings of the Second International Conference on Autonomous Agents, 1998. URL https://doi.org/10.1145/280765.280867

work page doi:10.1145/280765.280867 1998

[68] [69]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint, 2022

work page 2022

[69] [70]

Self-taught optimizer (stop): Recursively self-improving code generation

Eric Zelikman, Eliana Lorch, Lester Mackey, and Adam Tauman Kalai. Self-taught optimizer (stop): Recursively self-improving code generation. arXiv preprint, 2023

work page 2023

[70] [71]

Building cooperative embodied agents modularly with large language models

Hongxin Zhang, Weihua Du, Jiaming Shan, Qinhong Zhou, Yilun Du, Joshua B Tenenbaum, Tianmin Shu, and Chuang Gan. Building cooperative embodied agents modularly with large language models. arXiv preprint, 2023a

work page 2023

[71] [72]

Igniting language intelligence: The hitchhiker's guide from chain-of-thought reasoning to language agents

Zhuosheng Zhang, Yao Yao, Aston Zhang, Xiangru Tang, Xinbei Ma, Zhiwei He, Yiming Wang, Mark Gerstein, Rui Wang, Gongshen Liu, et al. Igniting language intelligence: The hitchhiker's guide from chain-of-thought reasoning to language agents. arXiv preprint arXiv:2311.11797, 2023b

work page arXiv 2023

[72] [73]

Chat with the environment: Interactive multimodal perception using large language models

Xufeng Zhao, Mengdi Li, Cornelius Weber, Muhammad Burhan Hafez, and Stefan Wermter. Chat with the environment: Interactive multimodal perception using large language models. arXiv preprint, 2023

work page 2023

[73] [74]

Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x, 2023

Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Zihan Wang, Lei Shen, Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x, 2023

work page 2023

[74] [75]

Webarena: A realistic web environment for building autonomous agents

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint, 2023a

work page 2023

[75] [76]

Agents: An Open-source Framework for Autonomous Language Agents, 2023

Wangchunshu Zhou, Yuchen Eleanor Jiang, Long Li, Jialong Wu, Tiannan Wang, Shi Qiu, Jintian Zhang, Jing Chen, Ruipu Wu, Shuai Wang, et al. Agents: An open-source framework for autonomous language agents. arXiv preprint arXiv:2309.07870, 2023b

work page arXiv 2023

[76] [77]

Mindstorms in natural language-based societies of mind

Mingchen Zhuge, Haozhe Liu, Francesco Faccio, Dylan R Ashley, Róbert Csordás, Anand Gopalakrishnan, Abdullah Hamdi, Hasan Abed Al Kader Hammoud, Vincent Herrmann, Kazuki Irie, et al. Mindstorms in natural language-based societies of mind. arXiv preprint, 2023

work page 2023
