
arxiv: 2402.03300 · v3 · submitted 2024-02-05 · 💻 cs.CL · cs.AI · cs.LG

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Pith reviewed 2026-05-24 03:20 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords mathematical reasoning · language models · continued pre-training · reinforcement learning · MATH benchmark · policy optimization · open source models · web data curation

The pith

DeepSeekMath 7B reaches 51.7% on the MATH benchmark by continuing pre-training on 120B curated web math tokens and applying Group Relative Policy Optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DeepSeekMath 7B, created by taking DeepSeek-Coder-Base-v1.5 7B and continuing its pre-training on 120 billion math-related tokens extracted from Common Crawl along with natural language and code data. It reports 51.7 percent accuracy on the competition-level MATH benchmark without external toolkits or voting, and 60.9 percent when applying self-consistency over 64 samples. The authors present this result as evidence that the combination of a careful web-data selection pipeline and the new Group Relative Policy Optimization method can produce strong mathematical reasoning in an open 7B model.

Core claim

DeepSeekMath 7B shows that continued pre-training on a large volume of curated math tokens from public web data, followed by reinforcement learning with Group Relative Policy Optimization, enables a 7B open model to reach 51.7 percent on the MATH benchmark and approach the level of closed frontier systems without relying on external tools or ensembles.

What carries the argument

Group Relative Policy Optimization (GRPO), a memory-efficient variant of Proximal Policy Optimization that scores groups of responses relative to one another, combined with a data selection pipeline that extracts and filters 120B math-related tokens from Common Crawl.
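
The "relative to one another" scoring can be illustrated with a minimal sketch. It assumes one scalar reward per sampled response and normalizes each reward by the mean and standard deviation of its own group, so the group mean plays the role of the baseline that a learned value model would otherwise supply. The function name, the `eps` guard, and the choice of population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each response against the other responses to the same prompt.

    rewards: one scalar reward per sampled response in the group.
    eps: hypothetical guard against division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With rewards `[1.0, 0.0, 0.0, 1.0]`, the two correct responses receive positive advantages and the two incorrect ones receive negative advantages of equal magnitude; a group whose rewards are all identical yields zero advantage everywhere, so it contributes no learning signal.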

If this is right

  • Self-consistency sampling over 64 responses raises MATH accuracy from 51.7 percent to 60.9 percent.
  • Open 7B models can reach performance close to closed models on competition math without tool use or voting.
  • GRPO reduces the memory footprint of PPO while still improving reasoning performance.
  • Public web data contains enough high-quality math content to support large-scale continued pre-training when filtered carefully.
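
The self-consistency result in the first bullet rests on a simple voting step: sample many responses to the same problem, extract a final answer from each, and report the most frequent one. The sketch below assumes answers have already been extracted and normalized to comparable strings upstream, with `None` marking responses where no answer could be parsed; the function name is hypothetical.

```python
from collections import Counter


def self_consistency(answers):
    """Return the most frequent final answer among sampled responses.

    answers: extracted final answers, one per sample (None = unparseable).
    Ties resolve to the answer seen first, which is how Counter orders them.
    """
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

In the setting the paper reports, `answers` would hold 64 entries per problem, and the voted answer is then compared with the reference answer as in single-sample evaluation.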

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same data-curation approach could be tested on other structured reasoning domains such as code or physics problem solving.
  • GRPO might transfer to reinforcement learning settings outside mathematics where relative scoring within batches is feasible.
  • Further increases in the volume of filtered math tokens or model size could narrow the remaining gap to closed frontier systems.

Load-bearing premise

The performance on MATH is driven primarily by the data selection pipeline and the GRPO algorithm rather than by other details of the base model or training setup.

What would settle it

Train an otherwise identical 7B model on the same base checkpoint but without the math-data selection step or without GRPO and measure whether accuracy on MATH stays well below 51.7 percent.
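
The proposed check amounts to a 2x2 ablation over the two claimed factors. A minimal sketch of that design follows; the condition keys and function names are hypothetical, and the accuracies must come from actual training runs, none of which are reported on this page.

```python
from itertools import product


def ablation_conditions():
    """Enumerate the 2x2 grid: math-data selection on/off, GRPO on/off."""
    return [
        {"math_data_selection": d, "grpo": g}
        for d, g in product([False, True], repeat=2)
    ]


def attributed_gains(acc):
    """Compare the full recipe against each ablated condition.

    acc maps (math_data_selection, grpo) -> measured MATH accuracy.
    """
    full = acc[(True, True)]
    return {
        "without_data_selection": full - acc[(False, True)],
        "without_grpo": full - acc[(True, False)],
        "without_both": full - acc[(False, False)],
    }
```

Large positive gaps in the first two entries would support the load-bearing premise; gaps near zero would point to the base model or other training details instead.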

Original abstract

Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to two key factors: First, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline. Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DeepSeekMath 7B, obtained by continued pre-training of DeepSeek-Coder-Base-v1.5 7B on 120B math-related tokens from Common Crawl plus natural language and code data. It reports 51.7% accuracy on the MATH benchmark (60.9% with 64-sample self-consistency) without external toolkits or voting, approaching Gemini-Ultra and GPT-4. The authors attribute the gains to a meticulously engineered data selection pipeline from web data and the introduction of Group Relative Policy Optimization (GRPO), a PPO variant that improves mathematical reasoning while reducing memory usage.

Significance. If the attribution to the data pipeline and GRPO is substantiated by controls, the result would demonstrate that open 7B models can reach near-closed-model performance on competition math through public-data curation and a memory-efficient RL variant, offering a reproducible route for advancing reasoning capabilities.

major comments (2)
  1. [Experiments section (results and attribution paragraphs)] The central claim attributes the jump to 51.7% MATH primarily to the data selection pipeline and GRPO, yet no ablation results are supplied for (a) the base DeepSeek-Coder-Base-v1.5 7B on MATH, (b) the same 120B tokens with standard SFT or PPO instead of GRPO, or (c) the identical pipeline without the “meticulous” filtering step. This absence leaves the causal contribution of the two listed factors unsecured.
  2. [Results tables] Table reporting MATH scores (and any comparison tables) does not include the base model score or the continued-pretraining-only condition, making it impossible to quantify how much of the reported gain is due to the claimed factors versus scale of math tokens or the code-strong base model.
minor comments (2)
  1. [Abstract] The abstract states “120B math-related tokens” but does not clarify the total token count, the exact mix of math/NL/code, or the filtering criteria used in the pipeline.
  2. [Method section on GRPO] Notation for GRPO (reward formulation, group size, KL coefficient) should be defined with explicit equations in the method section to allow direct comparison with standard PPO.
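
For orientation, the notation requested in the second minor comment can be sketched as follows. This is a reconstruction of the commonly cited outcome-supervised form of GRPO and should be checked against the paper's method section. Here $G$ is the group size, $r_i$ the reward of the $i$-th sampled response $o_i$ to question $q$, $\varepsilon$ the clipping range, and $\beta$ the KL coefficient; the shorthand $\rho_{i,t}$ for the probability ratio is introduced here for readability and is an assumption of this sketch.

```latex
\begin{align*}
\mathcal{J}_{\mathrm{GRPO}}(\theta)
  &= \mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid q)}
     \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
     \Big\{ \min\!\big[ \rho_{i,t}(\theta)\, \hat{A}_{i,t},\
       \operatorname{clip}\big(\rho_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\, \hat{A}_{i,t} \big]
     - \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_{\theta} \,\|\, \pi_{\mathrm{ref}}\big] \Big\}, \\
\rho_{i,t}(\theta)
  &= \frac{\pi_{\theta}(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})},
\qquad
\hat{A}_{i,t} = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}.
\end{align*}
```

Compared with standard PPO, the advantage $\hat{A}_{i,t}$ is computed from the group's own rewards rather than from a learned value function, which is where the memory saving comes from.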

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address the concerns about missing ablations and table information below, and will make appropriate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Experiments section (results and attribution paragraphs)] The central claim attributes the jump to 51.7% MATH primarily to the data selection pipeline and GRPO, yet no ablation results are supplied for (a) the base DeepSeek-Coder-Base-v1.5 7B on MATH, (b) the same 120B tokens with standard SFT or PPO instead of GRPO, or (c) the identical pipeline without the “meticulous” filtering step. This absence leaves the causal contribution of the two listed factors unsecured.

    Authors: We acknowledge the importance of ablations to substantiate the claims. In the revised manuscript, we will add the performance of the base model DeepSeek-Coder-Base-v1.5 7B on the MATH benchmark. The comparisons with standard SFT or PPO, and the run of the pipeline without filtering, were not conducted due to computational constraints. We will provide additional discussion of the rationale behind GRPO and the data curation process to better support the attribution. revision: partial

  2. Referee: [Results tables] Table reporting MATH scores (and any comparison tables) does not include the base model score or the continued-pretraining-only condition, making it impossible to quantify how much of the reported gain is due to the claimed factors versus scale of math tokens or the code-strong base model.

    Authors: We agree that including these baselines will improve clarity. We will update the tables in the results section to include the base model score and clarify the contributions from continued pre-training. revision: yes

standing simulated objections not resolved
  • Full ablation studies on the effects of the data filtering pipeline and direct comparisons between GRPO and standard PPO, as these require new experiments not present in the original work.

Circularity Check

0 steps flagged

No circularity; empirical training results with no self-referential derivation.

Full rationale

The paper reports benchmark scores from continued pretraining of an existing base model (DeepSeek-Coder-Base-v1.5 7B) on 120B tokens followed by GRPO fine-tuning. The central claims are measured outcomes (51.7% MATH) and an attribution to two engineering choices (data pipeline + GRPO). No equations, uniqueness theorems, fitted parameters renamed as predictions, or self-citations that close a logical loop appear in the abstract or stated claims. Attribution without ablations is a weakness of evidence, not circularity by construction. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to identify specific free parameters, axioms, or invented entities beyond the high-level description of GRPO as a variant of PPO.

pith-pipeline@v0.9.0 · 5753 in / 1176 out tokens · 60175 ms · 2026-05-24T03:20:21.561498+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

    cs.AI 2026-04 conditional novelty 9.0

    AI reviews for all 22,977 AAAI-26 papers were preferred by authors and PC members over human reviews on accuracy and suggestions and outperformed baselines at spotting weaknesses.

  2. Behavior Cloning is Not All You Need: The Optimality of On-Policy Distillation for Noisy Expert Feedback

    cs.LG 2026-06 unverdicted novelty 8.0

    Noisy expert imitation learning requires exponential samples for offline methods but polynomial for a variant of on-policy distillation under a noise condition.

  3. Textual Belief States for World Models: Identifiable Representation Learning Under Strict Mediation

    cs.LG 2026-06 unverdicted novelty 8.0

    Introduces textual belief states and factorized GRPO to enforce strict latent state mediation in text-based world models, yielding preserved prediction accuracy with large gains in representation quality and rollout p...

  4. Learning to Trigger: Reinforcement Learning at the Large Hadron Collider

    cs.LG 2026-06 conditional novelty 8.0

    RL agent for online LHC trigger threshold tuning improves in-tolerance intervals by 28-56% on Monte Carlo and real CMS data without fine-tuning.

  5. DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

    cs.AI 2026-06 conditional novelty 8.0

    DeFAb is a large-scale, formally verifiable benchmark for defeasible abduction derived from 18 knowledge bases, demonstrating that frontier LLMs achieve 7.8-65% accuracy versus 100% for a rule-based solver with polyno...

  6. Entropy-Gated Latent Recursion

    cs.LG 2026-06 unverdicted novelty 8.0

    EGLR adds a deterministic layer-recursion axis gated by entropy that is complementary to temperature sampling, raising joint oracle accuracy on MATH-500 from 83.4% to 91.6% for a 3B model.

  7. A Pre-Registered Causal Partition of Self-Consistency Elicitation and Reward Design in RLVR

    cs.AI 2026-06 conditional novelty 8.0

    Derives an exact telescoping decomposition of the naive RLVR reward-design estimator into null, elicitation, and reward-design terms on a tabular-GRPO simulator, measures the components across prior strengths, and val...

  8. Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?

    cs.CV 2026-05 accept novelty 8.0

    Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO p...

  9. Transformers Provably Learn to Internalize Chain-of-Thought

    cs.LG 2026-05 unverdicted novelty 8.0

    L-layer transformers under Log-ICoT curriculum provably learn k-parity with poly(n) samples and log k stages, matching explicit CoT efficiency without inference overhead.

  10. SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain

    cs.AI 2026-05 accept novelty 8.0

    SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, with evaluations showing direct QA at 66.4%, best practical agents at 79.1%, and oracle knowledge at 95.4%.

  11. DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts

    cs.LG 2026-05 conditional novelty 8.0

    DualKV is a new FlashAttention variant that shares prompt KV across multiple rollouts in RL training, delivering 1.63-3.82x speedups on 8B-30B models while remaining mathematically identical to standard attention.

  12. Continual Harness: Online Adaptation for Self-Improving Foundation Agents

    cs.LG 2026-05 conditional novelty 8.0

    Continual Harness automates online self-improvement for foundation-model embodied agents by refining prompts, sub-agents, skills, and memory within one run, cutting button-press costs on Pokemon Red and Emerald and cl...

  13. ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

    cs.LG 2026-05 conditional novelty 8.0

    ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equip...

  14. STARE: Step-wise Temporal Alignment and Red-teaming Engine for Multi-modal Toxicity Attack

    cs.CR 2026-05 unverdicted novelty 8.0

    STARE uses step-wise RL to attack multimodal models, achieving 68% higher attack success rate while revealing that adversarial optimization concentrates conceptual toxicity early and detail toxicity late in the genera...

  15. From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation

    cs.SE 2026-04 unverdicted novelty 8.0

    MLLMs exhibit a Mirage effect by bypassing circuit diagrams in favor of header semantics for Verilog generation; VeriGround with identifier anonymization and D-ORPO training reaches 46% Functional Pass@1 while refusin...

  16. From Context to Skills: Can Language Models Learn from Context Skillfully?

    cs.AI 2026-04 unverdicted novelty 8.0

    Ctx2Skill lets language models autonomously evolve context-specific skills via multi-agent self-play, improving performance on context learning tasks without human supervision.

  17. S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

    cs.CV 2026-04 unverdicted novelty 8.0

    S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.

  18. EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

    cs.CV 2026-04 unverdicted novelty 8.0

    EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

  19. MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning

    cs.CL 2026-04 unverdicted novelty 8.0

    MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6....

  20. RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

    cs.CV 2026-04 unverdicted novelty 8.0

    RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

  21. Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

    cs.LG 2026-04 unverdicted novelty 8.0

    Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.

  22. GIANTS: Generative Insight Anticipation from Scientific Literature

    cs.CL 2026-04 unverdicted novelty 8.0

    GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.

  23. SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology

    cs.AI 2026-03 conditional novelty 8.0

    SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.

  24. SEVerA: Verified Synthesis of Self-Evolving Agents

    cs.LG 2026-03 unverdicted novelty 8.0

    SEVerA uses Formally Guarded Generative Models and a three-stage Search-Verification-Learning process to synthesize self-evolving agents that satisfy hard formal constraints while improving task performance.

  25. Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

    cs.LG 2026-03 unverdicted novelty 8.0

    Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.

  26. RLCracker: Evaluating the Worst-Case Vulnerability of LLM Watermarks with Adaptive RL Attacks

    cs.CR 2025-09 conditional novelty 8.0

    RLCracker is a reinforcement learning attack that erases LLM watermarks at 98.5% success rate with minimal data and generalizes across ten schemes and multiple model sizes.

  27. Flow-GRPO: Training Flow Matching Models via Online RL

    cs.CV 2025-05 unverdicted novelty 8.0

    Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

  28. DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

    cs.CL 2025-04 conditional novelty 8.0

    DeepMath-103K is a new 103K-problem mathematical dataset with high difficulty, rigorous decontamination, and verifiable answers to support RL training of language-model reasoning.

  29. Seek to Segment: Active Perception for Panoramic Referring Segmentation

    cs.CV 2026-07 unverdicted novelty 7.0

    Introduces APRS task and PanoSeeker agent using VLM plus EgoSphere memory for active 360° search and segmentation, outperforming baselines on a new benchmark.

  30. DecompRL: Solving Harder Problems by Learning Modular Code Generation

    cs.LG 2026-07 unverdicted novelty 7.0

    DecompRL is an RL method that learns modular code decomposition for LLMs, enabling exponential candidate generation via recombination to solve harder coding problems with lower GPU cost.

  31. Evidence-State Rewards for Long-Context Reasoning

    cs.AI 2026-07 unverdicted novelty 7.0

    Maven is an RL method using answer-conditioned evidence-state values to assign rewards to add, link, and drop actions on evidence memory, outperforming outcome-only baselines on LongBench v2, LongReason, and RULER.

  32. Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning

    cs.CV 2026-07 unverdicted novelty 7.0

    P2R decouples perception from reasoning in VLMs via a two-stage process and PRA-GRPO alternating RL training, reporting gains such as 93.2% on V-Star for the 4B model over its Qwen3-VL backbone.

  33. Flow-Map GRPO: Reinforcement Learning for Few-Step Flow-Map Generators via Anchored Stochastic Composition

    cs.LG 2026-07 unverdicted novelty 7.0

    Flow-Map GRPO uses anchored stochastic flow map composition to enable GRPO-based RL alignment of deterministic few-step flow-map generators while preserving their marginal paths.

  34. Know When to Stop: Segment-Level Credit Assignment for Reducing Overthinking

    cs.CL 2026-07 unverdicted novelty 7.0

    DASH assigns segment-level credit in reasoning traces using drift toward ground-truth answers, yielding 50.8% accuracy on AIME25 versus 45.4% for GRPO while reducing overthinking behaviors.

  35. Verifiable Rewards for Calibrated Probabilistic Forecasting

    cs.LG 2026-06 unverdicted novelty 7.0

    A verifiable empirical win rate reward combined with gradient masking enables RL training of a 7B model to reach betting-market calibration on NFL win probabilities using only outcome data.

  36. GRPO, Dr. GRPO, and DAPO Are Three Operations on One Number: The Group-Standard-Deviation Identity

    cs.LG 2026-06 unverdicted novelty 7.0

    GRPO, Dr. GRPO, and DAPO are three settings of one dial on the group standard deviation of binary rewards, unified by the group-standard-deviation identity where disagreement equals update magnitude.

  37. GEAR: Guided End-to-End AutoRegression for Image Synthesis

    cs.CV 2026-06 unverdicted novelty 7.0

    GEAR jointly trains VQ tokenizer and AR generator end-to-end via dual hard/soft read-out and representation alignment, achieving up to 10x faster ImageNet gFID convergence than LlamaGen-REPA while generalizing across ...

  38. QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents

    cs.LG 2026-06 unverdicted novelty 7.0

    QVal is a new evaluation framework that directly measures dense supervision quality via Q-alignment to a reference policy, showing simple prompting baselines outperform 21 other methods across environments and models.

  39. TRIAGE: Role-Typed Credit Assignment for Agentic Reinforcement Learning

    cs.LG 2026-06 unverdicted novelty 7.0

    TRIAGE augments GRPO with role-typed segment rewards derived from a judge that detects regression and exploration, yielding higher success rates and fewer turns on ALFWorld, Search-QA, and WebShop.

  40. Arena-T2I Hard: Benchmarking and Improving Faithfulness with Dependency-Aware Checklist

    cs.AI 2026-06 unverdicted novelty 7.0

    Arena-T2I Hard benchmark with ~30 decomposed constraints per prompt and a dependency-aware checklist reward yields better faithfulness-aesthetics trade-off than single-reward or weighted-sum baselines on SD3.5-Medium ...

  41. Learning Where to Look: A Reinforcement Learning Framework for Robust Micro-Ultrasound Prostate Cancer Detection

    cs.CV 2026-06 conditional novelty 7.0

    Prost-RL integrates an RL policy into a foundation-model encoder-decoder to generate interpretable spatial attention maps that improve core-level prostate cancer detection in micro-ultrasound, achieving 79.0 AUROC on ...

  42. Predictable GRPO: A Closed-Form Model of Training Dynamics

    cs.LG 2026-06 unverdicted novelty 7.0

    GRPO updates reduce to a damped oscillator whose mass, damping, and stiffness are fixed by optimizer hyperparameters plus one measured curvature scale, subsuming single-exponential saturation while adding inertial slo...

  43. Predictable GRPO: A Closed-Form Model of Training Dynamics

    cs.LG 2026-06 unverdicted novelty 7.0

    A closed-form inertial model of GRPO dynamics that subsumes single-exponential saturation as its overdamped limit and predicts group-size invariance, stability thresholds, and overdamped-to-oscillatory transitions.

  44. What Drives Interactive Improvement from Feedback?

    cs.AI 2026-06 unverdicted novelty 7.0

    Controlled student-teacher experiments across four benchmarks show interactive gains are driven more by the student's ability to use feedback than by teacher quality, with self-feedback adding little beyond unguided retries.

  45. When Does Online Imitation Learning Help in LLM Post-Training? The Role of (Non-)Realizability Beyond Horizon

    cs.LG 2026-06 unverdicted novelty 7.0

    Online IL overcomes an information-theoretic bottleneck that offline IL faces in non-realizable settings even at horizon 1, under a new structural characterization of reward-relative misspecification.

  46. OmniCoT: A Benchmark for Global and Multi-Step Panoramic Reasoning

    cs.CV 2026-06 unverdicted novelty 7.0

    OmniCoT is a new panoramic reasoning benchmark with 6.7K eval, 1K real, and 14.3K training examples plus a two-stage SFT+GRPO training method to enforce global 360-degree consistency.

  47. ECHO: Learning Epistemically Adaptive Language Agents with Turn-Level Credit

    cs.MA 2026-06 unverdicted novelty 7.0

    ECHO is a clipped policy-gradient method that uses posterior-sensitive rewards to give turn-level epistemic credit in multi-turn information-seeking tasks, outperforming trajectory-level GRPO on a new Clue Selector Ga...

  48. The Mirage of Optimizing Training Policies: Monotonic Inference Policies as the Real Objective for LLM Reinforcement Learning

    cs.LG 2026-06 unverdicted novelty 7.0

    Proposes Monotonic Inference Policy Improvement (MIPI) objective and MIPU two-step update framework to address objective misalignment between training and inference policies in LLM reinforcement learning.

  49. CRAFT: Counterfactual Credit Assignment from Free Sibling Rollouts for Self-Distilled Agentic Reinforcement Learning

    cs.LG 2026-06 unverdicted novelty 7.0

    CRAFT is a three-pillar credit assignment scheme that uses counterfactual token importance from GRPO sibling rollouts to provide signed per-token distillation signals in self-distilled agentic RL.

  50. Masked Diffusion Decoding as $x$-Prediction Flow

    cs.CL 2026-06 unverdicted novelty 7.0

    Masked diffusion LMs can use continuous x-prediction flow with token-wise asynchronous updates and an RL policy network to reach 97% performance on HumanEval using only 25% of the usual decoding budget.

  51. Personalizing MLLMs via Reinforced Multimodal Reference Game

    cs.CV 2026-06 unverdicted novelty 7.0

    RRG trains MLLMs via a reinforced multimodal reference game with contrastive rewards on hard positives and negatives to produce accurate, discriminative concept descriptions, achieving SOTA on personalization benchmarks.

  52. An AI agent for treatment reasoning over a biomedical tool universe

    cs.AI 2026-06 unverdicted novelty 7.0

    ATHENA-R1 is an RL-trained agent using 212 biomedical tools that achieves 94.7% accuracy on drug reasoning and 82.9% on treatment reasoning tasks, outperforming GPT-5 by 17.8 and 10.7 points respectively.

  53. Search for Truth from Reasoning: A Dynamic Representation Editing Framework for Steering LLM Trajectories

    cs.AI 2026-06 unverdicted novelty 7.0

    DynaSteer dynamically steers LLM reasoning trajectories toward truth via pattern clustering, Fisher-LDA projection, and entropy-triggered representation edits, improving performance on MATH and generalizing to coding.

  54. Tandem Reinforcement Learning with Verifiable Rewards

    cs.AI 2026-06 unverdicted novelty 7.0

    TRL extends tandem training to RLVR pipelines, matching GRPO solo reasoning on Qwen3-4B math tasks while improving handoff robustness, reducing distributional drift, and increasing CoT legibility for the junior.

  55. TempAct: Advancing Temporal Plausibility in Autoregressive Video Generation via Planner-Executor RL

    cs.CV 2026-06 unverdicted novelty 7.0

    TempAct introduces a planner-executor RL framework with hierarchical group exploration and rewards to improve temporal consistency in autoregressive video diffusion models.

  56. Reflect-R1: Evidence-Driven Reflection for Self-Correction in Long Video Understanding

    cs.CV 2026-06 unverdicted novelty 7.0

    Reflect-R1 introduces the first evidence-driven self-correction framework for long video understanding using a three-stage pipeline, stage-decoupled RL via SD-GRPO, and a 120K dataset to achieve SOTA on VideoMME and L...

  57. Dockerless: Environment-Free Program Verifier for Coding Agents

    cs.SE 2026-06 unverdicted novelty 7.0

    Dockerless uses agentic repository exploration to verify patches without execution, enabling SFT and RL training of coding agents that reach 62.0/50.0/35.2% resolve rates on SWE-bench Verified/Multilingual/Pro while m...

  58. Supersede: Diagnosing and Training the Memory-Update Gap in LLM Agents

    cs.CL 2026-06 unverdicted novelty 7.0

    The supersession gap in LLM agents—failing to use current facts and discard superseded ones—is a distinct failure not fixed by scale or memory size, but improvable via RL training on a new environment.

  59. Moebius: Serving Mixture-of-Expert Models with Seamless Runtime Parallelism Switch

    cs.DC 2026-06 unverdicted novelty 7.0

    Moebius enables runtime switching between EP and TP for MoE LLMs by resharing weights and KV cache, matching the best static choice and improving RL rollouts by 1.16-1.25x with 215-434 ms switches and 2.4% memory overhead.

  60. OpenFinGym: A Verifiable Multi-Task Gym Environment for Evaluating Quant Agents

    cs.AI 2026-06 unverdicted novelty 7.0

    OpenFinGym is a multi-task verifiable gym environment for quant-finance agents with automated task construction from publications, containerised runtime, paper trading engine, and support for SFT/RL training.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 1924 Pith papers · 29 internal anchors

  1. [1]

    R. Anil, S. Borgeaud, Y. Wu, J. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, D. Silver, S. Petrov, M. Johnson, I. Antonoglou, J. Schrittwieser, A. Glaese, J. Chen, E. Pitler, T. P. Lillicrap, A. Lazaridou, O. Firat, J. Molloy, M. Isard, P. R. Barham, T. Hennigan, B. Lee, F. Viola, M. Reynolds, Y. Xu, R. Doherty, E. Collins, ...

  2. [2]

    Program Synthesis with Large Language Models

    J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

  3. [3]

    Llemma: An Open Language Model For Mathematics

    Z. Azerbayev, H. Schoelkopf, K. Paster, M. D. Santos, S. McAleer, A. Q. Jiang, J. Deng, S. Biderman, and S. Welleck. Llemma: An open language model for mathematics. arXiv preprint arXiv:2310.10631, 2023

  4. [4]

    J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023

  5. [5]

    Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision

    C. Burns, P. Izmailov, J. H. Kirchner, B. Baker, L. Gao, L. Aschenbrenner, Y. Chen, A. Ecoffet, M. Joglekar, J. Leike, et al. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. arXiv preprint arXiv:2312.09390, 2023

  6. [6]

    ChatGLM3 Series: Open Bilingual Chat LLMs

    ChatGLM3 Team. Chatglm3 series: Open bilingual chat llms, 2023. URL https://github.com/THUDM/ChatGLM3

  7. [7]

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herb...

  8. [8]

    W. Chen, X. Ma, X. Wang, and W. W. Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. CoRR, abs/2211.12588, 2022. doi:10.48550/ARXIV.2211.12588. URL https://doi.org/10.48550/arXiv.2211.12588

  9. [9]

    Training Verifiers to Solve Math Word Problems

    K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  10. [10]

    RedPajama: An Open Dataset for Training Large Language Models

    Together Computer. Redpajama: an open dataset for training large language models, Oct. 2023. URL https://github.com/togethercomputer/RedPajama-Data

  11. [11]

    DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    DeepSeek-AI. Deepseek LLM: scaling open-source language models with longtermism. CoRR, abs/2401.02954, 2024. doi:10.48550/ARXIV.2401.02954. URL https://doi.org/10.48550/arXiv.2401.02954

  12. [12]

    Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang. Glm: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320--335, 2022

  13. [13]

    L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig. PAL: program-aided language models. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research,...

  14. [14]

    Z. Gou, Z. Shao, Y. Gong, Y. Shen, Y. Yang, M. Huang, N. Duan, and W. Chen. Tora: A tool-integrated reasoning agent for mathematical problem solving. CoRR, abs/2309.17452, 2023. doi:10.48550/ARXIV.2309.17452. URL https://doi.org/10.48550/arXiv.2309.17452

  15. [15]

    D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. K. Li, F. Luo, Y. Xiong, and W. Liang. Deepseek-coder: When the large language model meets programming -- the rise of code intelligence, 2024

  16. [16]

    Measuring Massive Multitask Language Understanding

    D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

  17. [17]

    Measuring Mathematical Problem Solving With the MATH Dataset

    D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

  18. [18]

    Hai-llm: An Efficient and Lightweight Training Tool for Large Models

    High-flyer. Hai-llm: An efficient and lightweight training tool for large models, 2023. URL https://www.high-flyer.cn/en/blog/hai-llm

  19. [19]

    Inflection-2

    Inflection AI. Inflection-2, 2023. URL https://inflection.ai/inflection-2

  20. [20]

    A. Q. Jiang, S. Welleck, J. P. Zhou, W. Li, J. Liu, M. Jamnik, T. Lacroix, Y. Wu, and G. Lample. Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. arXiv preprint arXiv:2210.12283, 2022

  21. [21]

    A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023

  22. [22]

    FastText.zip: Compressing text classification models

    A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov. Fasttext.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651, 2016

  23. [23]

    W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  24. [24]

    Fast Inference from Transformers via Speculative Decoding

    Y. Leviathan, M. Kalman, and Y. Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274--19286. PMLR, 2023

  25. [25]

    Solving Quantitative Reasoning Problems with Language Models

    A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843--3857, 2022a

  26. [26]

    Solving Quantitative Reasoning Problems with Language Models

    A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, Y. Wu, B. Neyshabur, G. Gur-Ari, and V. Misra. Solving quantitative reasoning problems with language models. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processi...

  27. [27]

    Let's Verify Step by Step

    H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe. Let's verify step by step. arXiv preprint arXiv:2305.20050, 2023

  28. [28]

    Decoupled Weight Decay Regularization

    I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  29. [29]

    H. Luo, Q. Sun, C. Xu, P. Zhao, J. Lou, C. Tao, X. Geng, Q. Lin, S. Chen, and D. Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583, 2023

  30. [30]

    LILA: A Unified Benchmark for Mathematical Reasoning

    S. Mishra, M. Finlayson, P. Lu, L. Tang, S. Welleck, C. Baral, T. Rajpurohit, O. Tafjord, A. Sabharwal, P. Clark, and A. Kalyan. LILA: A unified benchmark for mathematical reasoning. In Y. Goldberg, Z. Kozareva, and Y. Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab...

  31. [31]

    SeaLLMs - Large Language Models for Southeast Asia

    X. Nguyen, W. Zhang, X. Li, M. M. Aljunied, Q. Tan, L. Cheng, G. Chen, Y. Deng, S. Yang, C. Liu, H. Zhang, and L. Bing. Seallms - large language models for southeast asia. CoRR, abs/2312.00738, 2023. doi:10.48550/ARXIV.2312.00738. URL https://doi.org/10.48550/arXiv.2312.00738

  32. [32]

    GPT-4 Technical Report

    OpenAI. GPT4 technical report. arXiv preprint arXiv:2303.08774, 2023

  33. [33]

    Training Language Models to Follow Instructions with Human Feedback

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730--27744, 2022

  34. [34]

    OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text

    K. Paster, M. D. Santos, Z. Azerbayev, and J. Ba. Openwebmath: An open dataset of high-quality mathematical web text. CoRR, abs/2310.06786, 2023. doi:10.48550/ARXIV.2310.06786. URL https://doi.org/10.48550/arXiv.2310.06786

  35. [35]

    L. C. Paulson. Three years of experience with sledgehammer, a practical link between automatic and interactive theorem provers. In R. A. Schmidt, S. Schulz, and B. Konev, editors, Proceedings of the 2nd Workshop on Practical Aspects of Automated Reasoning, PAAR-2010, Edinburgh, Scotland, UK, July 14, 2010, volume 9 of EPiC Series in Computing, pages 1--10...

  36. [36]

    Generative Language Modeling for Automated Theorem Proving

    S. Polu and I. Sutskever. Generative language modeling for automated theorem proving. CoRR, abs/2009.03393, 2020. URL https://arxiv.org/abs/2009.03393

  37. [37]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. 2023

  38. [38]

    Approximating KL Divergence

    J. Schulman. Approximating kl divergence, 2020. URL http://joschu.net/blog/kl-approx.html

  39. [39]

    High-Dimensional Continuous Control Using Generalized Advantage Estimation

    J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015

  40. [40]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  41. [41]

    F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder, D. Zhou, D. Das, and J. Wei. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=fR...

  42. [42]

    F. Song, B. Yu, M. Li, H. Yu, F. Huang, Y. Li, and H. Wang. Preference ranking optimization for human alignment. arXiv preprint arXiv:2306.17492, 2023

  43. [43]

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022

  44. [44]

    T. Tao. Embracing change and resetting expectations, 2023. URL https://unlocked.microsoft.com/ai-anthology/terence-tao/

  45. [45]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. Canton-Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Ko...

  46. [46]

    T. H. Trinh, Y. Wu, Q. V. Le, H. He, and T. Luong. Solving olympiad geometry without human demonstrations. Nature, 625(7995):476--482, 2024

  47. [47]

    P. Wang, L. Li, L. Chen, F. Song, B. Lin, Y. Cao, T. Liu, and Z. Sui. Making large language models better reasoners with alignment. arXiv preprint arXiv:2309.02144, 2023a

  48. [48]

    P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. CoRR, abs/2312.08935, 2023b

  49. [49]

    Z. Wang, R. Xia, and P. Liu. Generative AI for math: Part I - mathpile: A billion-token-scale pretraining corpus for math. CoRR, abs/2312.17120, 2023c. doi:10.48550/ARXIV.2312.17120. URL https://doi.org/10.48550/arXiv.2312.17120

  50. [50]

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html

  51. [51]

    T. Wei, J. Luan, W. Liu, S. Dong, and B. Wang. Cmath: Can your language model pass chinese elementary school math test?, 2023

  52. [52]

    The Isabelle Framework

    M. Wenzel, L. C. Paulson, and T. Nipkow. The isabelle framework. In O. A. Mohamed, C. A. Muñoz, and S. Tahar, editors, Theorem Proving in Higher Order Logics, 21st International Conference, TPHOLs 2008, Montreal, Canada, August 18-21, 2008. Proceedings, volume 5170 of Lecture Notes in Computer Science, pages 33--38. Springer, 2008. doi:10.1007/978-3-5...

  53. [53]

    H. Xia, T. Ge, P. Wang, S.-Q. Chen, F. Wei, and Z. Sui. Speculative decoding: Exploiting speculative execution for accelerating seq2seq generation. In H. Bouamor, J. Pino, and K. Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3909--3925, Singapore, Dec. 2023. Association for Computational Linguistics. doi:10.18...

  54. [54]

    H. Xia, Z. Yang, Q. Dong, P. Wang, Y. Li, T. Ge, T. Liu, W. Li, and Z. Sui. Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding. arXiv preprint arXiv:2401.07851, 2024

  55. [55]

    S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023

  56. [56]

    L. Yu, W. Jiang, H. Shi, J. Yu, Z. Liu, Y. Zhang, J. T. Kwok, Z. Li, A. Weller, and W. Liu. Metamath: Bootstrap your own mathematical questions for large language models. CoRR, abs/2309.12284, 2023. doi:10.48550/ARXIV.2309.12284. URL https://doi.org/10.48550/arXiv.2309.12284

  57. [57]

    Z. Yuan, H. Yuan, C. Li, G. Dong, C. Tan, and C. Zhou. Scaling relationship on learning mathematical reasoning with large language models. arXiv preprint arXiv:2308.01825, 2023a

  58. [58]

    Z. Yuan, H. Yuan, C. Tan, W. Wang, S. Huang, and F. Huang. Rrhf: Rank responses to align language models with human feedback without tears. arXiv preprint arXiv:2304.05302, 2023b

  59. [59]

    X. Yue, X. Qu, G. Zhang, Y. Fu, W. Huang, H. Sun, Y. Su, and W. Chen. Mammoth: Building math generalist models through hybrid instruction tuning. CoRR, abs/2309.05653, 2023. doi:10.48550/ARXIV.2309.05653. URL https://doi.org/10.48550/arXiv.2309.05653

  60. [60]

    MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics

    K. Zheng, J. M. Han, and S. Polu. Minif2f: a cross-system benchmark for formal olympiad-level mathematics. arXiv preprint arXiv:2109.00110, 2021

  61. [61]

    AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

    W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, and N. Duan. AGIEval: A human-centric benchmark for evaluating foundation models. CoRR, abs/2304.06364, 2023. doi:10.48550/arXiv.2304.06364. URL https://doi.org/10.48550/arXiv.2304.06364