Mixed citations

Title resolution pending

Ronald J. Williams · 1992 · Machine Learning · DOI 10.1007/bf00992696

Mixed citation behavior. Most common role is background (50%).

56 Pith papers citing it

3,255 external citations · Crossref

Background 50% of classified citations

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 4 · method 3 · other 1

citation-polarity summary

background 4 · use method 3 · unclear 1

representative citing papers

DecompRL: Solving Harder Problems by Learning Modular Code Generation

cs.LG · 2026-07-02 · unverdicted · novelty 7.0

DecompRL is an RL method that learns modular code decomposition for LLMs, enabling exponential candidate generation via recombination to solve harder coding problems with lower GPU cost.

ECHO: Learning Epistemically Adaptive Language Agents with Turn-Level Credit

cs.MA · 2026-06-29 · unverdicted · novelty 7.0

ECHO is a clipped policy-gradient method that uses posterior-sensitive rewards to give turn-level epistemic credit in multi-turn information-seeking tasks, outperforming trajectory-level GRPO on a new Clue Selector Game benchmark.

World Model Self-Distillation: Training World Models to Solve General Tasks

cs.CV · 2026-06-10 · unverdicted · novelty 7.0

Self-distillation from a caption-conditioned video diffusion model to an image-and-prompt-conditioned executor, enhanced by RL from VLM feedback, enables task solving in world models.

Forward-Learned Discrete Diffusion: Learning how to noise to denoise faster

stat.ML · 2026-05-18 · unverdicted · novelty 7.0

FLDD learns non-Markovian marginal and posterior distributions for the forward process so a factorized reverse process can match the target better and produce higher-quality samples in fewer steps.

Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

cs.LG · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

POISE trains a lightweight probe on the actor's internal states to predict expected rewards for RLVR, matching DAPO performance on math benchmarks with lower compute by avoiding extra rollouts or critic models.

Rethinking Entropy Minimization in Test-Time Adaptation for Autoregressive Models

eess.AS · 2026-05-05 · unverdicted · novelty 7.0

Derives a rigorous entropy minimization formulation for autoregressive test-time adaptation that decomposes into policy gradient and entropy terms, reinterpreting prior methods and improving Whisper ASR across 20+ domains.

Deep Variational Inference Symbolic Regression

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

DVISR performs variational inference over symbolic expression trees and constants by training a neural network with the ELBO as reward, recovering true posteriors in simple test cases.

When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models

cs.IR · 2026-04-29 · unverdicted · novelty 7.0

ReaLM-Retrieve uses step-level uncertainty to trigger retrievals during reasoning, achieving 10.1% better F1 scores and 47% fewer calls on multi-hop QA benchmarks.

How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

cs.LG · 2026-04-28 · unverdicted · novelty 7.0 · 2 refs

A single-parameter Tsallis loss continuum unifies SFT and RLVR, derives time-to-escape bounds for cold start, and yields GARL and PAFT estimators that improve performance on QA reasoning tasks.

Concave Statistical Utility Maximization Bandits via Influence-Function Gradients

stat.ML · 2026-04-24 · unverdicted · novelty 7.0

A framework for concave distributional utility maximization in stochastic bandits via influence-function stochastic gradients and entropic mirror ascent on the simplex, with regret bounds.

EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

EVPO adaptively switches between critic-based and batch-mean advantage estimation using batch-level explained variance to provably achieve no greater variance than the better of PPO or GRPO at every step.

RLGT: A reinforcement learning framework for extremal graph theory

cs.LG · 2026-02-19 · unverdicted · novelty 7.0

RLGT is a modular reinforcement learning framework for extremal graph theory that handles undirected, directed, looped, and multi-colored graphs to facilitate future research.

Owen-Shapley Policy Optimization: A Principled RL Algorithm for Generative Search LLMs

cs.AI · 2026-01-13 · unverdicted · novelty 7.0

OSPO redistributes sequence-level advantages in LLM RL training via Shapley-Owen values on semantic coalitions to improve token-level credit assignment without parametric value models.

Causal Process Models: Reframing Dynamic Causal Graph Discovery as a Reinforcement Learning Problem

cs.LG · 2025-07-18 · unverdicted · novelty 7.0

Causal Process Models reframe dynamic causal graph discovery as multi-agent reinforcement learning to build sparse time-varying graphs only at active interactions, outperforming dense baselines on physical prediction.

Variational Sequential Optimal Experimental Design using Reinforcement Learning

stat.ML · 2023-06-17 · unverdicted · novelty 7.0

vsOED uses a variational one-point reward and RL policy optimization to provide a lower bound on expected information gain for sequential experimental design, supporting nuisance parameters, implicit likelihoods, and multiple design goals.

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

cs.LG · 2023-05-29 · accept · novelty 7.0

DPO derives the optimal policy directly from human preferences via a reparameterized reward model, solving the RLHF objective with only a binary classification loss and no sampling or separate reward model.

Program-as-Weights: A Programming Paradigm for Fuzzy Functions

cs.LG · 2026-07-02 · conditional · novelty 6.0

A 4B compiler model generates LoRA adapters from natural-language specs, enabling a frozen 0.6B interpreter to match Qwen3-32B performance on fuzzy text tasks at 50× less memory.

Don't Let Gains FADE: Breaking Down Policy Gradient Weights in RL

cs.LG · 2026-07-01 · unverdicted · novelty 6.0

FADE is a self-adapting advantage for policy-gradient RL that reads training dynamics to balance positive/negative gradient mass and difficulty focus, yielding faster peak performance and better accuracy-diversity trade-offs than static baselines on LLM reasoning benchmarks.

Scoring Is Not Enough: Addressing Gaps in Utility-fairness Trade-offs for Ranking

cs.IR · 2026-06-24 · unverdicted · novelty 6.0

Scoring functions are sub-optimal for all utility-fairness trade-offs in ranking under a generic fairness formulation, but semi-greedy post-processing can approach the performance of exhaustive post-processing.

Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models

cs.CL · 2026-06-09 · unverdicted · novelty 6.0

A multi-axis RL alignment technique improves pause handling, turn-taking, backchanneling, and interruption response in full-duplex spoken dialogue models by optimizing axis-specific rewards derived from human audio segments.

Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning

cs.LG · 2026-06-08 · unverdicted · novelty 6.0

Dropout-GRPO uses structured dropout to generate trajectory variance for GRPO in latent-reasoning models like Coconut, raising GSM8K pass@1 from 27.29% to 29.01%.

Sampling Triangulations and Calabi-Yau Threefolds with Autoregressive GNNs

hep-th · 2026-05-26 · unverdicted · novelty 6.0

Introduces dualGNN, an autoregressive message-passing GNN using signed circuits to sample uniform fine regular triangulations of lattice polytopes, applied to Calabi-Yau threefolds at h^{1,1}=86 and 128.

Predictive Prefetching for Retrieval-Augmented Generation

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

Introduces predictive prefetching for RAG that anticipates retrieval needs several tokens ahead via three components, reporting up to 43.5% latency reduction and 62.4% TTFT improvement while preserving answer quality.

Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training

cs.LG · 2026-05-16 · unverdicted · novelty 6.0 · 2 refs

Learning-Zone Energy is a new online data selection framework for RL post-training that retains 40% of data per step yet matches or exceeds full-data baselines on math tasks with 36% lower FLOPs.

citing papers explorer

Showing 50 of 56 citing papers.

DecompRL: Solving Harder Problems by Learning Modular Code Generation cs.LG · 2026-07-02 · unverdicted · none · ref 59
DecompRL is an RL method that learns modular code decomposition for LLMs, enabling exponential candidate generation via recombination to solve harder coding problems with lower GPU cost.
ECHO: Learning Epistemically Adaptive Language Agents with Turn-Level Credit cs.MA · 2026-06-29 · unverdicted · none · ref 70
ECHO is a clipped policy-gradient method that uses posterior-sensitive rewards to give turn-level epistemic credit in multi-turn information-seeking tasks, outperforming trajectory-level GRPO on a new Clue Selector Game benchmark.
World Model Self-Distillation: Training World Models to Solve General Tasks cs.CV · 2026-06-10 · unverdicted · none · ref 68
Self-distillation from a caption-conditioned video diffusion model to an image-and-prompt-conditioned executor, enhanced by RL from VLM feedback, enables task solving in world models.
Forward-Learned Discrete Diffusion: Learning how to noise to denoise faster stat.ML · 2026-05-18 · unverdicted · none · ref 15
FLDD learns non-Markovian marginal and posterior distributions for the forward process so a factorized reverse process can match the target better and produce higher-quality samples in fewer steps.
Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States cs.LG · 2026-05-08 · unverdicted · none · ref 35 · 2 links
POISE trains a lightweight probe on the actor's internal states to predict expected rewards for RLVR, matching DAPO performance on math benchmarks with lower compute by avoiding extra rollouts or critic models.
Rethinking Entropy Minimization in Test-Time Adaptation for Autoregressive Models eess.AS · 2026-05-05 · unverdicted · none · ref 31
Derives a rigorous entropy minimization formulation for autoregressive test-time adaptation that decomposes into policy gradient and entropy terms, reinterpreting prior methods and improving Whisper ASR across 20+ domains.
Deep Variational Inference Symbolic Regression cs.LG · 2026-05-01 · unverdicted · none · ref 4
DVISR performs variational inference over symbolic expression trees and constants by training a neural network with the ELBO as reward, recovering true posteriors in simple test cases.
When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models cs.IR · 2026-04-29 · unverdicted · none · ref 38
ReaLM-Retrieve uses step-level uncertainty to trigger retrievals during reasoning, achieving 10.1% better F1 scores and 47% fewer calls on multi-hop QA benchmarks.
How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum cs.LG · 2026-04-28 · unverdicted · none · ref 3 · 2 links
A single-parameter Tsallis loss continuum unifies SFT and RLVR, derives time-to-escape bounds for cold start, and yields GARL and PAFT estimators that improve performance on QA reasoning tasks.
Concave Statistical Utility Maximization Bandits via Influence-Function Gradients stat.ML · 2026-04-24 · unverdicted · none · ref 14
A framework for concave distributional utility maximization in stochastic bandits via influence-function stochastic gradients and entropic mirror ascent on the simplex, with regret bounds.
EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training cs.LG · 2026-04-21 · unverdicted · none · ref 36
EVPO adaptively switches between critic-based and batch-mean advantage estimation using batch-level explained variance to provably achieve no greater variance than the better of PPO or GRPO at every step.
RLGT: A reinforcement learning framework for extremal graph theory cs.LG · 2026-02-19 · unverdicted · none · ref 55
RLGT is a modular reinforcement learning framework for extremal graph theory that handles undirected, directed, looped, and multi-colored graphs to facilitate future research.
Owen-Shapley Policy Optimization: A Principled RL Algorithm for Generative Search LLMs cs.AI · 2026-01-13 · unverdicted · none · ref 3
OSPO redistributes sequence-level advantages in LLM RL training via Shapley-Owen values on semantic coalitions to improve token-level credit assignment without parametric value models.
Causal Process Models: Reframing Dynamic Causal Graph Discovery as a Reinforcement Learning Problem cs.LG · 2025-07-18 · unverdicted · none · ref 26
Causal Process Models reframe dynamic causal graph discovery as multi-agent reinforcement learning to build sparse time-varying graphs only at active interactions, outperforming dense baselines on physical prediction.
Variational Sequential Optimal Experimental Design using Reinforcement Learning stat.ML · 2023-06-17 · unverdicted · none · ref 57
vsOED uses a variational one-point reward and RL policy optimization to provide a lower bound on expected information gain for sequential experimental design, supporting nuisance parameters, implicit likelihoods, and multiple design goals.
Direct Preference Optimization: Your Language Model is Secretly a Reward Model cs.LG · 2023-05-29 · accept · none · ref 49
DPO derives the optimal policy directly from human preferences via a reparameterized reward model, solving the RLHF objective with only a binary classification loss and no sampling or separate reward model.
Program-as-Weights: A Programming Paradigm for Fuzzy Functions cs.LG · 2026-07-02 · conditional · none · ref 38
A 4B compiler model generates LoRA adapters from natural-language specs, enabling a frozen 0.6B interpreter to match Qwen3-32B performance on fuzzy text tasks at 50× less memory.
Don't Let Gains FADE: Breaking Down Policy Gradient Weights in RL cs.LG · 2026-07-01 · unverdicted · none · ref 70
FADE is a self-adapting advantage for policy-gradient RL that reads training dynamics to balance positive/negative gradient mass and difficulty focus, yielding faster peak performance and better accuracy-diversity trade-offs than static baselines on LLM reasoning benchmarks.
Scoring Is Not Enough: Addressing Gaps in Utility-fairness Trade-offs for Ranking cs.IR · 2026-06-24 · unverdicted · none · ref 52
Scoring functions are sub-optimal for all utility-fairness trade-offs in ranking under a generic fairness formulation, but semi-greedy post-processing can approach the performance of exhaustive post-processing.
Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models cs.CL · 2026-06-09 · unverdicted · none · ref 44
A multi-axis RL alignment technique improves pause handling, turn-taking, backchanneling, and interruption response in full-duplex spoken dialogue models by optimizing axis-specific rewards derived from human audio segments.
Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning cs.LG · 2026-06-08 · unverdicted · none · ref 33
Dropout-GRPO uses structured dropout to generate trajectory variance for GRPO in latent-reasoning models like Coconut, raising GSM8K pass@1 from 27.29% to 29.01%.
Sampling Triangulations and Calabi-Yau Threefolds with Autoregressive GNNs hep-th · 2026-05-26 · unverdicted · none · ref 17
Introduces dualGNN, an autoregressive message-passing GNN using signed circuits to sample uniform fine regular triangulations of lattice polytopes, applied to Calabi-Yau threefolds at h^{1,1}=86 and 128.
Predictive Prefetching for Retrieval-Augmented Generation cs.CL · 2026-05-18 · unverdicted · none · ref 39
Introduces predictive prefetching for RAG that anticipates retrieval needs several tokens ahead via three components, reporting up to 43.5% latency reduction and 62.4% TTFT improvement while preserving answer quality.
Learning-Zone Energy: Online Data Selection for Efficient RL Post-Training cs.LG · 2026-05-16 · unverdicted · none · ref 37 · 2 links
Learning-Zone Energy is a new online data selection framework for RL post-training that retains 40% of data per step yet matches or exceeds full-data baselines on math tasks with 36% lower FLOPs.
Physics Guided Generative Optimization for Trotter Suzuki Decomposition quant-ph · 2026-05-13 · unverdicted · none · ref 34 · 2 links
P-GONE applies generative ML to optimize Trotter-Suzuki decompositions, reporting up to 19.4x circuit depth reduction at F >= 0.95 versus Qiskit baselines on structured Hamiltonians.
Plan Before You Trade: Inference-Time Optimization for RL Trading Agents cs.LG · 2026-05-12 · unverdicted · none · ref 23
FPILOT optimizes pre-trained RL trading policies at inference time using forecasted price trajectories to improve portfolio allocations and risk-adjusted returns on the DJ30 benchmark.
Sensitivity Analysis in the Face of Rare Events cond-mat.stat-mech · 2026-05-09 · unverdicted · none · ref 25
A pipeline combining importance sampling with Markov state models, chain-rule sensitivities, and RiteWeight reweighting enables efficient parameter optimization for rare-event dynamics in nonequilibrium systems.
Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime cs.LG · 2026-05-06 · unverdicted · none · ref 21 · 2 links
Prefix Sampling replays self-generated trajectory prefixes to control rollout pass rates near 50% in binary-reward RL, delivering wall-clock speedups and modest performance gains on SWE-bench Verified and AIME tasks.
Quantum Hierarchical Reinforcement Learning via Variational Quantum Circuits cs.LG · 2026-05-05 · unverdicted · none · ref 23
Hybrid agent with variational quantum circuits for feature extraction in hierarchical RL outperforms classical baselines with 66% parameter savings, but quantum value estimation degrades results.
Hidden Coalitions in Multi-Agent AI: A Spectral Diagnostic from Internal Representations cs.AI · 2026-05-04 · unverdicted · none · ref 20
Spectral partitioning on pairwise mutual-information graphs from agent hidden states detects representational coalitions that behavioral measures miss in multi-agent AI.
Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives cs.CL · 2026-04-22 · unverdicted · none · ref 196
A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.
MICA: Multi-granularity Intertemporal Credit Assignment for Long-Horizon Emotional Support Dialogue cs.CL · 2026-03-06 · unverdicted · none · ref 46
MICA combines incremental per-turn distance rewards and Monte Carlo returns from a shared potential function over user support states to create a mixed advantage signal that enables stable multi-turn RL optimization for emotional support dialogues.
DAWM: Diffusion Action World Models for Offline Reinforcement Learning via Action-Inferred Transitions cs.LG · 2025-09-23 · unverdicted · none · ref 30
DAWM introduces a modular diffusion world model with an inverse dynamics model to produce complete synthetic transitions that improve conservative offline RL algorithms like TD3BC and IQL on D4RL tasks.
Scalable Option Learning in High-Throughput Environments cs.LG · 2025-08-30 · unverdicted · none · ref 73
SOL is a new hierarchical RL algorithm that reaches 35x higher throughput and outperforms flat agents when trained on 30 billion frames in NetHack while showing positive scaling.
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning cs.AI · 2025-07-01 · conditional · none · ref 152
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents cs.AI · 2024-08-13 · unverdicted · none · ref 144
Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.
Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive cs.CL · 2024-02-20 · conditional · none · ref 293
DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods and outperforms standard DPO on diverse datasets, MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.
Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning cs.LG · 2019-10-01 · conditional · none · ref 20
AWR learns policies via advantage-weighted supervised regression on actions, achieving competitive off-policy performance on Gym tasks and strong results from static data alone.
Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives cs.LG · 2019-06-25 · unverdicted · none · ref 42
RL policies decompose into information-regularized primitives that compete by requesting state information amounts, with the greediest one acting, yielding better generalization than flat or hierarchical baselines.
Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents cs.LG · 2026-06-10 · unverdicted · none · ref 49
SGCD improves held-out scores on AppWorld and tau^3-airline by using LLM-summarized sibling contrasts to reshape GRPO advantages while keeping policy gradient in charge of the actor update.
SecRL-Prune: Structured Reinforcement Learning-Based Pruning of CodeLLMs for Preserving Adversarial Code Mutation cs.CR · 2026-06-04 · unverdicted · none · ref 28
SecRL-Prune learns layer-wise pruning policies via RL on CodeLLMs, preserving higher pass@k and var@k than baselines at 10-30% compression on HumanEval and enabling semantics-preserving mutations that reduce malware detections in a case study.
On the Scaling of PEFT: Towards Million Personal Models of Trillion Parameters cs.LG · 2026-06-01 · unverdicted · none · ref 33
PEFT adapters are positioned as persistent personal state on foundation models, organized via Scale Up, Scale Down, and Scale Out axes, with MinT as an infrastructure example for managing them.
$\boldsymbol{f}$-OPD: Stabilizing Long-Horizon On-Policy Distillation with Freshness-Aware Control cs.LG · 2026-05-18 · unverdicted · none · ref 34
f-OPD decomposes on-policy distillation drift into rollout and supervision components, then applies a sample-level freshness score to adaptively limit stale data influence and stabilize long-horizon agent training.
Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models cs.AI · 2026-05-08 · unverdicted · none · ref 55
Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.
Compute Aligned Training: Optimizing for Test Time Inference cs.LG · 2026-04-27 · unverdicted · none · ref 27 · 2 links
Derives new loss functions for SFT and RL that optimize directly for test-time inference operators like aggregation or filtering, with empirical gains in scaling.
Polychromic Objectives for Reinforcement Learning cs.LG · 2025-09-29 · unverdicted · none · ref 42
Introduces polychromic objectives adapted into PPO via vine sampling and modified advantages, showing higher success rates and better coverage under perturbations on BabyAI, Minigrid, and algorithmic tasks.
The Serial Scaling Hypothesis cs.LG · 2025-07-16 · unverdicted · none · ref 125
The serial scaling hypothesis formalizes inherently serial problems in complexity theory and demonstrates that diffusion models cannot solve them.
Aligning Modalities in Vision Large Language Models via Preference Fine-tuning cs.LG · 2024-02-18 · unverdicted · none · ref 61
POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.
Learning Adversarial Augmentation Policies for Robust Garlic Seedling Detection cs.CV · 2026-06-25 · unverdicted · none · ref 51
A new outdoor garlic seedling dataset and adversarial augmentation policy learning improve detection AP50 to 91.6% and missing-seedling F1 to 67% under variable illumination.
Optimal sequential decision-making for error propagation mitigation in digital twins cs.LG · 2026-04-24 · unverdicted · none · ref 24
Error propagation mitigation in digital twins is cast as an MDP/POMDP with HMM-derived regimes as states, where the MDP policy maximizes reward and the POMDP recovers 95% of that performance.