Path-Coupled Bellman Flows for Distributional Reinforcement Learning

Boyang Xu; Hao Yan; Qing Zou; Siqin Yang

arxiv: 2605.08253 · v2 · pith:YL2UTCIQnew · submitted 2026-05-07 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-30 23:05 UTC · model grok-4.3

keywords: distributional reinforcement learning · flow matching · Bellman flows · return distributions · control variates · path coupling · offline RL

The pith

Path-Coupled Bellman Flows maintain pathwise affine relations between current and successor return flows using shared base noise to learn distributions without requiring fixed points at every intermediate time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Path-Coupled Bellman Flows, a continuous-time flow-matching method for distributional reinforcement learning. It constructs paths that begin at the base prior at time zero, reach the Bellman target at time one, and keep an affine relation to the successor flow at all intermediate times while coupling the two flows through identical base noise. A lambda-parameterized control-variate target lets users recover an unbiased Bellman sample when lambda equals zero or accept controlled bias for lower variance when lambda is positive. The design avoids both boundary mismatch at the flow source and independent-noise variance that appear in prior flow-based approaches. Experiments on tractable Markov reward processes, OGBench, and D4RL tasks report higher fidelity and greater training stability together with competitive offline reinforcement learning results.

Core claim

PCBF learns return distributions with flow matching using source-consistent Bellman-coupled paths: the current path starts from the required base prior at t=0, reaches the Bellman target at t=1, and maintains a pathwise affine relation to the successor flow at intermediate times without requiring time-t marginals to satisfy a distributional Bellman fixed point for all t. PCBF couples current and successor return flows through shared base noise and uses a λ-parameterized control-variate target where λ=0 recovers an unbiased sample Bellman target while λ>0 trades controlled bias for variance reduction.
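
As a minimal sketch of the path construction in the core claim, assuming linear interpolants with one shared base-noise draw $X_0$ (successor path $Z'_t = tX'_1 + (1-t)X_0$, current path $Z_t = t(R+\gamma X'_1) + (1-t)X_0$), the snippet below checks the two boundary conditions and one affine identity numerically. The function name `coupled_paths`, the reward and successor-sample distributions, and the particular identity tested are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def coupled_paths(r, x1_next, x0, t, gamma):
    """Linear interpolants for current and successor returns sharing base noise x0."""
    z_next = t * x1_next + (1.0 - t) * x0                 # successor path Z'_t
    z_cur = t * (r + gamma * x1_next) + (1.0 - t) * x0    # current path Z_t
    return z_cur, z_next

rng = np.random.default_rng(0)
n, gamma = 5, 0.9
r = rng.uniform(size=n)              # hypothetical rewards
x1_next = rng.normal(2.0, 1.5, n)    # hypothetical successor return samples X'_1
x0 = rng.normal(size=n)              # shared base noise X_0

z0, _ = coupled_paths(r, x1_next, x0, 0.0, gamma)   # should equal the base noise
z1, _ = coupled_paths(r, x1_next, x0, 1.0, gamma)   # should equal R + gamma * X'_1

t = 0.3
zt, zt_next = coupled_paths(r, x1_next, x0, t, gamma)
# Affine relation implied by the shared-noise interpolants (derived, not quoted):
affine_rhs = t * r + gamma * zt_next + (1.0 - gamma) * (1.0 - t) * x0

print(np.allclose(z0, x0), np.allclose(z1, r + gamma * x1_next), np.allclose(zt, affine_rhs))
```

With independent noise for the two paths the last check would generally fail, which is one way to read the role of the shared-noise coupling.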

What carries the argument

Source-consistent Bellman-coupled paths that enforce a pathwise affine relation between current and successor flows while sharing base noise.

If this is right

Improved distributional fidelity is observed on analytically tractable Markov reward processes.
Training stability increases on OGBench and D4RL benchmarks.
The method produces competitive offline reinforcement learning performance.
Setting lambda to zero recovers an unbiased sample Bellman target while positive lambda values reduce variance at the cost of bias.
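
The bias-variance role of lambda can be illustrated on a toy case. The sketch below assumes a Gaussian successor return $X'_1 \sim \mathcal{N}(\mu, \sigma^2)$, a constant reward, and linear interpolants with shared base noise; it replaces the learned target network with the closed-form population velocity of that Gaussian case. All names (`successor_velocity`, `lambda_target`) and parameter values are hypothetical, and the example measures only the spread of the sampled targets, not the bias of the resulting regression.

```python
import numpy as np

def successor_velocity(z, t, mu, sigma):
    """Closed-form E[X'_1 - X_0 | Z'_t = z] for a Gaussian successor return."""
    denom = t**2 * sigma**2 + (1.0 - t) ** 2
    beta = (t * sigma**2 - (1.0 - t)) / denom
    return mu + beta * (z - t * mu)

def lambda_target(r, x1_next, x0, t, gamma, lam, mu, sigma):
    """Sample Bellman target plus lam times a control variate built from shared noise."""
    base = r + gamma * x1_next - x0                  # lam = 0 sample target
    z_next = t * x1_next + (1.0 - t) * x0            # successor path with shared x0
    cv = successor_velocity(z_next, t, mu, sigma) - (x1_next - x0)
    return base + lam * cv

rng = np.random.default_rng(0)
n, gamma, t, mu, sigma, r = 200_000, 0.9, 0.5, 2.0, 1.0, 1.0
x0 = rng.normal(size=n)                              # shared base noise
x1_next = mu + sigma * rng.normal(size=n)            # successor return samples

lams = [0.0, 0.5, 0.9]
targets = [lambda_target(r, x1_next, x0, t, gamma, lam, mu, sigma) for lam in lams]
means = [float(u.mean()) for u in targets]
stds = [float(u.std()) for u in targets]
for lam, m, s in zip(lams, means, stds):
    print(f"lambda={lam:.1f}  mean={m:.3f}  std={s:.3f}")
```

For these particular values ($\sigma = 1$, $t = 0.5$) the target variance works out to $(\gamma-\lambda)^2 + (1-\lambda)^2$, so the spread shrinks as lambda approaches $\gamma$ while the marginal mean stays at $r + \gamma\mu$. The conditional mean given the current path point does change with lambda, which is where the bias enters.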

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The pathwise consistency requirement may allow stable learning in settings where enforcing marginal fixed points at every time remains computationally expensive.
Shared-noise coupling could be tested for variance reduction in other continuous-time generative models outside reinforcement learning.
If the affine relation holds reliably, longer-horizon tasks might benefit from propagating distributional information without repeated projection steps.

Load-bearing premise

A pathwise affine relation between current and successor flows can be maintained at intermediate times without requiring the time-t marginals to satisfy a distributional Bellman fixed point for all t.
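
One way to see what this premise asks for, assuming linear interpolants with shared base noise $X_0$ (an assumption about the construction that the abstract does not spell out), is to write both paths and subtract the Bellman map applied at time $t$:

```latex
\begin{align*}
Z'_t &= t\,X'_1 + (1-t)\,X_0, \\
Z_t  &= t\,(R + \gamma X'_1) + (1-t)\,X_0 \\
     &= tR + \gamma Z'_t + (1-\gamma)(1-t)\,X_0, \\
Z_t - (R + \gamma Z'_t) &= (1-t)\,\bigl[(1-\gamma)X_0 - R\bigr].
\end{align*}
```

Under this assumption the pathwise Bellman residual vanishes only at $t=1$. At $t=0$ the current path equals $X_0$, so the source is the base prior, whereas demanding $Z_0 = R + \gamma Z'_0$ would force the source to be $R + \gamma X_0$ instead.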

What would settle it

On an analytically tractable MRP whose exact return distribution is known by dynamic programming, train PCBF and measure the Wasserstein distance between the learned distribution and the exact one; a distance comparable to or larger than that of uncoupled flow baselines would falsify the fidelity claim.
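
A minimal sketch of such a check, assuming a hypothetical MRP with discount 1/2 and i.i.d. Bernoulli(1/2) rewards, whose discounted return is uniform on [0, 2]. The functions `sample_returns` and `w1_to_uniform` are illustrative names; the "learned" samples are stand-ins drawn directly from the environment, and the quantile-matching estimate is only an approximation of the 1-Wasserstein distance.

```python
import numpy as np

def sample_returns(n, horizon, rng):
    """Truncated returns G = sum_k 2^{-k} R_k with R_k ~ Bernoulli(1/2)."""
    rewards = rng.integers(0, 2, size=(n, horizon))
    discounts = 0.5 ** np.arange(horizon)
    return rewards @ discounts

def w1_to_uniform(samples, low, high):
    """Approximate W1 distance to Uniform[low, high] by matching sorted samples to quantiles."""
    x = np.sort(np.asarray(samples, dtype=float))
    levels = (np.arange(x.size) + 0.5) / x.size
    return float(np.mean(np.abs(x - (low + (high - low) * levels))))

rng = np.random.default_rng(0)
good = sample_returns(20_000, 30, rng)   # stand-in for an accurate learned model
bad = good + 0.3                         # stand-in for a biased learned model

w_good = w1_to_uniform(good, 0.0, 2.0)
w_bad = w1_to_uniform(bad, 0.0, 2.0)
print(f"W1 accurate stand-in: {w_good:.4f}   W1 biased stand-in: {w_bad:.4f}")
```

In an actual test the two sample sets would come from PCBF and from an uncoupled flow baseline trained on the same MRP, and the comparison of the two distances is what would support or undercut the fidelity claim.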

Figures

Figures reproduced from arXiv: 2605.08253 by Boyang Xu, Hao Yan, Qing Zou, Siqin Yang.

**Figure 1.** The architecture of Path-Coupled Bellman Flows (PCBF). Using this control variate, we define the PCBF training target as follows: $u_t^{\lambda} := (R + \gamma X' - X_0) + \lambda \bigl[ v_{\theta^-}(t, Z_t^{s'} \mid s', a') - (X' - X_0) \bigr]$ (13). Setting $\lambda = 0$ recovers the baseline BCFM estimator (unbiased, high variance). Nonzero $\lambda$ introduces a variance-reducing correction at the cost of potential bias. Early in training, $\lambda \approx \gamma$ is often effective,…

**Figure 2.** Corrected Bellman residual $r_{\mathrm{corr}}(t, N)$ on Solitaire Dice. Shared-noise PCBF (blue) maintains lower residuals than independent-noise coupling (orange) across times and budgets. Toy Environments. On analytically tractable MRPs, PCBF closely matches ground-truth return laws across discrete heavy-tailed, continuous uniform, and long-horizon multimodal distributions. The strongest gains over Value Flows appear…

**Figure 3.** Learned PCBF maps on toy environments. Top left: Solitaire; top right: Bernoulli; bottom: Discrete MC. Additionally, to rigorously assess distributional fidelity, we evaluate PCBF against Value Flows with varying dcfm coefficients on the toy environments.

**Figure 3.** Distributional accuracy comparison on toy environments. Learned return CDFs for PCBF and Value Flows (with dcfm ∈ {0, 0.5, 1}) compared against ground-truth references. […] that increasing the consistency coefficient (dcfm) in Value Flows systematically degrades distributional accuracy, suggesting that strict trajectory-wide consistency conflicts with the boundary conditions required for accurate optimal tra…

**Figure 4.** Distributional accuracy comparison on toy environments. Learned return CDFs for PCBF and Value Flows (with dcfm ∈ {0, 0.5, 1}) compared against ground-truth references.

**Figure 5.** Figure 5 contrasts the stability of our method against Value Flows on the Solitaire and Discrete MC tasks. Increasing the DCFM coefficient (dcfm) in Value Flows systematically degrades distributional accuracy, consistent with enforcing a full-t Bellman-shaped self-consistency term that conflicts with the Gaussian source boundary. In contrast, PCBF’s λ-target decouples variance reduction from the source/Bellman-e…

**Figure 5.** Ablation study of the λ parameter in PCBF. Red stars denote the best-performing λ on representative OGBench and D4RL tasks.

**Figure 6.** OGBench Tasks. OGBench (Park et al., 2025a). OGBench is originally designed for offline goal-conditioned reinforcement learning. Following prior work, we adopt its single-task variants (“-singletask”) to benchmark standard reward-maximizing offline RL methods. In each environment, five predefined evaluation goals are provided, yielding five corresponding single-task variants (from -singletask-task1 to -sing…

**Figure 7.** Ablation study of the λ parameter in PCBF. Red stars denote the best-performing λ on representative OGBench and D4RL tasks.

**Figure 7.** Variance reduction via λ-parameterized control variates. Larger λ yields smoother loss trajectories (lower standard deviation), demonstrating effective variance reduction in Bellman targets. Bias–variance trade-off. While increasing λ reduces optimization variance,…

**Figure 8.** Variance reduction via λ-parameterized control variates. Larger λ yields smoother loss trajectories (lower standard deviation), demonstrating effective variance reduction in Bellman targets. Bias–variance trade-off. While increasing λ reduces optimization variance,…

**Figure 8.** Hyperparameter sensitivity analysis (PCBF vs. Value Flows). We compare the impact of increasing the regularization coefficient on distributional accuracy (Wasserstein distance). Orange (dashed): increasing the Value Flows consistency coefficient (dcfm) causes rapid performance degradation, particularly in complex environments like Discrete MC. Blue (solid): our PCBF control variate (λ) remains robust, maint…

**Figure 9.** Hyperparameter sensitivity analysis (PCBF vs. Value Flows). We compare the impact of increasing the regularization coefficient on distributional accuracy (Wasserstein distance). Orange (dashed): increasing the Value Flows consistency coefficient (dcfm) causes rapid performance degradation, particularly in complex environments like Discrete MC. Blue (solid): our PCBF control variate (λ) remains robust, maint…

**Figure 11.** Corrected Bellman residual $r_{\mathrm{corr}}(t, N)$ on Solitaire Dice. Shared-noise PCBF (blue) maintains lower residuals than independent-noise coupling (orange) across times and budgets.

**Figure 10.** Full distributional accuracy comparison. PCBF (blue) consistently tracks the ground-truth CDF (dashed black) more accurately than Value Flows (red/green), particularly in high-variance regimes.

**Figure 9.** Full distributional accuracy comparison. PCBF (blue) consistently tracks the ground-truth CDF (dashed black) more accurately than Value Flows (red/green), particularly in high-variance regimes.

**Figure 11.** Distributional flow analysis on the Discrete MC environment. We visualize the learned PCBF return distributions across states s = 1 to s = 20. The estimated probability density of the flow-transported samples (blue filled) is compared against ground-truth Monte Carlo rollouts (black dashed lines). Characteristic flow trajectories transporting random noise samples (t = 0) to the target return distribution (…

**Figure 10.** Distributional flow analysis on the Discrete MC environment. We visualize the learned PCBF return distributions across states s = 1 to s = 20. The estimated probability density of the flow-transported samples (blue filled) is compared against ground-truth Monte Carlo rollouts (black dashed lines). Characteristic flow trajectories transporting random noise samples (t = 0) to the target return distribution (…

Original abstract

Distributional reinforcement learning (DRL) models the full return distribution, but existing finite-support or quantile-based methods rely on projections, while recent flow-based approaches can suffer from *boundary mismatch* at the flow source or from *high-variance* bootstrapping when current and successor noises are independent. We propose Path-Coupled Bellman Flows (PCBF), a continuous-time DRL method that learns return distributions with flow matching using **source-consistent Bellman-coupled paths**: the current path starts from the required base prior at $t=0$, reaches the Bellman target at $t=1$, and maintains a pathwise affine relation to the successor flow at intermediate times (without requiring time-$t$ marginals to satisfy a distributional Bellman fixed point for all $t$). PCBF couples current and successor return flows through shared base noise and uses a $\lambda$-parameterized control-variate target: $\lambda=0$ recovers an unbiased sample Bellman target, while $\lambda>0$ trades controlled bias for variance reduction. Experiments on analytically tractable MRPs, OGBench, and D4RL show improved distributional fidelity and training stability, and competitive offline RL performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PCBF adds pathwise affine coupling and a lambda control variate to flow-based distributional RL, but the decision to skip intermediate Bellman fixed points leaves a gap that needs checking.

PCBF couples current and successor return flows through shared base noise and a pathwise affine relation that holds at intermediate times, while only enforcing the Bellman target at t=1. The lambda parameter lets you recover an unbiased sample target or trade some bias for lower variance. This directly targets the boundary mismatch and independent-noise variance problems called out for earlier flow methods.

The construction is new relative to the limitations described. Experiments on analytically tractable MRPs plus OGBench and D4RL report better distributional fidelity, training stability, and competitive offline RL performance. That gives concrete evidence the approach can be implemented and run on standard benchmarks.

The soft spot is the central modeling choice. By design the method does not require time-t marginals to satisfy the distributional Bellman fixed point, yet still claims the terminal flows solve the correct equation. The stress-test note flags that the flow-matching loss may not simultaneously enforce the affine relation and convergence to the right distribution; the abstract supplies no derivation showing uniqueness or absence of drift. Without error bars, lambda ablations, or a proof sketch, it is hard to tell how much of the reported fidelity comes from the coupling versus other implementation details. Lambda is treated as a free hyperparameter, which is reasonable but leaves open how sensitive results are to its value.

This is for people already working on continuous distributional RL. A reader focused on variance reduction in bootstrapped targets could extract the lambda idea. The paper has a clear technical proposal and empirical results on accepted benchmarks, so it deserves peer review even if the theory section needs expansion to address the intermediate-marginal concern.

Referee Report

2 major / 2 minor

Summary. The paper proposes Path-Coupled Bellman Flows (PCBF), a continuous-time distributional RL method based on flow matching. It introduces source-consistent Bellman-coupled paths that start from the base prior at t=0, reach the Bellman target at t=1, and maintain a pathwise affine relation between current and successor flows at intermediate times via shared base noise and a λ-parameterized control-variate target (λ=0 recovers unbiased sampling; λ>0 trades bias for variance reduction). The method claims this works without requiring time-t marginals to obey the distributional Bellman fixed point for all t, yielding improved distributional fidelity, training stability, and competitive offline RL performance on analytically tractable MRPs, OGBench, and D4RL.

Significance. If the construction is correct, PCBF could meaningfully advance flow-based DRL by mitigating boundary mismatch and high-variance bootstrapping through coupled paths and control variates. The multi-benchmark evaluation (including analytically tractable cases) is a strength for assessing distributional fidelity. However, the absence of detailed derivations, error bars, and ablations in the presented material limits immediate impact assessment.

major comments (2)

[Abstract] The central claim that a pathwise affine relation between current and successor flows can be maintained for t in (0,1) 'without requiring time-t marginals to satisfy a distributional Bellman fixed point for all t' is load-bearing for novelty and correctness. No derivation is supplied showing that the flow-matching loss simultaneously enforces the affine relation and convergence to the unique solution of the distributional Bellman equation at t=1; the skeptic's concern that intermediate marginals may drift therefore remains unaddressed.
[Experiments] (Implied by abstract claims.) Reported gains in distributional fidelity and stability on MRPs, OGBench, and D4RL are presented without error bars, ablation results on λ, or direct comparison of λ>0 versus λ=0 targets. This makes it impossible to isolate the contribution of the control-variate mechanism or confirm that performance is not driven by the shared-noise coupling alone.

minor comments (2)

[Method] Notation for the λ-parameterized target and the precise form of the affine coupling (e.g., how the shared base noise enters the ODE) should be stated explicitly with an equation reference rather than described only in prose.
[Abstract] The abstract mentions 'analytically tractable MRPs' but does not specify which MRPs or what exact metrics (e.g., Wasserstein distance to ground-truth return distribution) were used; this should be clarified for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below and will revise the manuscript to strengthen both the theoretical justification and the experimental reporting.

Referee: [Abstract] The central claim that a pathwise affine relation between current and successor flows can be maintained for t in (0,1) 'without requiring time-t marginals to satisfy a distributional Bellman fixed point for all t' is load-bearing for novelty and correctness. No derivation is supplied showing that the flow-matching loss simultaneously enforces the affine relation and convergence to the unique solution of the distributional Bellman equation at t=1; the skeptic's concern that intermediate marginals may drift therefore remains unaddressed.

Authors: We agree that an explicit derivation is necessary to substantiate the claim. In the revised manuscript we will insert a dedicated subsection (likely in Section 3) that derives how the flow-matching objective, when applied to source-consistent Bellman-coupled paths with shared base noise, maintains the required pathwise affine relation for all t in (0,1) while the t=1 boundary condition alone guarantees convergence to the unique distributional Bellman solution. The coupling construction prevents marginal drift by ensuring that any deviation at intermediate times is corrected through the shared-noise transport, without imposing the fixed-point condition at every t.
Revision: yes

Referee: [Experiments] (Implied by abstract claims.) Reported gains in distributional fidelity and stability on MRPs, OGBench, and D4RL are presented without error bars, ablation results on λ, or direct comparison of λ>0 versus λ=0 targets. This makes it impossible to isolate the contribution of the control-variate mechanism or confirm that performance is not driven by the shared-noise coupling alone.

Authors: We concur that the current experimental section lacks the statistical detail needed to isolate the control-variate contribution. The revised paper will report all metrics with error bars computed over at least five independent random seeds, include a full ablation table varying λ across {0, 0.1, 0.5, 1.0}, and add side-by-side plots comparing λ=0 (unbiased) versus λ>0 (control-variate) targets on the same MRP and D4RL tasks to quantify the variance-reduction effect attributable to the λ-parameterized target.
Revision: yes

Circularity Check

0 steps flagged

PCBF derivation is self-contained with no circular reductions

The paper defines PCBF via explicit construction of source-consistent Bellman-coupled paths that enforce the pathwise affine relation and shared base noise by design, together with a tunable λ control-variate target whose bias-variance tradeoff is stated directly. Performance claims rest on empirical results across analytically tractable MRPs, OGBench, and D4RL rather than any reduction of the reported fidelity or stability to a fitted parameter or self-referential equation. No self-citations, uniqueness theorems, or ansatzes imported from prior work appear in the derivation; the central modeling choice (affine coupling without intermediate-t fixed-point enforcement) is presented as an explicit design decision whose consequences are evaluated externally.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on one new domain assumption about path coupling and one explicit tunable parameter; no new physical entities are postulated.

free parameters (1)

λ
Scalar that trades bias for variance reduction in the control-variate target; its value is chosen by the practitioner.

axioms (1)

Domain assumption: a pathwise affine relation between current and successor flows can be maintained at intermediate times without the time-t marginals satisfying the distributional Bellman equation for all t.
Stated as part of the source-consistent Bellman-coupled path construction.

invented entities (1)

source-consistent Bellman-coupled paths (no independent evidence)
purpose: To enforce coupling between current and successor return flows via shared base noise and affine relation
Newly introduced mechanism whose validity is assumed for the method.

pith-pipeline@v0.9.1-grok · 5750 in / 1465 out tokens · 38433 ms · 2026-06-30T23:05:57.919281+00:00 · methodology

