Regime-Based Portfolio Allocation Using Hidden Markov Models and Reinforcement Learning

Ajay Kumar Verma; Neo Paul Lesupi; Nunik Srikandi Putri

arxiv: 2605.27848 · v1 · pith:EWJWWZR7new · submitted 2026-05-27 · 💱 q-fin.PM · econ.EM · q-fin.CP · q-fin.MF · q-fin.ST

Pith reviewed 2026-06-29 09:24 UTC · model grok-4.3

classification 💱 q-fin.PM · econ.EM · q-fin.CP · q-fin.MF · q-fin.ST

keywords hidden Markov model · reinforcement learning · portfolio allocation · market regimes · Sharpe ratio · tactical asset allocation · drawdowns

The pith

Reinforcement learning applied to three-state HMM regimes delivers the highest Sharpe ratio and lowest drawdowns for equities, Treasuries and gold.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a framework that first fits a three-state Gaussian Hidden Markov Model to daily returns of SPY, TLT and GLD to label low-volatility, transitional and high-volatility regimes, then trains a reinforcement learning policy to choose allocations conditioned on the current regime. The central claim is that this HMM-RL combination produces superior risk-adjusted performance compared with a passive SPY benchmark and with simpler rule-based rotation strategies, while the decisions remain transparent because they are expressed as discrete regime-dependent actions. A sympathetic reader would care because the approach offers a concrete, testable method for making tactical asset allocation respond to changing market conditions without requiring opaque models. Results rest on 2004-2025 ETF data with a 30 percent out-of-sample window and a one-day execution lag.

Core claim

Estimating a three-state Gaussian HMM on the three ETFs identifies persistent regimes whose conditional return dynamics are economically distinct; an RL policy trained on these states then produces the strongest out-of-sample Sharpe ratio and materially lower drawdowns relative to both the passive benchmark and rule-based HMM allocations, with the performance advantage arising from regime-conditioned choices that favor SPY in stable periods and TLT or GLD in stressed periods.

What carries the argument

A three-state Gaussian Hidden Markov Model that supplies discrete regime labels to a reinforcement learning policy which maps each label to portfolio weights.
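To make the idea of a discrete regime-to-allocation map concrete, here is a minimal sketch. The paper's RL algorithm, reward, and action set are not described in this summary, so everything below is an assumption: a tabular, Q-learning-style value update with the regime as the state and a single-asset holding as the action, updated for all three actions at each step because every asset's next-day return is observed regardless of the position held. The function name `learn_regime_policy` and the toy returns are hypothetical and not taken from the paper.

```python
ASSETS = ["SPY", "TLT", "GLD"]

def learn_regime_policy(regimes, returns, n_regimes=3, alpha=0.1, gamma=0.9, epochs=20):
    """Tabular value learning: state = regime at t, action = asset held over t+1.

    Full-feedback variant of Q-learning (an assumption, not the paper's method):
    all actions are updated each step since every asset's return is observed.
    """
    q = [[0.0] * len(ASSETS) for _ in range(n_regimes)]
    for _ in range(epochs):
        for t in range(len(regimes) - 1):
            s, s_next = regimes[t], regimes[t + 1]
            bootstrap = gamma * max(q[s_next])  # computed once, before any update
            for a in range(len(ASSETS)):
                reward = returns[t + 1][a]      # realized one day after the signal
                q[s][a] += alpha * (reward + bootstrap - q[s][a])
    # Greedy, human-readable policy: one asset per regime.
    return {s: ASSETS[max(range(len(ASSETS)), key=lambda a: q[s][a])]
            for s in range(n_regimes)}

# Hypothetical toy data (NOT from the paper): next-day returns depend only on
# the regime observed today, ordered as (SPY, TLT, GLD).
TOY_RETURNS = {0: (0.0010, 0.0002, 0.0001),   # calm: equities lead
               1: (0.0001, 0.0003, 0.0006),   # transitional: gold leads
               2: (-0.0020, 0.0010, 0.0005)}  # stressed: Treasuries lead
toy_regimes = [0, 0, 0, 1, 1, 2, 2, 2, 1, 0, 0]
toy_returns = [(0.0, 0.0, 0.0)] + [TOY_RETURNS[s] for s in toy_regimes[:-1]]
policy = learn_regime_policy(toy_regimes, toy_returns)
print(policy)
```

The learned object is just a three-row lookup table, which is the sense in which a regime-conditioned policy of this kind stays interpretable.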

If this is right

HMM-based allocations, including the RL variant, outperform the passive SPY benchmark.
The RL policy records the highest Sharpe ratio and materially lower drawdowns.
The allocation rules remain fully interpretable through discrete regime-dependent actions.
Sensitivity checks confirm greater robustness for the three-state specification than for two-state alternatives.
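For reference, the two headline metrics in the list above are conventionally computed as in the sketch below. It assumes simple (not log) daily returns, 252 trading days per year, and a zero risk-free rate by default; the paper's exact conventions are not stated in this summary, and the function names are illustrative.

```python
import math

def sharpe_ratio(daily_returns, periods_per_year=252, risk_free_daily=0.0):
    """Annualized Sharpe ratio from daily returns.

    Assumes at least two observations and non-zero volatility.
    """
    excess = [r - risk_free_daily for r in daily_returns]
    mean = sum(excess) / len(excess)
    var = sum((x - mean) ** 2 for x in excess) / (len(excess) - 1)  # sample variance
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)

def max_drawdown(daily_returns):
    """Largest peak-to-trough loss of the compounded equity curve, as a positive fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in daily_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst
```

Both functions take a plain list of daily strategy returns, so the same code can score the passive benchmark, the rule-based rotation, and the RL policy on an identical test window.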

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same HMM-RL pairing could be tested on other asset classes or at higher frequencies to check whether regime persistence generalizes.
Because the policy stays interpretable, it supplies a transparent baseline against which more complex allocation models can be compared.
If regime labels prove stable, the framework could be extended to include additional signals while preserving the discrete-state structure.

Load-bearing premise

The three-state Gaussian HMM identifies persistent, economically meaningful regimes whose conditional return dynamics remain stable enough for the RL policy to generalize out of sample.

What would settle it

Post-2025 data in which the RL policy no longer records the highest Sharpe ratio, or in which the regimes lose persistence and the conditional return patterns shift substantially.

Figures

Figures reproduced from arXiv: 2605.27848 by Ajay Kumar Verma, Neo Paul Lesupi, Nunik Srikandi Putri.

**Figure 1.** Daily log returns of SPY, TLT, and GLD. This figure justifies why regime-switching is essential (volatility spikes = regime changes).

read the original abstract

This study develops a regime-aware portfolio allocation framework that integrates Markov switching models with Reinforcement Learning (RL) to dynamically allocate across equities (SPY), long-term Treasuries (TLT), and gold (GLD). Using daily ETF data from 2004-2025, we first characterize market behavior through a discrete Markov chain and then estimate a three-state Gaussian Hidden Markov Model (HMM) selected by the Bayesian Information Criterion (BIC). The estimated regimes (low-volatility, transitional, and high-volatility) exhibit strong persistence and state-dependent return dynamics consistent with recent findings on nonlinear market states (Ardia et al., 2024; Gupta & Pierdzioch, 2023). State-conditional analysis shows that SPY dominates in stable regimes, while TLT and GLD provide protection during stressed periods, motivating regime-conditioned allocation rules. We evaluate rule-based rotation and RL-driven strategies using a 30% out-of-sample test window with a one-day execution lag to avoid look-ahead bias. Both HMM-based allocations outperform a passive SPY benchmark, while the RL policy achieves the highest risk-adjusted performance, delivering the strongest Sharpe ratio and materially lower drawdowns, yet remains fully interpretable through discrete regime-dependent actions. Sensitivity analysis confirms the robustness of the three-state specification relative to two-state alternatives. Overall, the results demonstrate that RL can systematically enhance HMM-based regime detection, providing a transparent, adaptive, and empirically grounded framework for tactical asset allocation. The combined HMM-RL system provides a transparent, rules-based approach to tactical allocation that improves risk-adjusted performance relative to standard benchmark strategies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows a workable HMM-plus-RL allocation on three ETFs but the out-of-sample test is likely invalid because the HMM was probably fit on the full 2004-2025 sample.

The main things to know are that this paper gives a concrete, rules-based way to combine HMM regime detection with RL for allocating across SPY, TLT, and GLD, and that the claimed out-of-sample gains rest on shaky ground.

They estimate a three-state Gaussian HMM chosen by BIC on daily data from 2004-2025, label the states as low-vol, transitional, and high-vol, and show that equities dominate in the stable state while bonds and gold help in stressed ones. They then train an RL policy that learns regime-conditioned actions and test it against a passive benchmark and simple rotation rules on a 30% hold-out window with a one-day lag. The RL version reportedly delivers the best Sharpe and smaller drawdowns while staying interpretable.

This is a straightforward integration of two standard tools on a common asset set. The regime descriptions line up with other recent work on nonlinear market states, and the sensitivity check on two-state models is a small plus. The practical framing for tactical allocation is clear.

The soft spot is the sample handling. The abstract describes fitting the HMM on the entire period before splitting for the test window. If that is what happened, the decoded states and transition probabilities in the out-of-sample period contain future return information. That undercuts the one-day lag's purpose and makes the superior RL performance hard to interpret as genuine generalization. The abstract gives no numbers or error bars, so the size of any real improvement is also unclear.

If the full text shows the HMM was fit only on the training window, the concern disappears. Otherwise the central claim needs re-running.

The paper is aimed at quant practitioners who want an implementable regime-aware system. A reader working on applied portfolio methods would find the comparison useful. It deserves a serious referee to check the exact training split and any code or tables that clarify the numbers. I would send it for review with a request to confirm or correct the HMM estimation window.

Referee Report

1 major / 2 minor

Summary. The paper develops a regime-aware portfolio allocation framework that integrates a three-state Gaussian Hidden Markov Model (HMM), selected via BIC, with Reinforcement Learning (RL) to dynamically allocate across SPY, TLT, and GLD using 2004-2025 daily ETF data. Regimes are characterized as low-volatility, transitional, and high-volatility with state-dependent returns; rule-based and RL strategies are evaluated on a 30% out-of-sample window with one-day execution lag, claiming that the RL policy delivers the highest Sharpe ratio and materially lower drawdowns while remaining interpretable via discrete regime actions. Sensitivity analysis supports the three-state choice over two-state alternatives.

Significance. If the out-of-sample claims hold after correcting for potential regime contamination, the work would offer a transparent, rules-based HMM-RL hybrid for tactical allocation that improves risk-adjusted performance over passive benchmarks and builds on existing nonlinear market-state literature with explicit robustness checks.

major comments (1)

[Abstract] Abstract: the three-state Gaussian HMM is described as estimated on the full 2004-2025 sample before the 30% out-of-sample split is applied. If HMM parameters (transition matrix and emission distributions) are not refit exclusively on the training window, the decoded states supplied to the RL policy in the test period incorporate future return information. This directly threatens the central claim that the RL policy's superior Sharpe ratio and lower drawdowns reflect genuine generalization rather than in-sample regime knowledge, despite the one-day lag.
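The two safeguards at issue in this comment, a chronological split and a one-day execution lag, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the helper names are hypothetical, and signals are encoded as column indices into each day's tuple of asset returns.

```python
def chronological_split(n_obs, test_fraction=0.30):
    """Return (train_indices, test_indices) with the test window strictly after training."""
    cut = int(round(n_obs * (1.0 - test_fraction)))
    return range(0, cut), range(cut, n_obs)

def lagged_strategy_returns(signals, asset_returns):
    """Apply the signal observed on day t to the return realized on day t+1.

    signals[t] is a column index into asset_returns[t]. The first day has no
    prior signal, so it is skipped rather than filled.
    """
    return [asset_returns[t][signals[t - 1]] for t in range(1, len(signals))]

# Usage with 10 observations: indices 0..6 train, 7..9 test.
train_idx, test_idx = chronological_split(10)
```

The lag only guards against trading on a same-day signal. It does not help if the model producing the signal was itself estimated on data from the test window, which is the substance of the comment above.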

minor comments (2)

[Abstract] Abstract: no numerical performance values, confidence intervals, or error bars are reported for the claimed 'strongest Sharpe ratio' or 'materially lower drawdowns,' preventing quantitative assessment of economic significance relative to the SPY benchmark.
The manuscript states that 'sensitivity analysis confirms the robustness of the three-state specification' but provides no details on the alternative models tested, the metrics compared, or the outcome of those checks.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting a critical methodological issue concerning potential look-ahead bias in the HMM estimation. We address the comment directly below and commit to a revision that eliminates the concern while preserving the core contribution.

Referee: [Abstract] Abstract: the three-state Gaussian HMM is described as estimated on the full 2004-2025 sample before the 30% out-of-sample split is applied. If HMM parameters (transition matrix and emission distributions) are not refit exclusively on the training window, the decoded states supplied to the RL policy in the test period incorporate future return information. This directly threatens the central claim that the RL policy's superior Sharpe ratio and lower drawdowns reflect genuine generalization rather than in-sample regime knowledge, despite the one-day lag.

Authors: We agree that the current description and implementation risk introducing future information into out-of-sample state decoding. To correct this, we will (i) re-estimate all HMM parameters (transition probabilities and emission distributions) exclusively on the 70% training window, (ii) decode regimes in the 30% test window using only the training-fitted model, and (iii) re-run the RL training and all performance comparisons under this strictly causal protocol. The abstract, methodology, and results sections will be updated to document the revised procedure and the new out-of-sample metrics. We expect the qualitative conclusions to remain intact and will report the revised numbers.

revision: yes
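One way to realize step (ii) causally is forward filtering with the parameters frozen at their training-window values, so that the regime probability on day t uses only returns up to day t. Smoothing or Viterbi decoding over the whole test window would still draw on later test observations. The sketch below is a simplification under stated assumptions: it uses a univariate Gaussian emission (the paper models three assets jointly), and the parameter values in any usage are hypothetical placeholders rather than fitted estimates.

```python
import math

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def filter_regimes(returns, start_prob, trans, means, stds):
    """Causal (forward-only) regime probabilities P(state_t | returns up to t).

    All parameters are held fixed, e.g. at values fitted on the training window.
    trans[i][j] is the probability of moving from state i to state j.
    """
    k = len(start_prob)
    prior = list(start_prob)
    filtered = []
    for x in returns:
        # Bayes update with today's observation.
        joint = [prior[j] * gaussian_pdf(x, means[j], stds[j]) for j in range(k)]
        total = sum(joint)
        post = [p / total for p in joint]
        filtered.append(post)
        # One-step-ahead prediction becomes the prior for the next observation.
        prior = [sum(post[i] * trans[i][j] for i in range(k)) for j in range(k)]
    return filtered
```

Because each row depends only on past and current observations, appending new data never changes earlier rows, which is the property the revised protocol needs.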

Circularity Check

1 step flagged

HMM parameters estimated on full 2004-2025 sample, so out-of-sample regime labels and RL performance are fitted inputs

specific steps

fitted input called prediction [Abstract]
"Using daily ETF data from 2004-2025, we first characterize market behavior through a discrete Markov chain and then estimate a three-state Gaussian Hidden Markov Model (HMM) selected by the Bayesian Information Criterion (BIC). ... We evaluate rule-based rotation and RL-driven strategies using a 30% out-of-sample test window with a one-day execution lag to avoid look-ahead bias."

HMM estimation occurs on the full sample that includes the 30% test window. Consequently the state sequence decoded for the test period (and fed to the RL policy) is conditioned on returns that occur after the test start date, so the measured Sharpe ratio and drawdown reductions are statistically forced by the in-sample fit rather than generated by an independent out-of-sample regime process.

full rationale

The paper's central claim is that the RL policy delivers superior out-of-sample Sharpe and drawdowns on a held-out 30% window. However, the three-state Gaussian HMM is explicitly estimated on the entire 2004-2025 dataset before the split; decoded states and transition probabilities supplied to the RL agent in the test window therefore embed future return information. This reduces the reported 'prediction' of risk-adjusted performance to a quantity whose inputs already contain the test-period data, violating the one-day lag's intent to prevent look-ahead bias. The step matches the 'fitted_input_called_prediction' pattern exactly.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the validity of a three-state Gaussian HMM whose parameters are estimated from returns and on the assumption that the RL policy learned on historical regime sequences will continue to produce superior out-of-sample performance.

free parameters (2)

number of hidden states
Chosen as three after BIC comparison; directly determines the regime labels used by the RL agent.
HMM transition and emission parameters
Maximum-likelihood estimates fitted to the 2004-2025 daily returns; control the state sequence fed to the allocator.
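The BIC comparison behind the state count trades log-likelihood against the number of free parameters. The sketch below counts those parameters for a Gaussian HMM and applies the standard BIC formula. It assumes full covariance matrices per state, which this summary does not confirm, and the function names are illustrative.

```python
import math

def gaussian_hmm_param_count(n_states, n_assets):
    """Free parameters of a Gaussian HMM with full covariance matrices (an assumption)."""
    initial = n_states - 1                       # start probabilities sum to 1
    transitions = n_states * (n_states - 1)      # each transition row sums to 1
    means = n_states * n_assets
    covariances = n_states * n_assets * (n_assets + 1) // 2  # symmetric matrices
    return initial + transitions + means + covariances

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion; lower is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)
```

Under that assumption, a three-state model on three assets has 35 free parameters against 21 for a two-state model, so the three-state fit has to raise the log-likelihood by enough to offset an extra penalty of 14 ln(n), where n is the number of observations.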

axioms (2)

domain assumption: Returns are generated by a hidden Markov process with Gaussian emissions
Invoked when fitting the three-state model and when interpreting regime-conditional means and variances.
domain assumption: Market regimes exhibit sufficient persistence for one-day-ahead allocation to be useful
Required for the RL policy to translate detected states into actionable weights without excessive turnover.
