Regime-Based Portfolio Allocation Using Hidden Markov Models and Reinforcement Learning
Pith reviewed 2026-06-29 09:24 UTC · model grok-4.3
The pith
Reinforcement learning applied to three-state HMM regimes delivers the highest Sharpe ratio and lowest drawdowns for equities, Treasuries and gold.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Estimating a three-state Gaussian HMM on the three ETFs identifies persistent regimes whose conditional return dynamics are economically distinct; an RL policy trained on these states then produces the strongest out-of-sample Sharpe ratio and materially lower drawdowns relative to both the passive benchmark and rule-based HMM allocations, with the performance advantage arising from regime-conditioned choices that favor SPY in stable periods and TLT or GLD in stressed periods.
What carries the argument
A three-state Gaussian Hidden Markov Model that supplies discrete regime labels to a reinforcement learning policy which maps each label to portfolio weights.
If this is right
- HMM-based allocations, including the RL variant, outperform the passive SPY benchmark.
- The RL policy records the highest Sharpe ratio and materially lower drawdowns.
- The allocation rules remain fully interpretable through discrete regime-dependent actions.
- Sensitivity checks confirm greater robustness for the three-state specification than for two-state alternatives.
Where Pith is reading between the lines
- The same HMM-RL pairing could be tested on other asset classes or at higher frequencies to check whether regime persistence generalizes.
- Because the policy stays interpretable, it supplies a transparent baseline against which more complex allocation models can be compared.
- If regime labels prove stable, the framework could be extended to include additional signals while preserving the discrete-state structure.
Load-bearing premise
The three-state Gaussian HMM identifies persistent, economically meaningful regimes whose conditional return dynamics remain stable enough for the RL policy to generalize out of sample.
What would settle it
In data after 2025 the RL policy no longer records the highest Sharpe ratio or the regimes lose persistence and the conditional return patterns shift substantially.
Figures
read the original abstract
This study develops a regime-aware portfolio allocation framework that integrates Markov switching models with Reinforcement Learning (RL) to dynamically allocate across equities (SPY), long-term Treasuries (TLT), and gold (GLD). Using daily ETF data from 2004-2025, we first characterize market behavior through a discrete Markov chain and then estimate a three-state Gaussian Hidden Markov Model (HMM) selected by the Bayesian Information Criterion (BIC). The estimated regimes-low-volatility, transitional, and high-volatility-exhibit strong persistence and state-dependent return dynamics consistent with recent findings on nonlinear market states (Ardia et al., 2024; Gupta & Pierdzioch, 2023). State-conditional analysis shows that SPY dominates in stable regimes, while TLT and GLD provide protection during stressed periods, motivating regime-conditioned allocation rules. We evaluate rule-based rotation and RL-driven strategies using a 30% out-of-sample test window with a one-day execution lag to avoid look-ahead bias. Both HMM-based allocations outperform a passive SPY benchmark, while the RL policy achieves the highest risk-adjusted performance, delivering the strongest Sharpe ratio and materially lower drawdowns, yet remains fully interpretable through discrete regime-dependent actions. Sensitivity analysis confirms the robustness of the three-state specification relative to two-state alternatives. Overall, the results demonstrate that RL can systematically enhance HMM-based regime detection, providing a transparent, adaptive, and empirically grounded framework for tactical asset allocation. The combined HMM-RL system provides a transparent, rules-based approach to tactical allocation that improves risk-adjusted performance relative to standard benchmark strategies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a regime-aware portfolio allocation framework that integrates a three-state Gaussian Hidden Markov Model (HMM), selected via BIC, with Reinforcement Learning (RL) to dynamically allocate across SPY, TLT, and GLD using 2004-2025 daily ETF data. Regimes are characterized as low-volatility, transitional, and high-volatility with state-dependent returns; rule-based and RL strategies are evaluated on a 30% out-of-sample window with one-day execution lag, claiming that the RL policy delivers the highest Sharpe ratio and materially lower drawdowns while remaining interpretable via discrete regime actions. Sensitivity analysis supports the three-state choice over two-state alternatives.
Significance. If the out-of-sample claims hold after correcting for potential regime contamination, the work would offer a transparent, rules-based HMM-RL hybrid for tactical allocation that improves risk-adjusted performance over passive benchmarks and builds on existing nonlinear market-state literature with explicit robustness checks.
major comments (1)
- [Abstract] Abstract: the three-state Gaussian HMM is described as estimated on the full 2004-2025 sample before the 30% out-of-sample split is applied. If HMM parameters (transition matrix and emission distributions) are not refit exclusively on the training window, the decoded states supplied to the RL policy in the test period incorporate future return information. This directly threatens the central claim that the RL policy's superior Sharpe ratio and lower drawdowns reflect genuine generalization rather than in-sample regime knowledge, despite the one-day lag.
minor comments (2)
- [Abstract] Abstract: no numerical performance values, confidence intervals, or error bars are reported for the claimed 'strongest Sharpe ratio' or 'materially lower drawdowns,' preventing quantitative assessment of economic significance relative to the SPY benchmark.
- The manuscript states that 'sensitivity analysis confirms the robustness of the three-state specification' but provides no details on the alternative models tested, the metrics compared, or the outcome of those checks.
Simulated Author's Rebuttal
We thank the referee for highlighting a critical methodological issue concerning potential look-ahead bias in the HMM estimation. We address the comment directly below and commit to a revision that eliminates the concern while preserving the core contribution.
read point-by-point responses
-
Referee: [Abstract] Abstract: the three-state Gaussian HMM is described as estimated on the full 2004-2025 sample before the 30% out-of-sample split is applied. If HMM parameters (transition matrix and emission distributions) are not refit exclusively on the training window, the decoded states supplied to the RL policy in the test period incorporate future return information. This directly threatens the central claim that the RL policy's superior Sharpe ratio and lower drawdowns reflect genuine generalization rather than in-sample regime knowledge, despite the one-day lag.
Authors: We agree that the current description and implementation risk introducing future information into out-of-sample state decoding. To correct this, we will (i) re-estimate all HMM parameters (transition probabilities and emission distributions) exclusively on the 70% training window, (ii) decode regimes in the 30% test window using only the training-fitted model, and (iii) re-run the RL training and all performance comparisons under this strictly causal protocol. The abstract, methodology, and results sections will be updated to document the revised procedure and the new out-of-sample metrics. We expect the qualitative conclusions to remain intact, but the revised numbers will be reported. revision: yes
Circularity Check
HMM parameters estimated on full 2004-2025 sample, so out-of-sample regime labels and RL performance are fitted inputs
specific steps
-
fitted input called prediction
[Abstract]
"Using daily ETF data from 2004-2025, we first characterize market behavior through a discrete Markov chain and then estimate a three-state Gaussian Hidden Markov Model (HMM) selected by the Bayesian Information Criterion (BIC). ... We evaluate rule-based rotation and RL-driven strategies using a 30% out-of-sample test window with a one-day execution lag to avoid look-ahead bias."
HMM estimation occurs on the full sample that includes the 30% test window. Consequently the state sequence decoded for the test period (and fed to the RL policy) is conditioned on returns that occur after the test start date, so the measured Sharpe ratio and drawdown reductions are statistically forced by the in-sample fit rather than generated by an independent out-of-sample regime process.
full rationale
The paper's central claim is that the RL policy delivers superior out-of-sample Sharpe and drawdowns on a held-out 30% window. However, the three-state Gaussian HMM is explicitly estimated on the entire 2004-2025 dataset before the split; decoded states and transition probabilities supplied to the RL agent in the test window therefore embed future return information. This reduces the reported 'prediction' of risk-adjusted performance to a quantity whose inputs already contain the test-period data, violating the one-day lag's intent to prevent look-ahead bias. The step matches the 'fitted_input_called_prediction' pattern exactly.
Axiom & Free-Parameter Ledger
free parameters (2)
- number of hidden states
- HMM transition and emission parameters
axioms (2)
- domain assumption Returns are generated by a hidden Markov process with Gaussian emissions
- domain assumption Market regimes exhibit sufficient persistence for one-day-ahead allocation to be useful
Reference graph
Works this paper leans on
-
[1]
Baur, D., & Lucey, B. (2010). Is Gold a Safe Haven? International Evidence. The Financial Review. Bellman, Richard E. Dynamic Programming. Princeton University Press,
2010
-
[2]
Reinforcement Learning for Financial Portfolios: An Overview
Charpentier, Arthur, Mathieu Laurière, and Quentin Sabatelli. “Reinforcement Learning for Financial Portfolios: An Overview.” arXiv:2104.02867,
-
[3]
Modelling Volatility Clustering and Regime Switching in Financial Markets
Enow, S. T. Exploring Volatility Clustering Financial Markets and Its Implication . Journal of Economic and Social Development: Resilient Society, 2023 Enow, S. T., & Ndlovu, E. “Modelling Volatility Clustering and Regime Switching in Financial Markets.” Journal of Risk and Financial Management, vol. 16, no. 1,
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.