pith. sign in

arxiv: 2607.02315 · v1 · pith:VF2EMNYDnew · submitted 2026-07-02 · 💻 cs.NE

Hybridizing a Grouping Metaheuristic with Reinforcement Learning for the One-Dimensional Bin Packing Problem

Pith reviewed 2026-07-03 02:44 UTC · model grok-4.3

classification 💻 cs.NE
keywords one-dimensional bin packinghybrid genetic algorithmQ-learningreinforcement learningoperator selectioncombinatorial optimizationmetaheuristic
0
0 comments X

The pith

A Q-learning agent selecting operators for a grouping genetic algorithm solves bin packing instances with near-baseline quality at 50 times lower runtime.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RL-HGGA, which adds a tabular Q-learning controller to Falkenauer's Hybrid Grouping Genetic Algorithm for the one-dimensional bin packing problem. The controller replaces fixed operator probabilities with dynamic selection among eight macro-actions, guided by an eight-dimensional state vector that tracks generation progress, stagnation, optimality gap, fitness statistics, and bin fill rates. On standard benchmark families the method reaches an average optimality gap of 0.95 percent, close to the 0.75 percent of the original HGGA, while reducing mean computation time from 64.22 seconds to 1.29 seconds.

Core claim

RL-HGGA integrates a Q-learning agent that chooses among eight macro-actions (BPCX crossover, light and heavy mutation, Martello-Toth local search, and population restart) according to an eight-dimensional state vector encoding generation progress, stagnation level, optimality gap, average fitness, population variance, and average bin fill rate. The agent is trained with an epsilon-greedy policy over 400 episodes. On the Falkenauer T/U, Scholl 1-3, and Hard28 families this yields an average optimality gap of 0.95 percent while cutting mean runtime by a factor of fifty relative to the static HGGA baseline.

What carries the argument

The tabular Q-learning agent that maps an eight-dimensional state vector of search dynamics to selection among eight macro-actions for adaptive operator control.

If this is right

  • Learned adaptive operator selection can replace static probabilities while preserving solution quality on standard 1D-BPP benchmarks.
  • The hybrid method reduces mean computation time from 64.22 s to 1.29 s across the evaluated families.
  • The same state-action design yields an average optimality gap of 0.95 percent, competitive with the 0.75 percent gap of the unmodified HGGA.
  • Epsilon-greedy training over 400 episodes suffices to learn effective control policies for these instance sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reinforcement-learning controller could be attached to other grouping genetic algorithms for related packing or partitioning problems.
  • If the eight state dimensions prove insufficient on larger or differently distributed instances, the state vector would need explicit extension.
  • The observed fifty-fold speedup may allow practical solution of instances too large for the original HGGA under fixed time limits.

Load-bearing premise

The eight-dimensional state vector and eight macro-actions capture all decision-relevant aspects of the search dynamics across the tested instance distributions.

What would settle it

Evaluating RL-HGGA against HGGA on a fresh collection of bin-packing instances drawn from a different distribution and finding that the optimality-gap or runtime advantage disappears would falsify the performance claims.

Figures

Figures reproduced from arXiv: 2607.02315 by Badaoui Ikram, Bousdjira Nadine, Hasnaoui Sarah, Mostefai Mounir Sofiane, Tati Youcef, Zitouni Rania.

Figure 1
Figure 1. Figure 1: Detailed architecture of RL-HGGA and placement of the ML component within the evolutionary loop. [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Average gap to the lower bound per family: comparison between FFD, HGGA, and RL-HGGA. Lower values indicate better solutions. [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Average number of bins used per family: comparison between FFD, HGGA, and RL-HGGA. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of average execution times per family between FFD, HGGA, and RL-HGGA. The logarithmic scale allows visualization of large time [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: RL-HGGA training progression: evolution of the training gap and [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Distribution of macro-actions learned by RL-HGGA under the [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Comparison of convergence behavior between HGGA and RL [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
read the original abstract

The one-dimensional bin packing problem (1D-BPP) is a canonical NP-hard combinatorial optimization problem with broad industrial applications. We propose RL-HGGA, a hybrid algorithm that integrates Falkenauer's Hybrid Grouping Genetic Algorithm (HGGA) with a tabular Q-learning controller. Rather than applying genetic operators at fixed probabilities, a Q-learning agent dynamically selects among eight macro-actions -- including BPCX crossover, light and heavy mutation, Martello-Toth local search, and population restart -- based on an eight-dimensional state representation encoding generation progress, stagnation level, optimality gap, average fitness, population variance, and average bin fill rate. The agent is trained with an epsilon-greedy policy over 400 episodes, with epsilon decaying to 0.05. Experiments on standard benchmark families (Falkenauer T/U, Scholl 1-3, Hard28) show that RL-HGGA achieves an average optimality gap of 0.95% -- competitive with HGGA (0.75%) and well below FFD (2.47%) -- while reducing mean computation time from 64.22 s to 1.29 s, a 50x speedup. These results demonstrate that learned adaptive operator selection can achieve near-HGGA solution quality at a fraction of the computational cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes RL-HGGA, which augments Falkenauer's Hybrid Grouping Genetic Algorithm (HGGA) with a tabular Q-learning agent that dynamically chooses among eight macro-actions (BPCX crossover, light/heavy mutation, Martello-Toth local search, restart, etc.) conditioned on an eight-dimensional state vector (generation progress, stagnation, optimality gap, average fitness, population variance, average bin fill rate, plus two unspecified components). Trained for 400 epsilon-greedy episodes (epsilon decaying to 0.05), the hybrid is evaluated on the Falkenauer T/U, Scholl 1-3, and Hard28 benchmark families, where it reports a mean optimality gap of 0.95 % (versus 0.75 % for HGGA and 2.47 % for FFD) together with a reduction in mean wall-clock time from 64.22 s to 1.29 s.

Significance. If the empirical claims survive statistical verification and the state representation is shown to be robust, the work supplies concrete evidence that learned adaptive operator selection can preserve near-HGGA solution quality while delivering roughly 50-fold wall-clock savings on standard 1D-BPP instances. Such a result would be of interest to the metaheuristics community as a practical demonstration of RL-driven hybridization for grouping problems.

major comments (3)
  1. [Abstract] Abstract: the headline performance figures (0.95 % gap, 50× speedup) are presented as point estimates with no accompanying variance, standard deviations, or statistical significance tests across the benchmark suites. This omission leaves the central claim that RL-HGGA is “competitive with HGGA” only moderately supported.
  2. [Abstract] Abstract (and implied Experiments section): the eight-dimensional state vector is described with two unspecified features, and the manuscript supplies neither an ablation that isolates the contribution of individual state components nor a transfer experiment on instance distributions outside the four named suites. Because the reported speed/quality trade-off rests on the Q-agent learning a reliable policy from precisely this state, the absence of such validation directly weakens the modeling assumption identified as weakest.
  3. [Abstract] Abstract: training-instance overlap with the test benchmarks is not disclosed. If any of the 400 training episodes used instances from Falkenauer T/U, Scholl, or Hard28, the generalization claim cannot be evaluated from the reported numbers alone.
minor comments (1)
  1. [Abstract] The two unspecified components of the state vector should be named and their definitions supplied in the methods section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the empirical presentation, state representation details, and training disclosure. We address each major comment below, indicating revisions where the manuscript will be updated.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the headline performance figures (0.95 % gap, 50× speedup) are presented as point estimates with no accompanying variance, standard deviations, or statistical significance tests across the benchmark suites. This omission leaves the central claim that RL-HGGA is “competitive with HGGA” only moderately supported.

    Authors: We agree that variance measures and statistical tests would strengthen the central claims. The revised manuscript will report standard deviations for optimality gaps and runtimes across all benchmark families, together with paired statistical tests (e.g., Wilcoxon signed-rank) comparing RL-HGGA against HGGA and FFD. revision: yes

  2. Referee: [Abstract] Abstract (and implied Experiments section): the eight-dimensional state vector is described with two unspecified features, and the manuscript supplies neither an ablation that isolates the contribution of individual state components nor a transfer experiment on instance distributions outside the four named suites. Because the reported speed/quality trade-off rests on the Q-agent learning a reliable policy from precisely this state, the absence of such validation directly weakens the modeling assumption identified as weakest.

    Authors: We will explicitly enumerate all eight state features in both the abstract and the methods section. An ablation study isolating each component and transfer experiments on additional distributions were not performed in the original submission; we will add a concise discussion of the state design rationale and note the absence of these analyses as a limitation of the current evaluation. revision: partial

  3. Referee: [Abstract] Abstract: training-instance overlap with the test benchmarks is not disclosed. If any of the 400 training episodes used instances from Falkenauer T/U, Scholl, or Hard28, the generalization claim cannot be evaluated from the reported numbers alone.

    Authors: We will add an explicit statement in the experimental setup describing the instances used for the 400 training episodes and their relationship to the test benchmarks, thereby allowing readers to evaluate the generalization claim directly from the reported results. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on direct benchmark comparisons

full rationale

The paper reports experimental results from training a Q-learning agent on 400 episodes and evaluating RL-HGGA against published baselines (HGGA, FFD) on standard benchmark sets (Falkenauer, Scholl, Hard28). The optimality gaps and runtimes are measured outputs, not quantities obtained by fitting a parameter to a subset and then reporting it as a prediction. No equations, uniqueness theorems, or self-citations are invoked to derive the central performance numbers; the 8D state and 8 macro-actions are explicit design choices whose adequacy is an empirical modeling assumption rather than a self-referential reduction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

Ledger entries are inferred from the abstract description only; full paper may contain additional fitted quantities or assumptions.

free parameters (2)
  • epsilon decay target = 0.05
    Final epsilon value of 0.05 chosen for training schedule.
  • training episodes = 400
    Number of episodes used to train the Q-agent.
axioms (1)
  • domain assumption The chosen eight-dimensional state vector is sufficient to distinguish decision-relevant search states.
    Invoked by the design of the RL controller.

pith-pipeline@v0.9.1-grok · 5793 in / 1298 out tokens · 30933 ms · 2026-07-03T02:44:57.374055+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

11 extracted references · 1 canonical work pages

  1. [1]

    M. R. Garey and D. S. Johnson,Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979

  2. [2]

    Worst-case performance bounds for simple one-dimensional packing algorithms,

    D. S. Johnson, A. Demers, J. D. Ullman, M. R. Garey, and R. L. Graham, “Worst-case performance bounds for simple one-dimensional packing algorithms,”SIAM Journal on Computing, vol. 3, no. 4, pp. 299–325, 1974

  3. [3]

    A hybrid grouping genetic algorithm for bin packing,

    E. Falkenauer, “A hybrid grouping genetic algorithm for bin packing,” Journal of Heuristics, vol. 2, pp. 5–30, 1996

  4. [4]

    Martello and P

    S. Martello and P. Toth,Knapsack Problems: Algorithms and Computer Implementations. Wiley, 1990

  5. [5]

    R. S. Sutton and A. G. Barto,Reinforcement Learning: An Introduction, 2nd ed. MIT Press, 2018

  6. [6]

    Q-learning,

    C. J. C. H. Watkins and P. Dayan, “Q-learning,”Machine Learning, vol. 8, pp. 279–292, 1992

  7. [7]

    Hyper-heuristics: A survey of the state of the art,

    E. K. Burke, M. Gendreau, M. Hyde, G. Kendall, G. Ochoa, S. Ozcan, and R. Qu, “Hyper-heuristics: A survey of the state of the art,”European Journal of Operational Research, vol. 229, no. 3, pp. 329–341, 2013

  8. [8]

    Machine learning for combinato- rial optimization: a methodological tour d’horizon,

    Y . Bengio, A. Lodi, and A. Prouvost, “Machine learning for combinato- rial optimization: a methodological tour d’horizon,”European Journal of Operational Research, vol. 290, no. 2, pp. 405–421, 2021

  9. [9]

    Learning combinatorial optimization algorithms over graphs,

    E. Khalil, H. Dai, Y . Zhang, B. Dilkina, and L. Song, “Learning combinatorial optimization algorithms over graphs,” inAdvances in Neural Information Processing Systems, 2017

  10. [10]

    Metaheuristic approaches for one-dimensional bin packing problem: A comparative performance study,

    C. Munien, S. Mahabeer, E. Dzitiro, S. Singh, S. Zungu, and A. E.- S. Ezugwu, “Metaheuristic approaches for one-dimensional bin packing problem: A comparative performance study,”IEEE Access, vol. 8, pp. 227438–227465, 2020, doi: 10.1109/ACCESS.2020.3046185

  11. [11]

    BPPLIB: a library for bin packing and cutting stock problems,

    M. Delorme, M. Iori, and S. Martello, “BPPLIB: a library for bin packing and cutting stock problems,”Optimization Letters, vol. 12, pp. 235–250, 2018