Hybridizing a Grouping Metaheuristic with Reinforcement Learning for the One-Dimensional Bin Packing Problem

Badaoui Ikram; Bousdjira Nadine; Hasnaoui Sarah; Mostefai Mounir Sofiane; Tati Youcef; Zitouni Rania

arxiv: 2607.02315 · v1 · pith:VF2EMNYDnew · submitted 2026-07-02 · 💻 cs.NE

Hybridizing a Grouping Metaheuristic with Reinforcement Learning for the One-Dimensional Bin Packing Problem

Zitouni Rania , Mostefai Mounir Sofiane , Tati Youcef , Badaoui Ikram , Bousdjira Nadine , Hasnaoui Sarah This is my paper

Pith reviewed 2026-07-03 02:44 UTC · model grok-4.3

classification 💻 cs.NE

keywords one-dimensional bin packinghybrid genetic algorithmQ-learningreinforcement learningoperator selectioncombinatorial optimizationmetaheuristic

0 comments

The pith

A Q-learning agent selecting operators for a grouping genetic algorithm solves bin packing instances with near-baseline quality at 50 times lower runtime.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RL-HGGA, which adds a tabular Q-learning controller to Falkenauer's Hybrid Grouping Genetic Algorithm for the one-dimensional bin packing problem. The controller replaces fixed operator probabilities with dynamic selection among eight macro-actions, guided by an eight-dimensional state vector that tracks generation progress, stagnation, optimality gap, fitness statistics, and bin fill rates. On standard benchmark families the method reaches an average optimality gap of 0.95 percent, close to the 0.75 percent of the original HGGA, while reducing mean computation time from 64.22 seconds to 1.29 seconds.

Core claim

RL-HGGA integrates a Q-learning agent that chooses among eight macro-actions (BPCX crossover, light and heavy mutation, Martello-Toth local search, and population restart) according to an eight-dimensional state vector encoding generation progress, stagnation level, optimality gap, average fitness, population variance, and average bin fill rate. The agent is trained with an epsilon-greedy policy over 400 episodes. On the Falkenauer T/U, Scholl 1-3, and Hard28 families this yields an average optimality gap of 0.95 percent while cutting mean runtime by a factor of fifty relative to the static HGGA baseline.

What carries the argument

The tabular Q-learning agent that maps an eight-dimensional state vector of search dynamics to selection among eight macro-actions for adaptive operator control.

If this is right

Learned adaptive operator selection can replace static probabilities while preserving solution quality on standard 1D-BPP benchmarks.
The hybrid method reduces mean computation time from 64.22 s to 1.29 s across the evaluated families.
The same state-action design yields an average optimality gap of 0.95 percent, competitive with the 0.75 percent gap of the unmodified HGGA.
Epsilon-greedy training over 400 episodes suffices to learn effective control policies for these instance sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reinforcement-learning controller could be attached to other grouping genetic algorithms for related packing or partitioning problems.
If the eight state dimensions prove insufficient on larger or differently distributed instances, the state vector would need explicit extension.
The observed fifty-fold speedup may allow practical solution of instances too large for the original HGGA under fixed time limits.

Load-bearing premise

The eight-dimensional state vector and eight macro-actions capture all decision-relevant aspects of the search dynamics across the tested instance distributions.

What would settle it

Evaluating RL-HGGA against HGGA on a fresh collection of bin-packing instances drawn from a different distribution and finding that the optimality-gap or runtime advantage disappears would falsify the performance claims.

Figures

Figures reproduced from arXiv: 2607.02315 by Badaoui Ikram, Bousdjira Nadine, Hasnaoui Sarah, Mostefai Mounir Sofiane, Tati Youcef, Zitouni Rania.

**Figure 2.** Figure 2: Average gap to the lower bound per family: comparison between FFD, HGGA, and RL-HGGA. Lower values indicate better solutions. [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Average number of bins used per family: comparison between FFD, HGGA, and RL-HGGA. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Comparison of average execution times per family between FFD, HGGA, and RL-HGGA. The logarithmic scale allows visualization of large time [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: RL-HGGA training progression: evolution of the training gap and [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Distribution of macro-actions learned by RL-HGGA under the [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: Comparison of convergence behavior between HGGA and RL [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

read the original abstract

The one-dimensional bin packing problem (1D-BPP) is a canonical NP-hard combinatorial optimization problem with broad industrial applications. We propose RL-HGGA, a hybrid algorithm that integrates Falkenauer's Hybrid Grouping Genetic Algorithm (HGGA) with a tabular Q-learning controller. Rather than applying genetic operators at fixed probabilities, a Q-learning agent dynamically selects among eight macro-actions -- including BPCX crossover, light and heavy mutation, Martello-Toth local search, and population restart -- based on an eight-dimensional state representation encoding generation progress, stagnation level, optimality gap, average fitness, population variance, and average bin fill rate. The agent is trained with an epsilon-greedy policy over 400 episodes, with epsilon decaying to 0.05. Experiments on standard benchmark families (Falkenauer T/U, Scholl 1-3, Hard28) show that RL-HGGA achieves an average optimality gap of 0.95% -- competitive with HGGA (0.75%) and well below FFD (2.47%) -- while reducing mean computation time from 64.22 s to 1.29 s, a 50x speedup. These results demonstrate that learned adaptive operator selection can achieve near-HGGA solution quality at a fraction of the computational cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RL-HGGA shows a 50x runtime cut on standard bin-packing sets by letting Q-learning pick among eight operators, but the 8D state may miss key search signals and the results lack variance or tests.

read the letter

The main takeaway is that this hybrid gets solution quality within 0.2% of plain HGGA on Falkenauer, Scholl, and Hard28 instances while dropping average time from 64 s to 1.3 s. That speed-up comes from training a tabular Q-agent over 400 episodes to choose among BPCX, two mutation strengths, Martello-Toth search, and restart instead of using fixed probabilities.

What is new is the exact eight-dimensional state vector (generation progress, stagnation, gap, fitness, variance, fill rate, plus two more) paired with those eight macro-actions inside the existing HGGA framework. The paper does a clean job of describing the controller and running it on the usual benchmark families, which lets readers see the claimed trade-off directly.

The soft spots are the missing pieces that would make the performance claim solid. There are no standard deviations, no statistical tests, and no ablation that checks whether dropping any state component hurts the policy. The stress-test note is right that the state may not capture residual item-size distribution or bin-structure details that matter for operator choice; without transfer tests or sensitivity checks, it is hard to know how far the learned policy generalizes. Training-instance overlap with the test sets is also not spelled out.

This is the kind of practical engineering note that belongs in an applied evolutionary-computation venue. A reader who already works on operator selection or on bin-packing heuristics will get a usable data point and a concrete state-action design to build on. It does not change theory, but the empirical comparison is clear enough to deserve referee time once the statistical reporting and a short ablation are added.

Referee Report

3 major / 1 minor

Summary. The paper proposes RL-HGGA, which augments Falkenauer's Hybrid Grouping Genetic Algorithm (HGGA) with a tabular Q-learning agent that dynamically chooses among eight macro-actions (BPCX crossover, light/heavy mutation, Martello-Toth local search, restart, etc.) conditioned on an eight-dimensional state vector (generation progress, stagnation, optimality gap, average fitness, population variance, average bin fill rate, plus two unspecified components). Trained for 400 epsilon-greedy episodes (epsilon decaying to 0.05), the hybrid is evaluated on the Falkenauer T/U, Scholl 1-3, and Hard28 benchmark families, where it reports a mean optimality gap of 0.95 % (versus 0.75 % for HGGA and 2.47 % for FFD) together with a reduction in mean wall-clock time from 64.22 s to 1.29 s.

Significance. If the empirical claims survive statistical verification and the state representation is shown to be robust, the work supplies concrete evidence that learned adaptive operator selection can preserve near-HGGA solution quality while delivering roughly 50-fold wall-clock savings on standard 1D-BPP instances. Such a result would be of interest to the metaheuristics community as a practical demonstration of RL-driven hybridization for grouping problems.

major comments (3)

[Abstract] Abstract: the headline performance figures (0.95 % gap, 50× speedup) are presented as point estimates with no accompanying variance, standard deviations, or statistical significance tests across the benchmark suites. This omission leaves the central claim that RL-HGGA is “competitive with HGGA” only moderately supported.
[Abstract] Abstract (and implied Experiments section): the eight-dimensional state vector is described with two unspecified features, and the manuscript supplies neither an ablation that isolates the contribution of individual state components nor a transfer experiment on instance distributions outside the four named suites. Because the reported speed/quality trade-off rests on the Q-agent learning a reliable policy from precisely this state, the absence of such validation directly weakens the modeling assumption identified as weakest.
[Abstract] Abstract: training-instance overlap with the test benchmarks is not disclosed. If any of the 400 training episodes used instances from Falkenauer T/U, Scholl, or Hard28, the generalization claim cannot be evaluated from the reported numbers alone.

minor comments (1)

[Abstract] The two unspecified components of the state vector should be named and their definitions supplied in the methods section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the empirical presentation, state representation details, and training disclosure. We address each major comment below, indicating revisions where the manuscript will be updated.

read point-by-point responses

Referee: [Abstract] Abstract: the headline performance figures (0.95 % gap, 50× speedup) are presented as point estimates with no accompanying variance, standard deviations, or statistical significance tests across the benchmark suites. This omission leaves the central claim that RL-HGGA is “competitive with HGGA” only moderately supported.

Authors: We agree that variance measures and statistical tests would strengthen the central claims. The revised manuscript will report standard deviations for optimality gaps and runtimes across all benchmark families, together with paired statistical tests (e.g., Wilcoxon signed-rank) comparing RL-HGGA against HGGA and FFD. revision: yes
Referee: [Abstract] Abstract (and implied Experiments section): the eight-dimensional state vector is described with two unspecified features, and the manuscript supplies neither an ablation that isolates the contribution of individual state components nor a transfer experiment on instance distributions outside the four named suites. Because the reported speed/quality trade-off rests on the Q-agent learning a reliable policy from precisely this state, the absence of such validation directly weakens the modeling assumption identified as weakest.

Authors: We will explicitly enumerate all eight state features in both the abstract and the methods section. An ablation study isolating each component and transfer experiments on additional distributions were not performed in the original submission; we will add a concise discussion of the state design rationale and note the absence of these analyses as a limitation of the current evaluation. revision: partial
Referee: [Abstract] Abstract: training-instance overlap with the test benchmarks is not disclosed. If any of the 400 training episodes used instances from Falkenauer T/U, Scholl, or Hard28, the generalization claim cannot be evaluated from the reported numbers alone.

Authors: We will add an explicit statement in the experimental setup describing the instances used for the 400 training episodes and their relationship to the test benchmarks, thereby allowing readers to evaluate the generalization claim directly from the reported results. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical performance claims rest on direct benchmark comparisons

full rationale

The paper reports experimental results from training a Q-learning agent on 400 episodes and evaluating RL-HGGA against published baselines (HGGA, FFD) on standard benchmark sets (Falkenauer, Scholl, Hard28). The optimality gaps and runtimes are measured outputs, not quantities obtained by fitting a parameter to a subset and then reporting it as a prediction. No equations, uniqueness theorems, or self-citations are invoked to derive the central performance numbers; the 8D state and 8 macro-actions are explicit design choices whose adequacy is an empirical modeling assumption rather than a self-referential reduction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

Ledger entries are inferred from the abstract description only; full paper may contain additional fitted quantities or assumptions.

free parameters (2)

epsilon decay target = 0.05
Final epsilon value of 0.05 chosen for training schedule.
training episodes = 400
Number of episodes used to train the Q-agent.

axioms (1)

domain assumption The chosen eight-dimensional state vector is sufficient to distinguish decision-relevant search states.
Invoked by the design of the RL controller.

pith-pipeline@v0.9.1-grok · 5793 in / 1298 out tokens · 30933 ms · 2026-07-03T02:44:57.374055+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

11 extracted references · 1 canonical work pages

[1]

M. R. Garey and D. S. Johnson,Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979

1979
[2]

Worst-case performance bounds for simple one-dimensional packing algorithms,

D. S. Johnson, A. Demers, J. D. Ullman, M. R. Garey, and R. L. Graham, “Worst-case performance bounds for simple one-dimensional packing algorithms,”SIAM Journal on Computing, vol. 3, no. 4, pp. 299–325, 1974

1974
[3]

A hybrid grouping genetic algorithm for bin packing,

E. Falkenauer, “A hybrid grouping genetic algorithm for bin packing,” Journal of Heuristics, vol. 2, pp. 5–30, 1996

1996
[4]

Martello and P

S. Martello and P. Toth,Knapsack Problems: Algorithms and Computer Implementations. Wiley, 1990

1990
[5]

R. S. Sutton and A. G. Barto,Reinforcement Learning: An Introduction, 2nd ed. MIT Press, 2018

2018
[6]

Q-learning,

C. J. C. H. Watkins and P. Dayan, “Q-learning,”Machine Learning, vol. 8, pp. 279–292, 1992

1992
[7]

Hyper-heuristics: A survey of the state of the art,

E. K. Burke, M. Gendreau, M. Hyde, G. Kendall, G. Ochoa, S. Ozcan, and R. Qu, “Hyper-heuristics: A survey of the state of the art,”European Journal of Operational Research, vol. 229, no. 3, pp. 329–341, 2013

2013
[8]

Machine learning for combinato- rial optimization: a methodological tour d’horizon,

Y . Bengio, A. Lodi, and A. Prouvost, “Machine learning for combinato- rial optimization: a methodological tour d’horizon,”European Journal of Operational Research, vol. 290, no. 2, pp. 405–421, 2021

2021
[9]

Learning combinatorial optimization algorithms over graphs,

E. Khalil, H. Dai, Y . Zhang, B. Dilkina, and L. Song, “Learning combinatorial optimization algorithms over graphs,” inAdvances in Neural Information Processing Systems, 2017

2017
[10]

Metaheuristic approaches for one-dimensional bin packing problem: A comparative performance study,

C. Munien, S. Mahabeer, E. Dzitiro, S. Singh, S. Zungu, and A. E.- S. Ezugwu, “Metaheuristic approaches for one-dimensional bin packing problem: A comparative performance study,”IEEE Access, vol. 8, pp. 227438–227465, 2020, doi: 10.1109/ACCESS.2020.3046185

work page doi:10.1109/access.2020.3046185 2020
[11]

BPPLIB: a library for bin packing and cutting stock problems,

M. Delorme, M. Iori, and S. Martello, “BPPLIB: a library for bin packing and cutting stock problems,”Optimization Letters, vol. 12, pp. 235–250, 2018

2018

[1] [1]

M. R. Garey and D. S. Johnson,Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979

1979

[2] [2]

Worst-case performance bounds for simple one-dimensional packing algorithms,

D. S. Johnson, A. Demers, J. D. Ullman, M. R. Garey, and R. L. Graham, “Worst-case performance bounds for simple one-dimensional packing algorithms,”SIAM Journal on Computing, vol. 3, no. 4, pp. 299–325, 1974

1974

[3] [3]

A hybrid grouping genetic algorithm for bin packing,

E. Falkenauer, “A hybrid grouping genetic algorithm for bin packing,” Journal of Heuristics, vol. 2, pp. 5–30, 1996

1996

[4] [4]

Martello and P

S. Martello and P. Toth,Knapsack Problems: Algorithms and Computer Implementations. Wiley, 1990

1990

[5] [5]

R. S. Sutton and A. G. Barto,Reinforcement Learning: An Introduction, 2nd ed. MIT Press, 2018

2018

[6] [6]

Q-learning,

C. J. C. H. Watkins and P. Dayan, “Q-learning,”Machine Learning, vol. 8, pp. 279–292, 1992

1992

[7] [7]

Hyper-heuristics: A survey of the state of the art,

E. K. Burke, M. Gendreau, M. Hyde, G. Kendall, G. Ochoa, S. Ozcan, and R. Qu, “Hyper-heuristics: A survey of the state of the art,”European Journal of Operational Research, vol. 229, no. 3, pp. 329–341, 2013

2013

[8] [8]

Machine learning for combinato- rial optimization: a methodological tour d’horizon,

Y . Bengio, A. Lodi, and A. Prouvost, “Machine learning for combinato- rial optimization: a methodological tour d’horizon,”European Journal of Operational Research, vol. 290, no. 2, pp. 405–421, 2021

2021

[9] [9]

Learning combinatorial optimization algorithms over graphs,

E. Khalil, H. Dai, Y . Zhang, B. Dilkina, and L. Song, “Learning combinatorial optimization algorithms over graphs,” inAdvances in Neural Information Processing Systems, 2017

2017

[10] [10]

Metaheuristic approaches for one-dimensional bin packing problem: A comparative performance study,

C. Munien, S. Mahabeer, E. Dzitiro, S. Singh, S. Zungu, and A. E.- S. Ezugwu, “Metaheuristic approaches for one-dimensional bin packing problem: A comparative performance study,”IEEE Access, vol. 8, pp. 227438–227465, 2020, doi: 10.1109/ACCESS.2020.3046185

work page doi:10.1109/access.2020.3046185 2020

[11] [11]

BPPLIB: a library for bin packing and cutting stock problems,

M. Delorme, M. Iori, and S. Martello, “BPPLIB: a library for bin packing and cutting stock problems,”Optimization Letters, vol. 12, pp. 235–250, 2018

2018