pith.

arXiv: 2607.01984 · v1 · pith:4WMQJCSC · submitted 2026-07-02 · 💻 cs.LG · cs.AI · cs.CV

Do Newer Lightweight CNNs Perform Better Under Resource Constraints? A Controlled Multigenerational Study of Architecture, Initialization, Training Budget, and Efficiency

Pith reviewed 2026-07-03 17:18 UTC · model grok-4.3

classification: 💻 cs.LG · cs.AI · cs.CV
keywords: lightweight CNN · model comparison · efficiency · Pareto frontier · CIFAR · Tiny ImageNet · MobileNet · EfficientNet

The pith

Controlled tests find newer lightweight CNNs deliver selective rather than universal gains in accuracy and efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares nine lightweight CNN model packages on CIFAR-10, CIFAR-100, and Tiny ImageNet under one shared training protocol and evaluation setup. Holding the recipe fixed is meant to isolate architecture effects and test whether later designs consistently beat earlier ones on both accuracy and measured resource use. EfficientNetV2-S leads top-1 accuracy on CIFAR-10 and CIFAR-100. EfficientNet-B0 stays within 0.22 to 1.79 points of the best model on each dataset, uses far fewer parameters and operations, and lands on every Pareto frontier. MobileNetV3-Small also beats its successor MobileNetV4-Conv-Small in accuracy and CPU latency under the fixed conditions. Practitioners selecting models for constrained hardware therefore cannot assume that newer releases will dominate without running controlled comparisons.

Core claim

Under a shared downstream training protocol, newer lightweight CNN designs provide selective rather than universal gains. EfficientNetV2-S reaches the highest top-1 accuracy on CIFAR-10 and CIFAR-100. EfficientNet-B0 nonetheless remains within 0.22 to 1.79 points of the best result on each dataset. It does so while using roughly 79 percent fewer parameters and 86 percent fewer GMACs than EfficientNetV2-S, and it appears on every accuracy-resource Pareto frontier. MobileNetV3-Small records higher accuracy than MobileNetV4-Conv-Small on all three datasets, as well as lower measured CPU latency. Latency orderings also shift between GPU and CPU hardware. Scratch training leaves EfficientNet-B0 well below its pretrained performance even after 100 epochs.
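A point-estimate Pareto frontier keeps every model that no other model beats on both axes at once. Here, that means accuracy that is no lower and a resource cost that is no higher, with at least one of the two strictly better. The following is a minimal sketch of that rule. It assumes each model is summarized by a single accuracy number and a single resource number, where lower cost is better. The names `Point`, `dominates`, and `pareto_frontier`, and the toy numbers in the usage line, are hypothetical and are neither the paper's code nor its results.

```python
from typing import NamedTuple


class Point(NamedTuple):
    name: str
    accuracy: float   # top-1 accuracy, higher is better
    resource: float   # parameters, GMACs, latency, or memory; lower is better


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on both axes and strictly better on at least one."""
    no_worse = a.accuracy >= b.accuracy and a.resource <= b.resource
    strictly_better = a.accuracy > b.accuracy or a.resource < b.resource
    return no_worse and strictly_better


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Return the non-dominated points in order of increasing resource demand."""
    frontier = [p for p in points if not any(dominates(q, p) for q in points)]
    return sorted(frontier, key=lambda p: (p.resource, -p.accuracy))


# Toy example with made-up numbers: "wasteful" is dominated by "mid".
toy = [
    Point("tiny", 80.0, 1.0),
    Point("mid", 90.0, 5.0),
    Point("wasteful", 88.0, 8.0),
    Point("big", 92.0, 20.0),
]
print([p.name for p in pareto_frontier(toy)])  # ['tiny', 'mid', 'big']
```

Because each point comes from a single run, membership on such a frontier can flip when two models are close. The referee's minor comment about "point estimate" frontiers targets exactly this.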

What carries the argument

The shared downstream training protocol and evaluation setup that holds initialization, data augmentation, optimizer, and epoch budget fixed across nine model packages.
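One way to picture the fixed-recipe design is as a single immutable configuration object that every model shares, so that only the architecture name varies between runs. The following is a minimal sketch, assuming a Python training harness. The class `SharedProtocol`, the helper `build_runs`, the model-name strings, and every field value are hypothetical placeholders, not the paper's recipe; its exact optimizer settings and augmentation pipeline are not given on this page.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SharedProtocol:
    """One downstream recipe reused unchanged for every model package.

    All values are hypothetical placeholders, not the settings used in the paper.
    """
    initialization: str = "pretrained"   # switched per experiment, never per model
    augmentation: str = "random_crop+hflip"
    optimizer: str = "adamw"
    learning_rate: float = 1e-3
    epochs: int = 20
    batch_size: int = 128


# Hypothetical subset of the compared packages.
MODELS = ["efficientnet_b0", "mobilenetv3_small", "squeezenet1_1"]


def build_runs(protocol: SharedProtocol, models: list[str]) -> dict[str, SharedProtocol]:
    """Pair every model with the same protocol object, so only the architecture varies."""
    return {name: protocol for name in models}


pretrained_runs = build_runs(SharedProtocol(), MODELS)
# The scratch condition changes initialization for all models at once.
scratch_runs = build_runs(replace(SharedProtocol(), initialization="scratch"), MODELS)
```

Freezing the dataclass makes any accidental per-model tweak raise `dataclasses.FrozenInstanceError`. The design guarantees that the recipe is equal across models, not that it suits each one equally well. That gap is the concern behind the load-bearing premise.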

If this is right

  • EfficientNet-B0 remains competitive across datasets while using roughly 79% fewer parameters and 86% fewer GMACs than EfficientNetV2-S.
  • MobileNetV3-Small achieves lower GMAC count, faster CPU inference, and higher accuracy than MobileNetV4-Conv-Small under identical conditions.
  • Latency rankings differ sharply between NVIDIA L4 GPU and AMD Ryzen CPU, showing GMACs alone do not predict measured inference performance.
  • SqueezeNet1.1 records the fewest parameters and lowest peak CUDA memory but substantially weaker accuracy.
  • EfficientNet-B0 stays 3.29 to 17.54 points below its pretrained accuracy after 100 epochs of scratch training.
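The latency findings above rest on batch-size-1 median timings. The following is a minimal timing-harness sketch that assumes warm-up calls followed by a median over repeated calls. The function name `median_latency_ms` and the default warm-up and repeat counts are hypothetical. This is not the paper's measurement code.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(run_once: Callable[[], object], warmup: int = 10, repeats: int = 50) -> float:
    """Median wall-clock time of one call to run_once, in milliseconds.

    run_once should perform exactly one batch-size-1 forward pass. For GPU timing
    it must also wait for the device to finish before returning; otherwise
    asynchronous kernel launches make the measurement meaningless.
    """
    for _ in range(warmup):      # discard cold-start effects (caches, allocators, lazy init)
        run_once()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)  # less sensitive to OS jitter than the mean
```

With PyTorch, for example, `run_once` might wrap `model(x)` followed by `torch.cuda.synchronize()` on the GPU. On the CPU, `torch.set_num_threads(1)` or `torch.set_num_threads(4)` would fix the thread setting before timing. Running the same harness on both devices is enough to expose the rank changes the paper reports.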

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Model selection for edge devices benefits from empirical Pareto analysis rather than assuming later releases dominate.
  • The interaction between architecture and fixed training budget can favor certain earlier designs over their successors.
  • Hardware-specific latency measurements are required because operation counts do not reliably rank real inference speed.
  • Revisiting well-tuned older models may yield better efficiency trade-offs than adopting the newest variants without controlled re-evaluation.
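Whether operation counts rank real inference speed can be checked with a rank correlation between per-model GMACs and per-model median latency. The paper reports descriptive Spearman coefficients for this comparison in Figure 4. The following pure-Python sketch computes Spearman's rho as the Pearson correlation of average ranks, so tied values share a rank. The function names are hypothetical. `scipy.stats.spearmanr` computes the same statistic if SciPy is available.

```python
def average_ranks(values: list[float]) -> list[float]:
    """Rank values 1..n; tied values receive the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: list[float], y: list[float]) -> float:
    """Spearman's rank correlation between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rho is undefined when one sequence is constant")
    return cov / (vx * vy) ** 0.5
```

A value near 1 means that ordering models by GMACs would order them by latency almost perfectly. A value near 0.27, like the one Figure 4 reports for the L4, means the GMAC ordering says little about the measured GPU ordering.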

Load-bearing premise

The shared downstream training protocol and evaluation setup fairly represents typical real-world use cases without systematic biases from initialization or optimization details that favor one generation of models.

What would settle it

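Re-training the same nine models with each architecture's originally published training recipe and hyperparameters would test this premise. If that re-training reversed the observed accuracy ordering between MobileNetV3-Small and MobileNetV4-Conv-Small on CIFAR-10, the shared protocol rather than the architecture would be driving the result.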

Figures

Figures reproduced from arXiv: 2607.01984 by Tasnim Shahriar.

Figure 1. Pretrained top-1 test accuracy. Each model uses the same color and marker identity throughout the paper. Full top-5 results are provided in Appendix A.1.
Figure 2. Top-1 accuracy against parameters and GMACs. Black lines connect point estimate Pareto-efficient models in order of increasing resource demand. Across CIFAR-10 and CIFAR-100, the parameter frontier contains SqueezeNet1.1, ShuffleNetV2 x1.0, MobileNetV3-Small, EfficientNet-B0, and EfficientNetV2-S. The GMAC frontier contains MobileNetV3-Small, EfficientNet-B0, and EfficientNetV2-S. Tiny ImageNet changes the…
Figure 3. Top-1 accuracy against NVIDIA L4 median latency and Ryzen CPU one-thread median latency. Point estimate Pareto frontiers differ between the evaluated execution environments. The execution-environment-specific ranking shift is substantial. ResNet18 ranks first on the L4…
Figure 4. Latency rank changes across the evaluated NVIDIA L4 FP16 autocast and AMD Ryzen 5 5500U FP32 environments. Descriptive Spearman rank correlations [32] reinforce the same conclusion. GMACs correlate strongly with CPU median latency, ρ = 0.95 for one thread and ρ = 0.93 for four threads, but weakly with L4 median latency, ρ = 0.27. L4 and one-thread CPU latency rankings correlate only moderately, ρ = 0.40. …
Figure 5. EfficientNet-B0 scratch validation accuracy against cumulative recorded training time, with separate endpoint bars for pretrained, maximum 20 epoch scratch, and fresh 100 epoch scratch schedules. The scratch curves represent distinct runs rather than one continued schedule.
Figure 6. MobileNetV3-Small and MobileNetV4-Conv-S final accuracy, scratch validation accuracy by epoch, and scratch validation accuracy by cumulative recorded training time. The result supports a conditional conclusion. The evaluated later-generation MobileNetV4-Conv-S model package does not improve upon MobileNetV3-Small under the tested public checkpoints, downstream recipe, scratch budget, CPU environment, param…
Figure 7. Selected paired test-set accuracy differences and 95% bootstrap intervals. Open markers indicate intervals that cross zero, while filled markers indicate intervals that exclude zero. These intervals condition on one trained model per setting and do not capture training-seed variability. The numerical paired-comparison table is provided in Appendix A.4.
Figure 8. Top-1 accuracy against peak PyTorch CUDA allocated tensor memory during batch size 1 FP16 autocast inference. Frontiers are calculated independently for each dataset.
Original abstract

Newer lightweight convolutional neural networks are often presented as improving predictive performance and deployment efficiency, but such claims require controlled evaluation. This study compares nine lightweight CNN model packages across CIFAR-10, CIFAR-100, and Tiny ImageNet under a shared downstream protocol. We report top-1 accuracy, macro F1, top-5 accuracy, parameter count, FP32 storage, GMACs, batch-size-1 latency on an NVIDIA L4 GPU and an AMD Ryzen 5 5500U CPU, peak PyTorch CUDA allocated tensor memory, and point estimate Pareto frontiers. EfficientNetV2-S achieves the highest observed top-1 accuracy on CIFAR-10 and CIFAR-100 at 97.57% and 86.98%, while RepViT-M1.0 leads Tiny ImageNet at 79.87%. EfficientNet-B0 remains within 0.22, 0.85, and 1.79 percentage points of the best result on the three datasets while using approximately 79% fewer parameters and 86% fewer GMACs than EfficientNetV2-S. It also appears on every evaluated accuracy and resource Pareto frontier, making it the most consistently competitive intermediate-budget option. MobileNetV3-Small has the lowest GMAC count, is the fastest model under both CPU thread settings, and records higher observed accuracy than MobileNetV4-Conv-S on all three datasets. Under random initialization, it leads MobileNetV4-Conv-S by 2.55, 1.76, and 0.99 points, with paired test-set intervals excluding zero for the fixed trained models. EfficientNet-B0 remains 3.29, 10.10, and 17.54 points below its pretrained counterpart after 100 epochs of scratch training, despite requiring about five times the recorded training time. SqueezeNet1.1 has the fewest parameters and lowest peak CUDA allocation, but substantially weaker accuracy. Latency rankings differ sharply between the L4 and CPU environments, showing that GMACs alone do not predict measured inference performance. Overall, newer designs provide selective rather than universal gains.
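The "paired test-set intervals" in the abstract, which Figure 7 describes as 95% bootstrap intervals, compare two fixed trained models on the same test images. The following is a minimal sketch of a percentile paired bootstrap for the accuracy difference. It assumes that resampling happens over test images and that each image keeps both models' outcomes together. The function name, resample count, and seed are hypothetical, and the paper's exact bootstrap variant is not stated on this page.

```python
import random


def paired_bootstrap_ci(correct_a: list[bool], correct_b: list[bool],
                        n_boot: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for accuracy(A) - accuracy(B) on one shared test set.

    Each resample draws test images with replacement and keeps both models' outcomes
    for the same image together, which is what makes the interval paired.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same test images")
    n = len(correct_a)
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

An interval that excludes zero says the gap is unlikely to be an artifact of which test images were drawn. As the Figure 7 caption notes, though, it conditions on one trained model per setting and says nothing about training-seed variability.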

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper performs a controlled empirical comparison of nine lightweight CNN packages (EfficientNet-B0/V2-S, MobileNetV3-Small/V4-Conv-S, RepViT-M1.0, SqueezeNet1.1 and others) on CIFAR-10, CIFAR-100 and Tiny ImageNet. It uses a single shared downstream training protocol, with a fixed epoch budget, augmentation pipeline, and optimizer, applied under both pretrained and random initialization. It reports top-1 accuracy, macro F1, top-5 accuracy, parameter count, GMACs, GPU/CPU latency, peak CUDA memory and accuracy-resource Pareto frontiers. The main claims are:
  • EfficientNetV2-S leads CIFAR accuracy.
  • EfficientNet-B0 remains competitive with far lower resources and appears on all frontiers.
  • MobileNetV3-Small outperforms MobileNetV4-Conv-S under the protocol, including under random initialization.
  • Newer designs yield only selective rather than universal gains.

Significance. If the shared protocol is architecture-neutral, the work supplies concrete, multi-metric evidence that newer lightweight CNNs do not deliver uniform improvements under fixed resource budgets, underscoring the value of older baselines like EfficientNet-B0 and the poor correlation between GMACs and measured latency. The explicit Pareto analysis and cross-environment latency comparison are useful for practitioners choosing models under deployment constraints.

major comments (2)
  1. [Methods, training protocol] The central claim that 'newer designs provide selective rather than universal gains' rests on the assumption that the fixed shared recipe is fair across generations. Newer models (EfficientNetV2-S, MobileNetV4) were originally published with progressive learning, distinct RandAugment policies and optimizer schedules. No ablation or per-model hyperparameter search is reported to separate architecture effects from protocol mismatch. This directly affects interpretation of the MobileNetV3-Small vs MobileNetV4-Conv-S gap and of the EfficientNet-B0 competitiveness result.
  2. [Results, accuracy tables and Pareto frontiers] Only point estimates are shown for most comparisons. Paired intervals are mentioned for the MobileNetV3/V4 case, but no multi-seed runs, statistical tests, or variance estimates are provided for the 0.22–1.79 pp gaps cited for EfficientNet-B0 or for the RepViT Tiny ImageNet lead. This weakens the load-bearing claim that EfficientNet-B0 'remains within X points of the best result' and appears on every frontier.
minor comments (2)
  1. [Abstract and Results] The phrase 'point estimate Pareto frontiers' is used without clarifying whether the frontiers are constructed from single runs or whether uncertainty bands are considered.
  2. [Methods] The manuscript should explicitly state the exact data-augmentation pipeline, learning-rate schedule, and optimizer hyperparameters used in the shared protocol so readers can assess compatibility with each model's original recipe.
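The second major comment asks for variance across random seeds. The following is a minimal sketch of what that reporting could look like. It assumes several independent training runs per model, each scored once on the test set. The function names and the per-seed inputs are hypothetical, and the paper reports one trained model per setting.

```python
import statistics


def seed_summary(accuracies: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation of one model's accuracy across training seeds."""
    if len(accuracies) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def welch_t(a: list[float], b: list[float]) -> float:
    """Welch's t statistic for mean(a) - mean(b), allowing unequal variances."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se
```

If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns the same Welch statistic together with a p-value. With only a handful of seeds, a 0.22-point gap would need seed-to-seed spread well below that size to be distinguishable.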

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We respond to each major point below, focusing on the controlled nature of the study.

  1. Referee: Methods (training protocol description): The central claim that 'newer designs provide selective rather than universal gains' rests on the assumption that the fixed shared recipe is fair across generations. Newer models (EfficientNetV2-S, MobileNetV4) were originally published with progressive learning, distinct RandAugment policies and optimizer schedules. No ablation or per-model hyperparameter search is reported to separate architecture effects from protocol mismatch. This directly affects interpretation of the MobileNetV3-Small vs MobileNetV4-Conv-S gap and of the EfficientNet-B0 competitiveness result.

    Authors: The manuscript's contribution is a controlled comparison under one shared downstream protocol (explicitly described in Methods and the abstract) to isolate architecture effects rather than to optimize each model individually. This fixed-recipe design is the basis for the 'selective rather than universal gains' claim. We will add an explicit statement in the Methods and Discussion sections clarifying that no per-model hyperparameter search or protocol adaptation was performed, so results reflect performance under the common recipe and not necessarily the best attainable accuracy for each architecture. revision: partial

  2. Referee: Results (accuracy tables and Pareto frontiers): Only point estimates are shown for most comparisons; while paired intervals are mentioned for the MobileNetV3/V4 case, no multiple random seeds, statistical tests, or variance estimates are provided for the 0.22–1.79 pp gaps cited for EfficientNet-B0 or the RepViT Tiny-ImageNet lead. This weakens the load-bearing claim that EfficientNet-B0 'remains within X points of the best result' and appears on every frontier.

    Authors: Paired test-set intervals were computed and reported for the MobileNetV3/V4 comparison. For the remaining gaps we report single-run point estimates, which is standard in many architecture benchmark papers. We acknowledge that multi-seed variance would strengthen the claims. We will add a limitations paragraph noting the single-seed point estimates for most metrics and the consequent reliance on observed differences rather than statistical tests. revision: partial
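The reference list includes McNemar's test [33], which is the classic paired test for two classifiers scored on the same test images. Whether and how the paper applies it is not visible on this page. The following minimal sketch uses the exact binomial form of the test and assumes per-image correctness vectors for two fixed trained models. The function name is hypothetical.

```python
from math import comb


def mcnemar_exact_p(correct_a: list[bool], correct_b: list[bool]) -> float:
    """Two-sided exact McNemar p-value computed from the discordant test images.

    b counts images that only model A classifies correctly, and c counts images
    that only model B does. Under the null hypothesis each discordant image is
    equally likely to fall either way, so min(b, c) ~ Binomial(b + c, 0.5).
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Like the bootstrap interval, this test treats each trained model as fixed. It can therefore support statements about observed test-set differences, which is the scope the rebuttal commits to, but not claims about training-seed variability.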

Circularity Check

0 steps flagged

Purely empirical comparison; no derivations or self-referential predictions

The manuscript performs a controlled multigenerational empirical study of nine lightweight CNN packages on CIFAR-10/100 and Tiny ImageNet. It reports measured top-1 accuracy, F1, latency, GMACs, memory, and Pareto frontiers under one fixed downstream protocol. No equations, fitted parameters, uniqueness theorems, or ansatzes appear; the headline claim of selective rather than universal gains is a direct summary of the tabulated experimental outcomes. No load-bearing step reduces to a self-citation or to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on specific free parameters, axioms, or invented entities; the study relies on standard CNN training and evaluation practices whose details are not stated.

pith-pipeline@v0.9.1-grok · 5943 in / 1036 out tokens · 23802 ms · 2026-07-03T17:18:43.589421+00:00 · methodology

Reference graph

Works this paper leans on

35 extracted references

  [1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
  [2] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 248–255.
  [3] A. Krizhevsky, “Learning multiple layers of features from tiny images,” University of Toronto, Technical Report, 2009.
  [4] Y. Le and X. Yang, “Tiny ImageNet visual recognition challenge,” Stanford CS231n Course Project, 2015.
  [5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
  [6] F. N. Iandola et al., “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and less than 0.5MB model size,” arXiv:1602.07360, 2016.
  [7] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4510–4520.
  [8] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “ShuffleNet V2: Practical guidelines for efficient CNN architecture design,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 116–131.
  [9] A. Howard et al., “Searching for MobileNetV3,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 1314–1324.
  [10] M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 6105–6114.
  [11] M. Tan and Q. Le, “EfficientNetV2: Smaller models and faster training,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 10096–10106.
  [12] P. K. A. Vasu, J. Gabriel, J. Zhu, O. Tuzel, and A. Ranjan, “MobileOne: An improved one millisecond mobile backbone,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2023, pp. 7907–7917.
  [13] J. Chen et al., “Run, Don’t Walk: Chasing higher FLOPS for faster neural networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2023, pp. 12021–12031.
  [14] P. K. A. Vasu, J. Gabriel, J. Zhu, O. Tuzel, and A. Ranjan, “FastViT: A fast hybrid vision transformer using structural reparameterization,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2023, pp. 5785–5795.
  [15] A. Wang, H. Chen, Z. Lin, J. Han, and G. Ding, “RepViT: Revisiting mobile CNN from ViT perspective,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2024, pp. 15909–15920.
  [16] D. Qin et al., “MobileNetV4: Universal models for the mobile ecosystem,” in Proc. Eur. Conf. Comput. Vis., 2024, pp. 78–96.
  [17] X. Ma, X. Dai, Y. Bai, Y. Wang, and Y. Fu, “Rewrite the Stars,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2024, pp. 5694–5703.
  [18] A. Wang, H. Chen, Z. Lin, J. Han, and G. Ding, “LSNet: See Large, Focus Small,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2025, pp. 9718–9729.
  [19] Y. Wang and W. Xi, “UniConvNet: Expanding effective receptive field while maintaining asymptotically Gaussian distribution for ConvNets of any scale,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2025, pp. 20922–20933.
  [20] S. Kornblith, M. Norouzi, H. Lee, and G. Hinton, “Do better ImageNet models transfer better?” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 2661–2671.
  [21] K. He, R. Girshick, and P. Dollár, “Rethinking ImageNet pre-training,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 4918–4927.
  [22] M. Goldblum et al., “Battle of the backbones: A large-scale comparison of pretrained models across computer vision tasks,” in Advances in Neural Information Processing Systems, vol. 36, Datasets and Benchmarks Track, 2023.
  [23] T. Pégeot, I. Kucher, A. Popescu, and B. Delezoide, “A comprehensive study of transfer learning under constraints,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops, 2023, pp. 1148–1157.
  [24] R. Tiwari et al., “RCV2023 challenges: Benchmarking model training and inference for resource-constrained deep learning,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. Workshops, 2023, pp. 1534–1543.
  [25] P. Jeevan and A. Sethi, “Which backbone to use: A resource-efficient domain specific comparison for computer vision,” Transactions on Machine Learning Research, 2025.
  [26] J. Guerin, S. Bansal, A. Shaban, P. Mann, and H. Gazula, “Vision backbone efficient selection for image classification in low-data regimes,” in Proc. 36th British Mach. Vis. Conf., 2025, Paper 788.
  [27] T. Shahriar, “Comparative Analysis of Lightweight CNNs for Resource-Constrained Devices: Predictive Performance, Efficiency Trade-offs, and Initialization Effects,” arXiv:2505.03303, 2025.
  [28] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in Proc. Int. Conf. Learn. Represent., 2019.
  [29] I. Loshchilov and F. Hutter, “SGDR: Stochastic gradient descent with warm restarts,” in Proc. Int. Conf. Learn. Represent., 2017.
  [30] C. Szegedy et al., “Rethinking the Inception architecture for computer vision,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2818–2826.
  [31] K. Miettinen, Nonlinear Multiobjective Optimization. Boston, MA, USA: Kluwer Academic Publishers, 1999.
  [32] C. Spearman, “The proof and measurement of association between two things,” The American Journal of Psychology, vol. 15, no. 1, pp. 72–101, 1904.
  [33] Q. McNemar, “Note on the sampling error of the difference between correlated proportions or percentages,” Psychometrika, vol. 12, pp. 153–157, 1947.
  [34] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. New York, NY, USA: Chapman and Hall, 1993.
  [35] R. Wightman, “PyTorch Image Models,” GitHub repository and Zenodo software archive, 2019.