WBMM: Windowed Batch Matrix Multiplication for Efficient Large Receptive Field Convolution
Pith reviewed 2026-07-03 15:36 UTC · model grok-4.3
The pith
Windowed batch matrix multiplication enables efficient computation of large receptive field convolutions by converting irregular memory access into regular batched matrix operations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WBMM partitions the input feature map into contiguous windows and indexes a compact relative-position bias table to construct the weight matrices, allowing the large receptive field depthwise convolution to be performed as a batched matrix multiplication with regular memory access patterns.
What carries the argument
The WBMM operator itself: it partitions the input into contiguous windows and builds each window's weight matrix by indexing a compact relative-position bias table, so the large receptive field convolution runs as a batched matrix multiplication with regular memory access.
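The construction described above can be sketched for a single window of a single channel. This is an illustrative reading, not the authors' code: the function names, the flat table layout of size (2*ws - 1)**2, and the "input offset minus output offset" indexing convention are all assumptions. A real implementation would batch the final product over windows and channels rather than loop in Python.

```python
# Sketch only: the table layout and index convention are assumptions,
# not taken from the paper or its repository.

def relative_index(ws):
    """index[i][j] -> slot in a flat table of size (2*ws - 1)**2.

    i is the output position, j the input position, both row-major
    inside a ws x ws window. The slot depends only on the offset j - i.
    """
    coords = [(y, x) for y in range(ws) for x in range(ws)]
    span = 2 * ws - 1
    return [[(yj - yi + ws - 1) * span + (xj - xi + ws - 1)
             for (yj, xj) in coords]
            for (yi, xi) in coords]

def build_weight_matrix(table, index):
    """Gather table entries into a dense (ws*ws) x (ws*ws) weight matrix."""
    return [[table[k] for k in row] for row in index]

def wbmm_window(window, table, ws):
    """Apply the operator to one flattened window as a matrix-vector product."""
    weights = build_weight_matrix(table, relative_index(ws))
    return [sum(w * v for w, v in zip(row, window)) for row in weights]

ws = 2
table = [0.0] * (2 * ws - 1) ** 2
table[(ws - 1) * (2 * ws - 1) + (ws - 1)] = 1.0  # weight only on the zero offset
window = [1.0, 2.0, 3.0, 4.0]                     # row-major 2x2 window
out = wbmm_window(window, table, ws)              # reduces to the identity here
```

With weight placed only on the zero offset the weight matrix is the identity, which makes the gather step easy to inspect before trying a non-trivial table.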
If this is right
- WBMM with 14x14 windows runs faster than 5x5 depthwise convolution while providing a 7.8 times larger per-layer receptive field.
- Combined with inter-block cross-window communication and hierarchical window reparameterization, it achieves 1.31-1.88x training speedup with comparable or higher accuracy on ImageNet-1K, COCO, and ADE20K.
- Throughput improves as window size increases, opposite to standard depthwise convolutions.
- Advantages hold across GPU, CPU, and edge devices without needing specialized kernels.
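The 7.8x figure in the first bullet is consistent with a simple area ratio between a 14x14 window and a 5x5 kernel, although the source does not state how it was computed. The short calculation below also shows why the bias table is "compact", assuming (as a hypothetical layout) one entry per relative offset, i.e. (2w - 1)^2 entries for a w x w window.

```python
window, kernel = 14, 5

# Per-layer receptive field, read as covered area (an assumption).
area_ratio = (window * window) / (kernel * kernel)   # 196 / 25

# Assumed table layout: one weight per relative offset inside a window.
table_entries = (2 * window - 1) ** 2

# Dense per-window weight matrix that the table is expanded into.
matrix_entries = (window * window) ** 2

print(area_ratio, table_entries, matrix_entries)     # 7.84 729 38416
```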
Where Pith is reading between the lines
- This method may allow vision transformers or CNNs to incorporate much larger kernels than previously practical.
- Similar windowing and bias table techniques could be applied to other operations suffering from irregular memory access in deep learning.
- By avoiding the need for custom acceleration kernels, it lowers the barrier for deploying large receptive field models on diverse hardware.
Load-bearing premise
Constructing the weight matrices from the compact relative-position bias table inside each window exactly preserves the receptive field coverage and numerical behavior of a full large-kernel depthwise convolution.
What would settle it
A direct numerical comparison showing whether the output of WBMM matches that of a standard large-kernel depthwise convolution within floating point tolerance on the same input, or an accuracy difference exceeding 0.5% on ImageNet-1K when replacing one with the other in a model.
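A minimal sketch of such a numerical comparison, restricted to one window, is given below. It assumes a flat offset table of size (2*ws - 1)**2 and compares the window-local operator against a direct cross-correlation that is zero-padded outside the window. All names are hypothetical, and the check says nothing about behavior across window boundaries, which would need a separate test against a convolution over the full feature map.

```python
import random

def wbmm_window(window, table, ws):
    """Window-local operator: weights are looked up by relative offset."""
    span = 2 * ws - 1
    coords = [(y, x) for y in range(ws) for x in range(ws)]
    out = []
    for (yi, xi) in coords:
        acc = 0.0
        for (yj, xj), v in zip(coords, window):
            acc += table[(yj - yi + ws - 1) * span + (xj - xi + ws - 1)] * v
        out.append(acc)
    return out

def reference_conv(window, table, ws):
    """Direct (2*ws-1) x (2*ws-1) cross-correlation, zero outside the window."""
    span = 2 * ws - 1
    pad = ws - 1
    out = []
    for y in range(ws):
        for x in range(ws):
            acc = 0.0
            for ky in range(span):
                for kx in range(span):
                    sy, sx = y + ky - pad, x + kx - pad
                    if 0 <= sy < ws and 0 <= sx < ws:
                        acc += table[ky * span + kx] * window[sy * ws + sx]
            out.append(acc)
    return out

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two outputs."""
    return max(abs(p - q) for p, q in zip(a, b))

random.seed(0)
ws = 4
table = [random.uniform(-1, 1) for _ in range((2 * ws - 1) ** 2)]
window = [random.uniform(-1, 1) for _ in range(ws * ws)]
diff = max_abs_diff(wbmm_window(window, table, ws),
                    reference_conv(window, table, ws))
assert diff < 1e-12  # tolerance chosen for double precision in this sketch
```

The same pattern extends to the 14x14 scale by changing `ws`; a framework-level test would swap the loops for the library's depthwise convolution and use a tolerance suited to the tensor dtype.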
Original abstract
Large kernel depthwise convolutions achieve strong performance but suffer from significant degradation as kernel size grows due to irregular memory access from gather-based computation; while Large Kernel Acceleration (LKA) helps on small feature maps, it becomes counterproductive on large feature maps, even slower than non-accelerated implementations. We propose Windowed Batch Matrix Multiplication (WBMM), which partitions input into contiguous windows and indexes a compact relative position bias table to construct weight matrices, enabling regular memory access via batched matrix multiplication. This yields a unique property: WBMM's throughput improves with larger windows, opposite to depthwise convolutions that degrade with larger kernels. Operator-level benchmarks show WBMM with 14x14 windows outperforms 5x5 depthwise convolution baselines in speed while providing a 7.8x larger per-layer receptive field. Combined with inter-block cross-window communication and hierarchical window reparameterization, WBMM achieves comparable or higher accuracy on ImageNet-1K, COCO, and ADE20K with 1.31-1.88x training speedup, and demonstrates consistent advantages across GPU, CPU, and edge devices without requiring specialized acceleration kernels. Our code is available at http://github.com/wansong-s/WBMM
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Windowed Batch Matrix Multiplication (WBMM) to implement large receptive-field depthwise convolutions by partitioning feature maps into contiguous windows, indexing a compact relative-position bias table to build weight matrices, and performing batched matrix multiplication for regular memory access. It claims this yields throughput that improves with window size (opposite to standard depthwise conv), operator-level speedups with 14x14 windows versus 5x5 baselines while delivering 7.8x larger per-layer receptive field, and, when augmented with inter-block cross-window communication plus hierarchical reparameterization, 1.31-1.88x training speedups with comparable or higher accuracy on ImageNet-1K, COCO, and ADE20K across GPU/CPU/edge devices; code is released.
Significance. If the equivalence between WBMM and reference large-kernel depthwise convolution holds exactly (receptive-field coverage and numerics) and the reported speedups prove reproducible without hidden accuracy loss, the approach could provide a practical route to scale receptive fields in convolutional backbones without custom kernels or specialized hardware, with potential impact on efficient vision model design.
major comments (2)
- [§3] §3 (WBMM construction): the central claim that indexing the compact relative-position bias table and performing windowed batched matmul exactly replicates the receptive-field coverage and numerical behavior of a gather-based large-kernel depthwise convolution is asserted but unsupported by any direct element-wise output comparison, kernel-equivalence test, or boundary-handling verification at the 14x14 scale used for the speed claims; without this check, discrepancies in effective kernel support or accumulation order remain possible even if downstream accuracies match.
- [§4, §5] §4 (operator benchmarks) and §5 (end-to-end results): the reported 1.31-1.88x training speedups and accuracy numbers on ImageNet-1K/COCO/ADE20K lack accompanying experimental protocol details, hardware specifications, error bars, or full ablation tables isolating the contribution of WBMM versus the added cross-window and reparameterization components; this makes it impossible to assess whether the performance advantage is load-bearing or reproducible.
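The boundary-handling concern in the first major comment can be made concrete with a one-dimensional toy case. The sketch assumes non-overlapping windows and no cross-window step, which may not match the full method (the paper adds inter-block cross-window communication); function names and the table layout are hypothetical. An impulse on the last position of one window influences the neighbouring window under a global convolution, but not under a purely window-local operator.

```python
def windowed_1d(signal, table, ws):
    """Window-local operator: each output only sees inputs in its own window."""
    out = []
    for start in range(0, len(signal), ws):
        win = signal[start:start + ws]
        for i in range(len(win)):
            out.append(sum(table[j - i + ws - 1] * win[j]
                           for j in range(len(win))))
    return out

def global_1d(signal, table, ws):
    """Zero-padded cross-correlation over the whole signal, same offsets."""
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k in range(2 * ws - 1):
            j = i + k - (ws - 1)
            if 0 <= j < n:
                acc += table[k] * signal[j]
        out.append(acc)
    return out

ws = 4
table = [1.0] * (2 * ws - 1)   # equal weight on every offset in [-3, 3]
signal = [0.0] * 8
signal[3] = 1.0                # impulse on the last position of window 0

local = windowed_1d(signal, table, ws)
full = global_1d(signal, table, ws)
print(local)  # [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
print(full)   # [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]
```

The two outputs agree inside the impulse's own window and differ in the next one, which is the kind of discrepancy an element-wise comparison at window borders would surface.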
minor comments (2)
- The abstract states concrete speed/accuracy numbers but the main text should explicitly cross-reference the corresponding tables/figures for each claim.
- Notation for window partitioning and bias-table indexing could be clarified with a small diagram or pseudocode snippet to aid reproducibility.
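As an example of the kind of snippet the second minor comment asks for, window partitioning can be sketched as below. The helper names are hypothetical and the code assumes the feature-map height and width are divisible by the window size; how the paper handles other sizes is not stated in the abstract.

```python
def window_partition(fmap, ws):
    """Split an H x W map (list of rows) into row-major flattened ws x ws windows."""
    h, w = len(fmap), len(fmap[0])
    assert h % ws == 0 and w % ws == 0, "sketch assumes H and W divisible by ws"
    windows = []
    for wy in range(0, h, ws):
        for wx in range(0, w, ws):
            windows.append([fmap[wy + dy][wx + dx]
                            for dy in range(ws) for dx in range(ws)])
    return windows

def window_merge(windows, h, w, ws):
    """Inverse of window_partition: scatter windows back into an H x W map."""
    fmap = [[None] * w for _ in range(h)]
    per_row = w // ws
    for n, win in enumerate(windows):
        wy, wx = (n // per_row) * ws, (n % per_row) * ws
        for k, v in enumerate(win):
            fmap[wy + k // ws][wx + k % ws] = v
    return fmap

fmap = [[r * 4 + c for c in range(4)] for r in range(4)]
windows = window_partition(fmap, 2)
restored = window_merge(windows, 4, 4, 2)
print(windows[0])        # [0, 1, 4, 5]
print(restored == fmap)  # True
```

Each flattened window becomes one row of the batch that the matrix multiplication runs over, so elements of a window sit next to each other once partitioned.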
Simulated Author's Rebuttal
We thank the referee for the careful review and constructive comments on the WBMM manuscript. We address the two major comments below and will incorporate revisions to strengthen the claims on equivalence and experimental reproducibility.
- Referee: [§3] §3 (WBMM construction): the central claim that indexing the compact relative-position bias table and performing windowed batched matmul exactly replicates the receptive-field coverage and numerical behavior of a gather-based large-kernel depthwise convolution is asserted but unsupported by any direct element-wise output comparison, kernel-equivalence test, or boundary-handling verification at the 14x14 scale used for the speed claims; without this check, discrepancies in effective kernel support or accumulation order remain possible even if downstream accuracies match.
  Authors: We agree that an explicit numerical verification would make the equivalence claim more robust. By construction, WBMM partitions the feature map into contiguous windows and uses the compact relative-position bias table to assemble the exact weight matrix for each window before batched matrix multiplication; this is mathematically identical to the gather-based depthwise convolution (same kernel weights, same receptive-field support per output position, and identical accumulation). Window boundaries are handled by the partitioning scheme to preserve the original convolution semantics without padding artifacts inside windows. Nevertheless, we will add a direct element-wise output comparison (including L2 difference and boundary cases) between WBMM and a reference large-kernel depthwise convolution implementation at the 14x14 scale in the revised §3.
  Revision: yes
- Referee: [§4, §5] §4 (operator benchmarks) and §5 (end-to-end results): the reported 1.31-1.88x training speedups and accuracy numbers on ImageNet-1K/COCO/ADE20K lack accompanying experimental protocol details, hardware specifications, error bars, or full ablation tables isolating the contribution of WBMM versus the added cross-window and reparameterization components; this makes it impossible to assess whether the performance advantage is load-bearing or reproducible.
  Authors: We acknowledge that the current manuscript provides insufficient protocol transparency. The released code (http://github.com/wansong-s/WBMM) already contains the full training and benchmarking scripts, but we will expand §§4–5 with: (i) complete experimental protocols (optimizer, learning-rate schedule, data augmentation, batch size, number of epochs), (ii) hardware specifications (GPU/CPU/edge device models and software versions), (iii) error bars computed over at least three independent runs, and (iv) expanded ablation tables that isolate the WBMM operator from the inter-block cross-window communication and hierarchical reparameterization modules. These additions will allow readers to assess the contribution and reproducibility of each component.
  Revision: yes
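For the operator benchmarks, reporting a mean and spread over repeated measurements could look like the sketch below. It is a generic timing harness built on the Python standard library, not the authors' benchmarking script; `dummy_workload` is a hypothetical stand-in for an operator forward pass, and repeated timings of one process are not the same thing as independent training runs with different seeds.

```python
import statistics
import time

def time_runs(fn, repeats=3, warmup=1):
    """Return (mean, sample standard deviation) of wall-clock seconds for fn()."""
    for _ in range(warmup):
        fn()                       # warm caches before measuring
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)

def dummy_workload():
    # Hypothetical placeholder for a WBMM or depthwise-convolution forward pass.
    return sum(i * i for i in range(10_000))

mean_s, std_s = time_runs(dummy_workload, repeats=3)
print(f"{mean_s * 1e3:.3f} ms +/- {std_s * 1e3:.3f} ms over 3 runs")
```

On accelerators with asynchronous execution, the measured function would also need to wait for the device to finish before the timer stops, otherwise only the launch cost is recorded.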
Circularity Check
No circularity; direct algorithmic construction with empirical benchmarks
The provided abstract and description contain no equations, derivations, or self-citations that reduce any claimed result (speedups, receptive-field size, or accuracy) to a quantity defined by the method itself or by fitted parameters. WBMM is presented as an explicit construction (window partitioning, indexing of a relative-position bias table, and batched matmul), and the reported operator benchmarks and end-to-end speedups are measured outcomes rather than quantities forced by definition. No load-bearing self-citation chains, uniqueness theorems, or ansatzes appear. The skeptic's concern about exact numerical equivalence is a verification issue, not a circularity issue.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: batched matrix multiplication on modern GPUs and CPUs exhibits regular memory access and scales favorably with matrix size.