High-Performance NTT Accelerators for PQC leveraging Unified Redundant Arithmetic and Fine-Tuned Microarchitecture
Pith reviewed 2026-07-02 04:18 UTC · model grok-4.3
The pith
A redundant number representation removes conditional corrections from NTT hardware for post-quantum cryptography.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a novel redundant number representation eliminates conditional corrections for both Montgomery modulo multiplication and combined subtract-multiply operations, integrates inverse-transform scaling into existing arithmetic hardware, and supports hierarchical Montgomery multipliers that map efficiently onto FPGA DSP resources, resulting in higher clock frequencies, reduced execution times, and competitive resource utilization for NTT/INTT acceleration.
What carries the argument
Unified butterfly units built on a redundant number representation that removes conditional corrections and folds scaling into the arithmetic path, together with hierarchical Montgomery multipliers mapped to FPGA DSP blocks.
If this is right
- NTT and INTT operations complete in fewer cycles at higher clock rates on FPGA.
- No separate scaling hardware is required after the inverse transform.
- Montgomery multipliers occupy fewer DSP blocks while sustaining higher frequencies.
- The accelerators support both encryption and signature schemes in the same hardware.
- Resource utilization remains competitive with prior designs while delivering speed gains.
Where Pith is reading between the lines
- The same redundant representation might reduce power draw on FPGA or ASIC targets, though power numbers are not supplied.
- The approach could be ported to other lattice-based primitives that rely on similar modular polynomial arithmetic.
- If the representation generalizes, it may simplify software implementations on processors that already support redundant arithmetic.
- Verification of the full design on multiple FPGA families would strengthen the claim that the gains are platform-independent.
Load-bearing premise
The redundant number representation and unified butterfly units can be realized on FPGA without correctness errors, extra latency, or hidden resource costs that cancel the claimed gains.
What would settle it
Implement the accelerator on an FPGA, run NTT and INTT on known test polynomials, and check whether outputs match reference results while measuring clock frequency and resource counts against the reported values.
Figures
read the original abstract
Post-quantum cryptography and privacy-preserving technologies are expected to play a central role in future secure communication systems. Lattice-based PQC schemes such as ML-KEM (CRYSTALS-Kyber) and ML-DSA (CRYSTALS-Dilithium) rely heavily on large-degree polynomial arithmetic, making the Number Theoretic Transform (NTT) a key computational primitive. Although existing hardware accelerators exploit parallelism and pipelining to support both NTT and INTT, their efficiency is often limited by the overhead of modular reduction and correction steps, inverse-transform scaling operations, and suboptimal FPGA implementations. This work addresses these limitations by proposing parallel iterative NTT/INTT accelerators based on optimized unified butterfly units. We introduce a novel redundant number representation that eliminates conditional corrections for both Montgomery modulo multiplication and combined subtract-multiply operations, and integrate inverse-transform scaling into existing arithmetic hardware to avoid dedicated scaling units. Furthermore, we design hierarchical Montgomery multipliers that map efficiently onto FPGA DSP resources, reducing hardware cost while enabling high operating frequencies. FPGA-based experimental results demonstrate higher clock frequencies, reduced execution times, and competitive resource utilization, supporting efficient NTT acceleration for PQC and related privacy-preserving applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes parallel iterative NTT/INTT accelerators for post-quantum cryptography (ML-KEM, ML-DSA) based on optimized unified butterfly units. It introduces a novel redundant number representation claimed to eliminate conditional corrections for Montgomery modulo multiplication and combined subtract-multiply operations, integrates inverse-transform scaling into existing arithmetic hardware, and designs hierarchical Montgomery multipliers for efficient FPGA DSP mapping. FPGA-based experiments are stated to show higher clock frequencies, reduced execution times, and competitive resource utilization.
Significance. If the redundant representation and microarchitecture deliver the claimed elimination of corrections without hidden costs or errors, and if the experimental results are substantiated, the work could improve efficiency of hardware accelerators for lattice-based PQC. The folding of scaling and DSP-efficient multiplier design are potentially useful techniques for reducing overhead in FPGA implementations of NTT.
major comments (2)
- [Abstract] Abstract: the claim that the novel redundant number representation eliminates conditional corrections for both Montgomery modulo multiplication and combined subtract-multiply operations is asserted without equations, proof sketch, simulation trace, or edge-case analysis confirming it holds for the exact moduli and bit-widths in ML-KEM/ML-DSA NTT butterflies. This is load-bearing for the stated frequency, latency, and resource gains.
- [Abstract] Abstract: the abstract asserts performance improvements from the new representation and microarchitecture but supplies no quantitative metrics, error analysis, baseline comparisons, or derivation details, leaving the central claims unsupported by visible evidence.
Simulated Author's Rebuttal
We thank the referee for the careful review and constructive feedback. We address the two major comments on the abstract below. The supporting technical details, equations, and experimental evidence are contained in the body of the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that the novel redundant number representation eliminates conditional corrections for both Montgomery modulo multiplication and combined subtract-multiply operations is asserted without equations, proof sketch, simulation trace, or edge-case analysis confirming it holds for the exact moduli and bit-widths in ML-KEM/ML-DSA NTT butterflies. This is load-bearing for the stated frequency, latency, and resource gains.
Authors: The abstract is a concise summary. The redundant representation, its mathematical formulation, proof that conditional corrections are eliminated for the precise ML-KEM/ML-DSA moduli and bit-widths, simulation traces, and edge-case verification appear in Section III, with the absence of hidden costs or errors substantiated there. These details directly support the reported frequency and latency gains. revision: no
-
Referee: [Abstract] Abstract: the abstract asserts performance improvements from the new representation and microarchitecture but supplies no quantitative metrics, error analysis, baseline comparisons, or derivation details, leaving the central claims unsupported by visible evidence.
Authors: Quantitative metrics (clock frequency, execution time, resource counts), error analysis, baseline comparisons, and derivation details are provided in Section V together with tables and figures. The abstract is length-constrained and highlights contributions; all supporting evidence is visible in the main text. revision: no
Circularity Check
No circularity: design claims rest on novel representation and FPGA experiments, not self-referential fits or citations.
full rationale
The paper presents a hardware architecture proposal for NTT accelerators, introducing a redundant number representation and hierarchical multipliers, then reports FPGA synthesis and timing results. No equations reduce a claimed prediction to a fitted input by construction, no load-bearing uniqueness theorem is imported from self-citation, and no ansatz is smuggled via prior work. The central performance claims are supported by direct experimental measurements rather than any derivation that collapses to its own inputs. This is the expected non-finding for an applied microarchitecture paper whose validation is external to any internal parameter fitting.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math Standard properties of modular arithmetic, Montgomery multiplication, and NTT correctness hold for the redundant representation.
invented entities (1)
-
Unified redundant number representation for NTT butterfly units
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Ziegler, P
V. Ziegler, P. Schneider, H. Viswanathan, M. Montag, A. R. Satish Kanugovi, Security and trust in the 6G era, White Paper, Nokia Bell Labs, accessed: 03.05.2025 (2023). URLhttps://onestore.nokia.com/asset/210527
2025
-
[2]
P. W. Shor, Polynomial-time algorithms for prime factorization and dis- crete logarithms on a quantum computer, SIAM Journal on Computing 26 (5) (1997) 1484–1509
1997
- [3]
-
[4]
Fouque, J
P.-A. Fouque, J. Hoffstein, P. Kirchner, V. Lyubashevsky, T. Pornin, T. Prest, T. Ricosset, G. Seiler, W. Whyte, Z. Zhang, et al., Falcon: Fast-fourier lattice-based compact signatures over ntru, Submission to the NIST’s post-quantum cryptography standardization process 36 (5) (2018) 1–75. 31
2018
-
[5]
Hoffstein, J
J. Hoffstein, J. Pipher, J. H. Silverman, NTRU: A ring-based public key cryptosystem, in: J. P. Buhler (Ed.), Algorithmic Number Theory, Springer Berlin Heidelberg, Berlin, Heidelberg, 1998, pp. 267–288
1998
-
[6]
Schanck, P
R.Avanzi, J.Bos, L.Ducas, E.Kiltz, T.Lepoint, V.Lyubashevsky, J.M. Schanck, P. Schwabe, G. Seiler, D. Stehlé, et al., Crystals-kyber algo- rithm specifications and supporting documentation, NIST PQC Round 2 (4) (2019) 1–43
2019
-
[7]
Ducas, E
L. Ducas, E. Kiltz, T. Lepoint, V. Lyubashevsky, P. Schwabe, G. Seiler, D. Stehlé, Crystals-dilithium: A lattice-based digital signature scheme, IACR Transactions on Cryptographic Hardware and Embedded Systems 2018 (1) (2018) 238–268
2018
-
[8]
A. C. Mert, E. Karabulut, E. Öztürk, E. Savaş, A. Aysu, An extensive study of flexible design methods for the number theoretic transform, IEEE Transactions on Computers 71 (11) (2022) 2829–2843
2022
-
[9]
J. Mu, Y. Ren, W. Wang, Y. Hu, S. Chen, C.-H. Chang, J. Fan, J. Ye, Y. Cao, H. Li, X. Li, Scalable and conflict-free NTT hardware accel- erator design: Methodology, proof, and implementation, IEEE Trans- actions on Computer-Aided Design of Integrated Circuits and Systems 42 (5) (2023) 1504–1517
2023
-
[10]
Derya, A
K. Derya, A. C. Mert, E. Öztürk, E. Savaş, CoHA-NTT: A configurable hardware accelerator for NTT-based polynomial multiplication, Micro- processors and Microsystems 89 (2022) 104451
2022
-
[11]
Nguyen, B
T.-H. Nguyen, B. Kieu-Do-Nguyen, C.-K. Pham, T.-T. Hoang, High- speed NTT accelerator for crystal-kyber and crystal-dilithium, IEEE Access 12 (2024) 34918–34930
2024
-
[12]
Hirner, A
F. Hirner, A. C. Mert, S. S. Roy, Proteus: A pipelined NTT architecture generator, IEEE Trans. on Very Large Scale Integration (VLSI) Systems 32 (7) (2024) 1228–1238
2024
-
[13]
Liu, C.-Y
S.-H. Liu, C.-Y. Kuo, Y.-N. Mo, T. Su, An area-efficient, conflict-free, and configurable architecture for accelerating NTT/INTT, IEEE Trans- actions on Very Large Scale Integration (VLSI) Systems 32 (3) (2024) 519–529. 32
2024
-
[14]
Y. Kong, Optimizing the improved barrett modular multipliers for public-key cryptography, in: International Conference on Computa- tional Intelligence and Software Engineering, 2010, pp. 1–4
2010
-
[15]
Montgomery, Modular multiplication without trial division, Mathe- matics of Computation 44 (170) (1985) 519–521
P. Montgomery, Modular multiplication without trial division, Mathe- matics of Computation 44 (170) (1985) 519–521
1985
-
[16]
Harvey, Faster arithmetic for number-theoretic transforms, Journal of Symbolic Computation 60 (2014) 113–119
D. Harvey, Faster arithmetic for number-theoretic transforms, Journal of Symbolic Computation 60 (2014) 113–119
2014
-
[17]
Z. Ye, R. Song, H. Zhang, D. Chen, R. Cheung, K. Huang, A highly- efficient lattice-based post-quantum cryptography processor for iot ap- plications, IACR Transactions on Cryptographic Hardware and Embed- ded Systems 2024 (2) (2024) 130–153
2024
-
[18]
Cohen, A Course in Computational Algebraic Number Theory, Springer-Verlag, New York, 1993
H. Cohen, A Course in Computational Algebraic Number Theory, Springer-Verlag, New York, 1993
1993
-
[19]
A. E. Kaya, Post-quantum cryptography and NTT as a polynomial multiplication method, Ph.D. thesis, Middle East Technical University (2023)
2023
-
[20]
Heckbert, Fourier transforms and the fast fourier transform (FFT) algorithm, Computer Graphics 2 (1995) (1995) 15–463
P. Heckbert, Fourier transforms and the fast fourier transform (FFT) algorithm, Computer Graphics 2 (1995) (1995) 15–463
1995
-
[21]
Di Matteo, M
S. Di Matteo, M. L. Gerfo, S. Saponara, VLSI design and FPGA im- plementation of an NTT hardware accelerator for homomorphic seal- embedded library, IEEE Access 11 (2023) 72498–72508
2023
-
[22]
X. Chen, B. Yang, S. Yin, S. Wei, L. Liu, Cfntt: Scalable radix-2/4 NTT multiplication architecture with an efficient conflict-free memory mapping scheme, IACR Transactions on Cryptographic Hardware and Embedded Systems 2022 (1) (2021) 94–126
2022
-
[23]
Krieger, F
F. Krieger, F. Hirner, A. C. Mert, S. Sinha Roy, OpenNTT - an au- tomated toolchain for compiling high-performance NTT accelerators in FHE, in: 43rd IEEE/ACM International Conference on Computer- Aided Design, 2025, p. 1–9
2025
-
[24]
Garrido, A survey on pipelined FFT hardware architectures, Journal of Signal Processing Systems 94 (11) (2022) 1345–1364
M. Garrido, A survey on pipelined FFT hardware architectures, Journal of Signal Processing Systems 94 (11) (2022) 1345–1364. 33
2022
-
[25]
Alexakis, D
G. Alexakis, D. Schoinianakis, G. Dimitrakopoulos, High-performance pipelined NTT accelerators with homogeneous digit-serial modulo arith- metic, in: Euromicro Conference on Digital System Design (DSD), 2025, pp. 394–401
2025
-
[26]
Zhang, S
Y. Zhang, S. Wang, X. Zhang, J. Dong, X. Mao, F. Long, C. Wang, D. Zhou, M. Gao, G. Sun, PipeZK: Accelerating zero-knowledge proof with a pipelined architecture, in: ACM/IEEE Intern. Symp. on Com- puter Architecture (ISCA), 2021, pp. 416–428
2021
-
[27]
C. Wang, M. Gao, Sam: A scalable accelerator for number theoretic transform using multi-dimensional decomposition, in: IEEE/ACM In- ternational Conference on Computer Aided Design (ICCAD), 2023, pp. 1–9
2023
-
[28]
Y. Yang, S. R. Kuppannagari, R. Kannan, V. K. Prasanna, NTTGen: a framework for generating low latency NTT implementations on fpga, in: 19th ACM International Conference on Computing Frontiers, 2022, p. 30–39
2022
-
[29]
Kumarathunga, Q
D. Kumarathunga, Q. Hu, Z. Fang, AutoNTT: Automatic architecture design and exploration for number theoretic transform acceleration on fpgas, in: International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2025, pp. 1–9
2025
-
[30]
Z.Ni, A.Khalid, W.Liu, M.O’Neill, TowardsalightweightCRYSTALS- Kyber in FPGAs: an ultra-lightweight BRAM-free NTT core, in: IEEE International Symposium on Circuits and Systems (ISCAS), 2023, pp. 1–5
2023
-
[31]
D. Li, A. Pakala, K. Yang, MeNTT: A compact and efficient processing- in-memory number theoretic transform (NTT) accelerator, IEEE Trans- actions on Very Large Scale Integration (VLSI) Systems 30 (5) (2022) 579–588
2022
-
[32]
Q. Pu, X. Zhao, Montgomery exponentiation with no final comparisons: Improved results, in: 2009 Pacific-Asia Conference on Circuits, Commu- nications and Systems, 2009, pp. 614–616
2009
-
[33]
Zhang, Y
X. Zhang, Y. Wei, M. Li, J. Tian, Z. Wang, Hrcim-ntt: An efficient compute-in-memory NTT accelerator with hybrid-redundant numbers, 34 IEEE Transactions on Circuits and Systems I: Regular Papers 72 (1) (2025) 214–227
2025
-
[34]
Taghavi, R
B. Taghavi, R. Azarderakhsh, M. M. Kermani, KNightCore: An Ultra-Lightweight NTT-Based Polynomial Multiplier for ML-KEM on Resource-Constrained Platforms, in: Workshop on Quantum-Resistant Cryptography and Security, 2025, pp. 73–77
2025
-
[35]
Taghavi, R
B. Taghavi, R. Azarderakhsh, M. Mozaffari Kermani, LightNTT: A Tiny NTT/iNTT Core for ML-DSA Featuring a Constant-Geometry Pipelined Design, in: Lightweight Cryptography for Security and Pri- vacy, Springer Nature, 2026, pp. 57–76
2026
-
[36]
Bisheh-Niasar, R
M. Bisheh-Niasar, R. Azarderakhsh, M. Mozaffari-Kermani, High-speed NTT-based polynomial multiplication accelerator for post-quantum cryptography, in: IEEE Symposium on Computer Arithmetic (ARITH), 2021, pp. 94–101
2021
-
[37]
W. Guo, S. Li, Highly-efficient hardware architecture for crystals-kyber with a novel conflict-free memory access pattern, IEEE Transactions on Circuits and Systems I: Regular Papers 70 (11) (2023) 4505–4515
2023
-
[38]
Zhang, B
N. Zhang, B. Yang, C. Chen, S. Yin, S. Wei, L. Liu, Highly efficient architecture of newhope-nist on fpga using low-complexity NTT/INTT, IACR Transactions on Cryptographic Hardware and Embedded Systems 2020 (2) (2020) 49–72
2020
-
[39]
Alkim, R
E. Alkim, R. Avanzi, J. Bos, L. Ducas, A. de la Piedra, T. Pöppelmann, P. Schwabe, D. Stebila, Newhope algorithm specifications and support- ing documentation, Version 1 (2018) 1–44
2018
-
[40]
H. Yang, R. Chen, Q. Wang, Z. Wu, W. Peng, Hardware acceleration of NTT-based polynomial multiplication in crystals-kyber, in: C. Ge, M. Yung (Eds.), Information Security and Cryptology, Springer Nature Singapore, Singapore, 2024, pp. 111–129. 35
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.