pith. sign in

arxiv: 2607.00621 · v1 · pith:TOBLXARVnew · submitted 2026-07-01 · 💻 cs.AR · cs.CR

High-Performance NTT Accelerators for PQC leveraging Unified Redundant Arithmetic and Fine-Tuned Microarchitecture

Pith reviewed 2026-07-02 04:18 UTC · model grok-4.3

classification 💻 cs.AR cs.CR
keywords NTTPQCFPGAMontgomery multiplicationRedundant arithmeticButterfly unitLattice-based cryptographyHardware accelerator
0
0 comments X

The pith

A redundant number representation removes conditional corrections from NTT hardware for post-quantum cryptography.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops parallel iterative NTT and INTT accelerators for lattice-based schemes such as ML-KEM and ML-DSA by introducing unified butterfly units. A novel redundant number representation is presented that removes the need for conditional corrections during Montgomery modulo multiplication and combined subtract-multiply steps. Inverse-transform scaling is folded into the existing arithmetic hardware, and hierarchical Montgomery multipliers are mapped to FPGA DSP blocks. FPGA measurements are reported to show higher operating frequencies, shorter run times, and comparable hardware cost. These changes target the modular reduction overhead that limits current NTT accelerators in PQC applications.

Core claim

The paper claims that a novel redundant number representation eliminates conditional corrections for both Montgomery modulo multiplication and combined subtract-multiply operations, integrates inverse-transform scaling into existing arithmetic hardware, and supports hierarchical Montgomery multipliers that map efficiently onto FPGA DSP resources, resulting in higher clock frequencies, reduced execution times, and competitive resource utilization for NTT/INTT acceleration.

What carries the argument

Unified butterfly units built on a redundant number representation that removes conditional corrections and folds scaling into the arithmetic path, together with hierarchical Montgomery multipliers mapped to FPGA DSP blocks.

If this is right

  • NTT and INTT operations complete in fewer cycles at higher clock rates on FPGA.
  • No separate scaling hardware is required after the inverse transform.
  • Montgomery multipliers occupy fewer DSP blocks while sustaining higher frequencies.
  • The accelerators support both encryption and signature schemes in the same hardware.
  • Resource utilization remains competitive with prior designs while delivering speed gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same redundant representation might reduce power draw on FPGA or ASIC targets, though power numbers are not supplied.
  • The approach could be ported to other lattice-based primitives that rely on similar modular polynomial arithmetic.
  • If the representation generalizes, it may simplify software implementations on processors that already support redundant arithmetic.
  • Verification of the full design on multiple FPGA families would strengthen the claim that the gains are platform-independent.

Load-bearing premise

The redundant number representation and unified butterfly units can be realized on FPGA without correctness errors, extra latency, or hidden resource costs that cancel the claimed gains.

What would settle it

Implement the accelerator on an FPGA, run NTT and INTT on known test polynomials, and check whether outputs match reference results while measuring clock frequency and resource counts against the reported values.

Figures

Figures reproduced from arXiv: 2607.00621 by Dimitrios Schoinianakis, George Alexakis, Giorgos Dimitrakopoulos.

Figure 1
Figure 1. Figure 1: Fast NTT computation following the Fast-Fourier Transform recursive paradigm. [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Parallel NTT accelerator that computes NTT iteratively using an array of PEs [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The two main forms of pipelined NTT hardware accelerators. Each numbered [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Internal connectivity of a 4 × 2 PE array. 3. Proposed architecture The proposed approach is based on the parallel NTT accelerator archi￾tecture shown in [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: The unified butterfly structure from [38] serves as the PE for the proposed archi [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: The Unified butterfly unit structure proposed in [38] highlighting the conditional [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: The unified butterfly unit covering both NTT and INTT operating on an ex [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: The cascaded structure of modular addition, used in both NTT and INTT [PITH_FULL_IMAGE:figures/full_fig_p020_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: The proposed unified butterfly unit supports both NTT and INTT computations. [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Mapping Montgomery multiplication to DSP blocks for inputs of 17 bits as [PITH_FULL_IMAGE:figures/full_fig_p024_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Mapping Montgomery multiplication to DSP blocks for 34-bit inputs to support [PITH_FULL_IMAGE:figures/full_fig_p025_11.png] view at source ↗
read the original abstract

Post-quantum cryptography and privacy-preserving technologies are expected to play a central role in future secure communication systems. Lattice-based PQC schemes such as ML-KEM (CRYSTALS-Kyber) and ML-DSA (CRYSTALS-Dilithium) rely heavily on large-degree polynomial arithmetic, making the Number Theoretic Transform (NTT) a key computational primitive. Although existing hardware accelerators exploit parallelism and pipelining to support both NTT and INTT, their efficiency is often limited by the overhead of modular reduction and correction steps, inverse-transform scaling operations, and suboptimal FPGA implementations. This work addresses these limitations by proposing parallel iterative NTT/INTT accelerators based on optimized unified butterfly units. We introduce a novel redundant number representation that eliminates conditional corrections for both Montgomery modulo multiplication and combined subtract-multiply operations, and integrate inverse-transform scaling into existing arithmetic hardware to avoid dedicated scaling units. Furthermore, we design hierarchical Montgomery multipliers that map efficiently onto FPGA DSP resources, reducing hardware cost while enabling high operating frequencies. FPGA-based experimental results demonstrate higher clock frequencies, reduced execution times, and competitive resource utilization, supporting efficient NTT acceleration for PQC and related privacy-preserving applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes parallel iterative NTT/INTT accelerators for post-quantum cryptography (ML-KEM, ML-DSA) based on optimized unified butterfly units. It introduces a novel redundant number representation claimed to eliminate conditional corrections for Montgomery modulo multiplication and combined subtract-multiply operations, integrates inverse-transform scaling into existing arithmetic hardware, and designs hierarchical Montgomery multipliers for efficient FPGA DSP mapping. FPGA-based experiments are stated to show higher clock frequencies, reduced execution times, and competitive resource utilization.

Significance. If the redundant representation and microarchitecture deliver the claimed elimination of corrections without hidden costs or errors, and if the experimental results are substantiated, the work could improve efficiency of hardware accelerators for lattice-based PQC. The folding of scaling and DSP-efficient multiplier design are potentially useful techniques for reducing overhead in FPGA implementations of NTT.

major comments (2)
  1. [Abstract] Abstract: the claim that the novel redundant number representation eliminates conditional corrections for both Montgomery modulo multiplication and combined subtract-multiply operations is asserted without equations, proof sketch, simulation trace, or edge-case analysis confirming it holds for the exact moduli and bit-widths in ML-KEM/ML-DSA NTT butterflies. This is load-bearing for the stated frequency, latency, and resource gains.
  2. [Abstract] Abstract: the abstract asserts performance improvements from the new representation and microarchitecture but supplies no quantitative metrics, error analysis, baseline comparisons, or derivation details, leaving the central claims unsupported by visible evidence.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and constructive feedback. We address the two major comments on the abstract below. The supporting technical details, equations, and experimental evidence are contained in the body of the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the novel redundant number representation eliminates conditional corrections for both Montgomery modulo multiplication and combined subtract-multiply operations is asserted without equations, proof sketch, simulation trace, or edge-case analysis confirming it holds for the exact moduli and bit-widths in ML-KEM/ML-DSA NTT butterflies. This is load-bearing for the stated frequency, latency, and resource gains.

    Authors: The abstract is a concise summary. The redundant representation, its mathematical formulation, proof that conditional corrections are eliminated for the precise ML-KEM/ML-DSA moduli and bit-widths, simulation traces, and edge-case verification appear in Section III, with the absence of hidden costs or errors substantiated there. These details directly support the reported frequency and latency gains. revision: no

  2. Referee: [Abstract] Abstract: the abstract asserts performance improvements from the new representation and microarchitecture but supplies no quantitative metrics, error analysis, baseline comparisons, or derivation details, leaving the central claims unsupported by visible evidence.

    Authors: Quantitative metrics (clock frequency, execution time, resource counts), error analysis, baseline comparisons, and derivation details are provided in Section V together with tables and figures. The abstract is length-constrained and highlights contributions; all supporting evidence is visible in the main text. revision: no

Circularity Check

0 steps flagged

No circularity: design claims rest on novel representation and FPGA experiments, not self-referential fits or citations.

full rationale

The paper presents a hardware architecture proposal for NTT accelerators, introducing a redundant number representation and hierarchical multipliers, then reports FPGA synthesis and timing results. No equations reduce a claimed prediction to a fitted input by construction, no load-bearing uniqueness theorem is imported from self-citation, and no ansatz is smuggled via prior work. The central performance claims are supported by direct experimental measurements rather than any derivation that collapses to its own inputs. This is the expected non-finding for an applied microarchitecture paper whose validation is external to any internal parameter fitting.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the correctness and efficiency of the introduced redundant arithmetic representation when mapped to FPGA hardware; no numerical free parameters are mentioned, and the work relies on standard modular arithmetic properties.

axioms (1)
  • standard math Standard properties of modular arithmetic, Montgomery multiplication, and NTT correctness hold for the redundant representation.
    Invoked implicitly when claiming elimination of conditional corrections without loss of correctness.
invented entities (1)
  • Unified redundant number representation for NTT butterfly units no independent evidence
    purpose: Eliminates conditional corrections in Montgomery modulo multiplication and subtract-multiply operations while integrating scaling.
    New representation introduced to optimize the arithmetic hardware; no independent evidence outside the design is provided.

pith-pipeline@v0.9.1-grok · 5749 in / 1555 out tokens · 155728 ms · 2026-07-02T04:18:10.464745+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

40 extracted references · 1 canonical work pages

  1. [1]

    Ziegler, P

    V. Ziegler, P. Schneider, H. Viswanathan, M. Montag, A. R. Satish Kanugovi, Security and trust in the 6G era, White Paper, Nokia Bell Labs, accessed: 03.05.2025 (2023). URLhttps://onestore.nokia.com/asset/210527

  2. [2]

    P. W. Shor, Polynomial-time algorithms for prime factorization and dis- crete logarithms on a quantum computer, SIAM Journal on Computing 26 (5) (1997) 1484–1509

  3. [3]

    Liang, Y

    Z. Liang, Y. Zhao, Number theoretic transform and its applications in lattice-based cryptosystems: A survey, arXiv preprint arXiv:2211.13546 (2022)

  4. [4]

    Fouque, J

    P.-A. Fouque, J. Hoffstein, P. Kirchner, V. Lyubashevsky, T. Pornin, T. Prest, T. Ricosset, G. Seiler, W. Whyte, Z. Zhang, et al., Falcon: Fast-fourier lattice-based compact signatures over ntru, Submission to the NIST’s post-quantum cryptography standardization process 36 (5) (2018) 1–75. 31

  5. [5]

    Hoffstein, J

    J. Hoffstein, J. Pipher, J. H. Silverman, NTRU: A ring-based public key cryptosystem, in: J. P. Buhler (Ed.), Algorithmic Number Theory, Springer Berlin Heidelberg, Berlin, Heidelberg, 1998, pp. 267–288

  6. [6]

    Schanck, P

    R.Avanzi, J.Bos, L.Ducas, E.Kiltz, T.Lepoint, V.Lyubashevsky, J.M. Schanck, P. Schwabe, G. Seiler, D. Stehlé, et al., Crystals-kyber algo- rithm specifications and supporting documentation, NIST PQC Round 2 (4) (2019) 1–43

  7. [7]

    Ducas, E

    L. Ducas, E. Kiltz, T. Lepoint, V. Lyubashevsky, P. Schwabe, G. Seiler, D. Stehlé, Crystals-dilithium: A lattice-based digital signature scheme, IACR Transactions on Cryptographic Hardware and Embedded Systems 2018 (1) (2018) 238–268

  8. [8]

    A. C. Mert, E. Karabulut, E. Öztürk, E. Savaş, A. Aysu, An extensive study of flexible design methods for the number theoretic transform, IEEE Transactions on Computers 71 (11) (2022) 2829–2843

  9. [9]

    J. Mu, Y. Ren, W. Wang, Y. Hu, S. Chen, C.-H. Chang, J. Fan, J. Ye, Y. Cao, H. Li, X. Li, Scalable and conflict-free NTT hardware accel- erator design: Methodology, proof, and implementation, IEEE Trans- actions on Computer-Aided Design of Integrated Circuits and Systems 42 (5) (2023) 1504–1517

  10. [10]

    Derya, A

    K. Derya, A. C. Mert, E. Öztürk, E. Savaş, CoHA-NTT: A configurable hardware accelerator for NTT-based polynomial multiplication, Micro- processors and Microsystems 89 (2022) 104451

  11. [11]

    Nguyen, B

    T.-H. Nguyen, B. Kieu-Do-Nguyen, C.-K. Pham, T.-T. Hoang, High- speed NTT accelerator for crystal-kyber and crystal-dilithium, IEEE Access 12 (2024) 34918–34930

  12. [12]

    Hirner, A

    F. Hirner, A. C. Mert, S. S. Roy, Proteus: A pipelined NTT architecture generator, IEEE Trans. on Very Large Scale Integration (VLSI) Systems 32 (7) (2024) 1228–1238

  13. [13]

    Liu, C.-Y

    S.-H. Liu, C.-Y. Kuo, Y.-N. Mo, T. Su, An area-efficient, conflict-free, and configurable architecture for accelerating NTT/INTT, IEEE Trans- actions on Very Large Scale Integration (VLSI) Systems 32 (3) (2024) 519–529. 32

  14. [14]

    Y. Kong, Optimizing the improved barrett modular multipliers for public-key cryptography, in: International Conference on Computa- tional Intelligence and Software Engineering, 2010, pp. 1–4

  15. [15]

    Montgomery, Modular multiplication without trial division, Mathe- matics of Computation 44 (170) (1985) 519–521

    P. Montgomery, Modular multiplication without trial division, Mathe- matics of Computation 44 (170) (1985) 519–521

  16. [16]

    Harvey, Faster arithmetic for number-theoretic transforms, Journal of Symbolic Computation 60 (2014) 113–119

    D. Harvey, Faster arithmetic for number-theoretic transforms, Journal of Symbolic Computation 60 (2014) 113–119

  17. [17]

    Z. Ye, R. Song, H. Zhang, D. Chen, R. Cheung, K. Huang, A highly- efficient lattice-based post-quantum cryptography processor for iot ap- plications, IACR Transactions on Cryptographic Hardware and Embed- ded Systems 2024 (2) (2024) 130–153

  18. [18]

    Cohen, A Course in Computational Algebraic Number Theory, Springer-Verlag, New York, 1993

    H. Cohen, A Course in Computational Algebraic Number Theory, Springer-Verlag, New York, 1993

  19. [19]

    A. E. Kaya, Post-quantum cryptography and NTT as a polynomial multiplication method, Ph.D. thesis, Middle East Technical University (2023)

  20. [20]

    Heckbert, Fourier transforms and the fast fourier transform (FFT) algorithm, Computer Graphics 2 (1995) (1995) 15–463

    P. Heckbert, Fourier transforms and the fast fourier transform (FFT) algorithm, Computer Graphics 2 (1995) (1995) 15–463

  21. [21]

    Di Matteo, M

    S. Di Matteo, M. L. Gerfo, S. Saponara, VLSI design and FPGA im- plementation of an NTT hardware accelerator for homomorphic seal- embedded library, IEEE Access 11 (2023) 72498–72508

  22. [22]

    X. Chen, B. Yang, S. Yin, S. Wei, L. Liu, Cfntt: Scalable radix-2/4 NTT multiplication architecture with an efficient conflict-free memory mapping scheme, IACR Transactions on Cryptographic Hardware and Embedded Systems 2022 (1) (2021) 94–126

  23. [23]

    Krieger, F

    F. Krieger, F. Hirner, A. C. Mert, S. Sinha Roy, OpenNTT - an au- tomated toolchain for compiling high-performance NTT accelerators in FHE, in: 43rd IEEE/ACM International Conference on Computer- Aided Design, 2025, p. 1–9

  24. [24]

    Garrido, A survey on pipelined FFT hardware architectures, Journal of Signal Processing Systems 94 (11) (2022) 1345–1364

    M. Garrido, A survey on pipelined FFT hardware architectures, Journal of Signal Processing Systems 94 (11) (2022) 1345–1364. 33

  25. [25]

    Alexakis, D

    G. Alexakis, D. Schoinianakis, G. Dimitrakopoulos, High-performance pipelined NTT accelerators with homogeneous digit-serial modulo arith- metic, in: Euromicro Conference on Digital System Design (DSD), 2025, pp. 394–401

  26. [26]

    Zhang, S

    Y. Zhang, S. Wang, X. Zhang, J. Dong, X. Mao, F. Long, C. Wang, D. Zhou, M. Gao, G. Sun, PipeZK: Accelerating zero-knowledge proof with a pipelined architecture, in: ACM/IEEE Intern. Symp. on Com- puter Architecture (ISCA), 2021, pp. 416–428

  27. [27]

    C. Wang, M. Gao, Sam: A scalable accelerator for number theoretic transform using multi-dimensional decomposition, in: IEEE/ACM In- ternational Conference on Computer Aided Design (ICCAD), 2023, pp. 1–9

  28. [28]

    Y. Yang, S. R. Kuppannagari, R. Kannan, V. K. Prasanna, NTTGen: a framework for generating low latency NTT implementations on fpga, in: 19th ACM International Conference on Computing Frontiers, 2022, p. 30–39

  29. [29]

    Kumarathunga, Q

    D. Kumarathunga, Q. Hu, Z. Fang, AutoNTT: Automatic architecture design and exploration for number theoretic transform acceleration on fpgas, in: International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2025, pp. 1–9

  30. [30]

    Z.Ni, A.Khalid, W.Liu, M.O’Neill, TowardsalightweightCRYSTALS- Kyber in FPGAs: an ultra-lightweight BRAM-free NTT core, in: IEEE International Symposium on Circuits and Systems (ISCAS), 2023, pp. 1–5

  31. [31]

    D. Li, A. Pakala, K. Yang, MeNTT: A compact and efficient processing- in-memory number theoretic transform (NTT) accelerator, IEEE Trans- actions on Very Large Scale Integration (VLSI) Systems 30 (5) (2022) 579–588

  32. [32]

    Q. Pu, X. Zhao, Montgomery exponentiation with no final comparisons: Improved results, in: 2009 Pacific-Asia Conference on Circuits, Commu- nications and Systems, 2009, pp. 614–616

  33. [33]

    Zhang, Y

    X. Zhang, Y. Wei, M. Li, J. Tian, Z. Wang, Hrcim-ntt: An efficient compute-in-memory NTT accelerator with hybrid-redundant numbers, 34 IEEE Transactions on Circuits and Systems I: Regular Papers 72 (1) (2025) 214–227

  34. [34]

    Taghavi, R

    B. Taghavi, R. Azarderakhsh, M. M. Kermani, KNightCore: An Ultra-Lightweight NTT-Based Polynomial Multiplier for ML-KEM on Resource-Constrained Platforms, in: Workshop on Quantum-Resistant Cryptography and Security, 2025, pp. 73–77

  35. [35]

    Taghavi, R

    B. Taghavi, R. Azarderakhsh, M. Mozaffari Kermani, LightNTT: A Tiny NTT/iNTT Core for ML-DSA Featuring a Constant-Geometry Pipelined Design, in: Lightweight Cryptography for Security and Pri- vacy, Springer Nature, 2026, pp. 57–76

  36. [36]

    Bisheh-Niasar, R

    M. Bisheh-Niasar, R. Azarderakhsh, M. Mozaffari-Kermani, High-speed NTT-based polynomial multiplication accelerator for post-quantum cryptography, in: IEEE Symposium on Computer Arithmetic (ARITH), 2021, pp. 94–101

  37. [37]

    W. Guo, S. Li, Highly-efficient hardware architecture for crystals-kyber with a novel conflict-free memory access pattern, IEEE Transactions on Circuits and Systems I: Regular Papers 70 (11) (2023) 4505–4515

  38. [38]

    Zhang, B

    N. Zhang, B. Yang, C. Chen, S. Yin, S. Wei, L. Liu, Highly efficient architecture of newhope-nist on fpga using low-complexity NTT/INTT, IACR Transactions on Cryptographic Hardware and Embedded Systems 2020 (2) (2020) 49–72

  39. [39]

    Alkim, R

    E. Alkim, R. Avanzi, J. Bos, L. Ducas, A. de la Piedra, T. Pöppelmann, P. Schwabe, D. Stebila, Newhope algorithm specifications and support- ing documentation, Version 1 (2018) 1–44

  40. [40]

    H. Yang, R. Chen, Q. Wang, Z. Wu, W. Peng, Hardware acceleration of NTT-based polynomial multiplication in crystals-kyber, in: C. Ge, M. Yung (Eds.), Information Security and Cryptology, Springer Nature Singapore, Singapore, 2024, pp. 111–129. 35