cs.AR — Pith

cs.AR 2026-07-03

p-MEM samples Gaussians at full memory bandwidth

by Likai Pei, Jiahao Zheng +10 more

Probabilistic Memory for Trustworthy Edge Intelligence

Storing distribution parameters cuts sampling latency by hundreds of times and energy by up to 295x in Bayesian network workloads on CPU and

Probabilistic computation plays an important role in trustworthy edge intelligence to quantify uncertainty, enhance robustness, reconstruct data, and protect privacy, but its adoption is limited by the orders-of-magnitude data throughput gap between Gaussian random number generation (GRNG) and computation, as well as instruction overhead. This paper introduces probabilistic memory (p-MEM), a unified memory primitive that stores distribution parameters, such as mean and standard deviation, and samples directly at the native memory bandwidth, where deterministic data becomes the zero-variance special case. Using a layout-validated p-MEM simulator, we comprehensively explore device choices, memory specifications, and technology nodes, showing that p-MEM can achieve more than 1000 GSa/s/mm^2 GRNG throughput, including memory-array access. Integrated into CPU/GPU systems, p-MEM reduces instruction count by up to 2.19x/4.37x, sampling latency by 562x/3.45x, and energy by 295.5x/3.53x for Bayesian neural network workloads, providing a scalable hardware substrate for trustworthy probabilistic AI.
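The idea of a memory word that holds distribution parameters and returns a sample on each read can be sketched in software as follows. This is only an illustrative model: the `PCell` class and `read` function are hypothetical names, and the sketch uses Python's software Gaussian generator, not the in-memory sampling hardware the abstract describes.

```python
import random
from dataclasses import dataclass

@dataclass
class PCell:
    """Hypothetical probabilistic memory word: a mean and a standard deviation."""
    mean: float
    std: float = 0.0  # std == 0 models ordinary deterministic data

def read(cell: PCell, rng: random.Random) -> float:
    """Each read returns one sample drawn from N(mean, std^2)."""
    if cell.std == 0.0:
        return cell.mean          # zero-variance special case: a plain load
    return rng.gauss(cell.mean, cell.std)

rng = random.Random(0)            # fixed seed so the sketch is repeatable
weights = [PCell(0.5, 0.1), PCell(-1.0)]   # one Bayesian weight, one fixed weight
samples = [[read(c, rng) for c in weights] for _ in range(1000)]
mean_first = sum(s[0] for s in samples) / len(samples)
```

Reading the second cell always returns -1.0, while repeated reads of the first cell scatter around 0.5, which is the behavior a Bayesian neural network layer needs when it resamples its weights on every forward pass.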

physics.ins-det 2026-07-03

Framework unifies multi-FPGA hardware and software for smart TDAQ

by Roberto Ammendola, Andrea Biagioni +13 more

APEIRON: composing smart TDAQ systems for high energy physics experiments

APEIRON covers device drivers to HLS dataflow models for real-time particle physics triggers like NA62.

We present APEIRON, a distributed heterogeneous processing framework comprising both hardware architecture and software stack for multi-FPGA systems. Targeting smart trigger and data acquisition (TDAQ) systems in high energy physics, APEIRON spans the full software hierarchy: from low-level device drivers to a high-level dataflow programming model based on High-Level Synthesis. We describe the framework design, its core communication infrastructure, and a particle identification application for the NA62 experiment as a representative physics use case.

cond-mat.mes-hall 2026-07-03

Delay-line machine solves 2048-spin Ising problems

by Venkatesh Vadde, Roman Ovcharov +4 more

A 2048-spin bulk acoustic wave Ising machine for number partitioning and Sudoku

Bulk acoustic waves deliver all-to-all connectivity and four orders higher thermal stability than optical coherent Ising machines.

Optical coherent Ising machines based on time-multiplexing have demonstrated significant progress in terms of connectivity and spin scalability. However, they are constrained by large physical footprints, high power consumption, poor thermal stability, and high cost. Here, we present a time-multiplexed Ising machine leveraging propagating wave packets in solid-state delay lines at microwave frequencies, enabling thermally stable, robust, low-power, tabletop, and affordable design. We use two serially connected 20.5 MHz, 707 $\mu$s bulk acoustic wave delay lines supporting 2,048 spins. Our design provides all-to-all connectivity with 15-bit coupling resolution and finds approximate MAX-CUT solutions in 341 ms, potentially scalable to sub-ms by using higher frequency delay lines. Additionally, we demonstrate solutions to number partitioning and Sudoku problems. Compared with state-of-the-art coherent Ising machines, our machine exhibits four orders of magnitude higher thermal stability. Against the simulated bifurcation algorithm, our design achieves comparable results on the MAX-CUT problem, while outperforming it on the more complex number-partitioning and Sudoku problems.
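To see why number partitioning needs all-to-all coupling, it helps to write the problem as an Ising energy. The sketch below uses the common textbook mapping $H(s) = (\sum_i a_i s_i)^2$, which expands into couplings $J_{ij} = a_i a_j$ between every pair of spins. Whether the paper uses exactly this encoding is an assumption, and the exhaustive search stands in for the physical machine.

```python
from itertools import product

def ising_energy(numbers, spins):
    """H(s) = (sum_i a_i * s_i)^2, so every pair (i, j) is coupled by a_i * a_j."""
    return sum(a * s for a, s in zip(numbers, spins)) ** 2

def best_partition(numbers):
    """Exhaustive ground-state search; only feasible for a handful of spins."""
    return min(product((-1, 1), repeat=len(numbers)),
               key=lambda spins: ising_energy(numbers, spins))

numbers = [4, 5, 6, 7, 8]          # illustrative instance, not from the paper
spins = best_partition(numbers)
left = [a for a, s in zip(numbers, spins) if s == 1]
right = [a for a, s in zip(numbers, spins) if s == -1]
```

A ground state with zero energy corresponds to a perfect split; here the two subsets each sum to 15.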

cs.AR 2026-07-03

16-segment approx keeps ViT softmax accurate to 0.2% on FPGA

by Muhammad Usman, Shujaat Khan +1 more

Approximate Attention Weighting for Sustainable FPGA-Based Vision Transformer Inference

The BRAM-free unit uses only LUTs and maintains pre-trained model behavior without recalibration.

Vision Transformers have reshaped computer vision by using self-attention to capture global context across image regions. This makes them attractive for edge visual inspection and monitoring in applications such as renewable-energy infrastructure, industrial quality control, medical imaging, and autonomous-system sensing. However, deploying ViTs on small FPGAs remains challenging because the softmax stage in self-attention requires exponential evaluation and normalization, which are costly in hardware. Existing implementations often rely on CORDIC pipelines or BRAM-based look-up tables, increasing area and power consumption. This paper presents a BRAM-free approximate attention-weighting unit for FPGA-based ViT inference. The proposed design approximates the natural exponential in softmax using a 16-segment piecewise-linear function implemented entirely with distributed LUT fabric. Unlike base-2 approximations, the natural-exponential formulation preserves the pre-trained attention temperature and avoids model-specific recalibration. Implemented on a Xilinx Zynq-7020, the complete attention-row core uses 1444 LUTs, 77 DSPs, and no BRAM, while hardware-accurate emulation shows accuracy within a $0.20\%$ absolute top-1 difference from the exact-softmax reference on ViT-family models. These results demonstrate the potential of the proposed core for energy-efficient ViT inference on resource-constrained edge-AI platforms.
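A 16-segment piecewise-linear approximation of the natural exponential can be sketched as below. The input range of [-8, 0] (scores after subtracting the row maximum), the uniform segment grid, and the chord-based coefficients are all assumptions made for illustration; the abstract does not specify the paper's segment boundaries or fixed-point format.

```python
import math

LO, HI, SEGMENTS = -8.0, 0.0, 16   # assumed input range and uniform segment grid
STEP = (HI - LO) / SEGMENTS

# Per-segment (slope, intercept) from the chord through exp() at the segment ends.
TABLE = []
for k in range(SEGMENTS):
    x0, x1 = LO + k * STEP, LO + (k + 1) * STEP
    slope = (math.exp(x1) - math.exp(x0)) / STEP
    TABLE.append((slope, math.exp(x0) - slope * x0))

def pwl_exp(x):
    """Approximate exp(x); inputs outside [LO, HI] are clamped to the range."""
    x = min(max(x, LO), HI)
    k = min(int((x - LO) / STEP), SEGMENTS - 1)
    slope, intercept = TABLE[k]
    return slope * x + intercept

def approx_softmax(scores):
    """Softmax over one attention row using the piecewise-linear exponential."""
    m = max(scores)                       # shift so every argument is <= 0
    exps = [pwl_exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

grid = [LO + i * (HI - LO) / 4000 for i in range(4001)]
max_abs_error = max(abs(pwl_exp(x) - math.exp(x)) for x in grid)
```

Because the function keeps base $e$ instead of switching to base 2, the scores need no rescaling before the lookup, which is the property the abstract credits for avoiding recalibration.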

cs.AR 2026-07-03

3D stacking isolates KV-cache traffic to cut LLM serving latency

by Jaehun Lee, In-Jun Jung +1 more

3DLS: A 3D Logic-Stacked Architecture for Disaggregated LLM Serving

Vertical routes for transfers reduce decode-path contention, yielding up to 1.49x throughput and 60 percent lower end-to-end latency versus

Large language model (LLM) serving increasingly combines prefill-decode (PD) disaggregation with tensor parallelism (TP) to support large models and long contexts. In conventional 2D/2.5D chiplet architectures, layer-wise prefill-to-decode KV-cache transfers and decode-side TP collectives share the same lateral die-to-die (D2D) interconnect, creating mixed-traffic contention on the decode critical path. This contention increases communication latency, prolongs token generation intervals, and degrades end-to-end serving performance. We propose 3DLS, a logic-on-logic 3D-stacked chiplet architecture that separates traffic classes by routing KV-cache transfers through vertical interconnects while preserving decode-side TP collectives on the lateral D2D fabric. 3DLS achieves up to 1.49$\times$ throughput and 60.2\% lower end-to-end (E2E) latency over the shared-fabric planar baseline, and still achieves up to 1.17$\times$ throughput and 31.4\% lower E2E latency over a workload-aware priority-managed planar baseline. These results highlight that physical isolation is an important design principle for future chiplet-based PD-disaggregated LLM serving systems.

cs.AR 2026-07-03

Single LUT design handles both FP8-INT4 and FP8-FP8 GEMM

by Weiyu Zhou, Chen Ding +8 more

MxGLUT: A Reconfigurable LUT-Centric Broadcast Dataflow Accelerator for Mixed-Precision GEMM

Cuts multiplier area 57 percent and prefill latency up to 2.16 times on Llama models at under 2 percent perplexity cost.

Large language model (LLM) inference suffers from growing inefficiency across the prefill and decode phases, especially under weight-only quantization, where activations remain in FP8 while weights are compressed to low-bit integers. Existing LUT-based accelerators mainly target FP8-INT4 computation and still rely on separate floating-point (FP) datapaths for attention GEMM operations, leading to redundant hardware and non-unified mixed-precision execution. Moreover, their static dataflows are poorly matched to the distinct prefill and decode phases. To address these challenges, we propose MxGLUT, a reconfigurable LUT-centric broadcast (RLB) dataflow accelerator built on mixed-precision LUT-based processing elements (MxLPEs). Guided by a unified LUT-based execution framework, MxGLUT organizes both FP8-INT4 and FP8-FP8 GEMMs under a single LUT-based compute mechanism without dedicated FP multipliers or additional FP datapaths, and further adopts the RLB dataflow that localizes heavy partial-sum accumulation during the prefill phase and exploits weight reuse in the decode phase. Synthesized in UMC $28\,\mathrm{nm}$ CMOS at $200~\mathrm{MHz}$, MxGLUT reduces multiplier area by up to $56.92\%$ and power by up to $77.07\%$ and $78.35\%$ in FP8-INT4 and FP8-FP8 modes, respectively. At the accelerator level, MxGLUT achieves an area efficiency of $0.492~\mathrm{TFLOPS/mm^2}$ and an energy efficiency of $11.58~\mathrm{TFLOPS/W}$, while adding native FP8-FP8 support incurs only $2.57\%$ and $3.34\%$ reductions in area and energy efficiency, respectively, relative to the FP8-INT4-only FIGLUT baseline. Across the Llama family, MxGLUT achieves up to $2.16\times$ and $1.49\times$ latency speedup, and reduces normalized energy to $0.44\times$ and $0.71\times$ in prefill and decode, respectively, with at most $1.70\%$ perplexity increase.

cs.CR 2026-07-02

Tampered cell libraries mask hardware Trojans from chip designers

by Harish Kumar Dharavath, Md Muhtasim Alam Chowdhury +2 more

LIB-TRAP: Standard Cell Library Hardware Trojan Risk Assessment and Prevention

A foundry can swap deactivated Trojan cells for active ones during fabrication, shown on AES-128 and other benchmarks in 32nm and 130nm tech

Vulnerabilities inherent to the fabless semiconductor manufacturing model have significantly increased the risk of malicious Hardware Trojan (HT) insertion, posing severe threats to hardware security. Several HT mitigation and detection strategies have been developed, and existing works explore the insertion of HTs in the space between standard cells in an integrated circuit. However, there is a lack of research into the vulnerabilities posed by the building blocks of most digital designs on the market today, the standard cells. This work investigates a novel threat model in which standard cells are considered untrusted. Our proposed threat model provides the design house with a tampered standard cell library. The intended netlist is synthesized and implemented using the tampered library. During fabrication, a nefarious foundry replaces the library's deactivated HT cells with activated counterparts. Using open-source and industry-standard Electronic Design Automation (EDA) tools, existing standard cell libraries, Saed32nm and Sky130nm, are converted into malicious libraries capable of masking the presence of arbitrary HTs from IC designers. The malicious library is then applied and characterized in multiple standard benchmark designs. To demonstrate the efficacy and stealthiness of this standard cell-based attack vector, three benchmark circuits, an AES-128 encryption core, an Ethernet controller, and a WISHBONE DMA engine, were synthesized using both clean and Trojan-infected libraries across Synopsys 32nm and SkyWater 130nm technologies. Design-level features, including total cell count, total area, dynamic power consumption, and static power, were extracted from these synthesized circuits to serve as inputs for binary classification

cs.AR 2026-07-02

Preemptive VCs cut link resources by 76% in AXI NoCs

by Lorenzo Leone, Luca Colagrande +1 more

Physically-Aware Preemptive Virtual Channels for Deadlock-Free AXI Networks-on-Chip

Matches multiplane frequency at 3% area cost by separating read-write traffic without duplicating links.

As many-core Systems-on-Chip (SoCs) continue to scale, Networks-on-Chip (NoCs) must sustain increasingly high memory bandwidth while preserving deadlock freedom. In AXI4 systems, protocol-level dependencies between read and write traffic can create circular waits at the network endpoints, even when the routing algorithm itself is deadlock-free. Decoupling these traffic classes avoids such dependencies, but exposes a key implementation trade-off: multiplane NoCs duplicate wide physical links and increase routing pressure, whereas conventional Virtual Channel (VC) routers add substantial control complexity, area, and timing overhead. This work revisits this trade-off for modern wide-link NoCs. We evaluate four deadlock-free AXI4 traffic-class separation schemes: a multiplane baseline and three lightweight VC-based designs. Among these designs, we propose Preemptive VCs, a physically-aware architecture that can save up to 76% of link resources with comparable frequency and only 3% router area overhead relative to the multiplane design.

cs.AR 2026-07-02

Portable SDR records geotagged IQ data in foliage

by Lawrence Obiuwevwi, Krzysztof J. Rechowicz +6 more

Field-Deployable RF Capture System for Indoor, Outdoor, and Foliage Environments

Battery-powered platform sustains 20 Msps writes with location metadata for real-world spectrum studies

Reliable and reproducible radio-frequency (RF) measurements in real-world environments are essential for characterizing spectrum behavior across unlicensed ISM and WiFi bands, licensed mid-band allocations, and emerging next-generation wireless deployments. Existing measurement platforms are often laboratory-grade, cost-prohibitive, or dependent on fixed infrastructure, limiting their practicality for rapid, distributed, or long-duration field campaigns. This paper presents a compact, battery-powered RF capture system integrating a HackRF One software-defined radio, Raspberry Pi 5, GNSS receiver, regulated battery supply, and high-speed solid-state storage. The platform records continuous IQ data at up to 20 Msps in SigMF format with per-segment location and timing metadata for reproducible spectrum analysis. Field experiments at 2.45 GHz in dense foliage, urban outdoor, and indoor office environments reveal distinct propagation signatures. Foliage measurements remain near the noise floor at -76 to -82 dBFS with limited spectral structure, consistent with strong canopy attenuation. Urban measurements show multipath activity across a 30 dB dynamic range, overlapping WiFi channels, and frequent ISM-band interference. Indoor measurements show dominant WiFi channels, an estimated 20 to 25 dB building entry loss relative to outdoor conditions, and an 8 to 10 dB higher interference floor caused by structural reflections. The system sustained 75 to 85 MB/s write throughput with no dropped samples or buffer underruns, while GNSS synchronization remained below one second with meter-level positioning. These results show that a portable, cost-effective SDR platform can produce high-fidelity, geotagged IQ datasets for spectrum characterization, interference analysis, radio environment mapping, and environment-aware wireless research.

quant-ph 2026-07-02

Compound pulses shorten trapped-ion schedules for H2 simulation

by Ria Patel, Masoud Hakimi Heris +2 more

Synthesizing Compound Pulse Gadgets for Hamiltonian Simulation on Trapped-Ion Platforms

GRAPE-optimized gadgets cut total duration by skipping gate stitching in QSVT time-evolution circuits.

Standard gate-level transpilation introduces significant physical noise and overhead for high-precision quantum algorithms, such as the Quantum Singular Value Transformation (QSVT), on near-term trapped-ion hardware. Current compilers treat quantum operations as discrete units, forcing the physical control layer to execute highly fragmented laser pulses. To address this hardware-software disconnect, this work introduces a holistic pulse synthesis strategy that bypasses discrete gate-stitching to compile algorithms directly into continuous compound pulse gadgets. As a proof-of-concept, we target Hamiltonian simulation of the $H_2$ molecule, block-encoding the problem into a QSVT circuit to approximate the time-evolution operator $U = e^{-i H t}$ across 3 computational ions (2 system, 1 ancilla). We utilize the Gradient Ascent Pulse Engineering (GRAPE) algorithm to generate these compound gadgets and evaluate our methodology using noisy Lindblad master equation simulations. Preliminary observations indicate that the proposed strategy achieves significant temporal compression, reducing the total pulse schedule duration compared to standard compilers. Furthermore, synthesizing operations holistically eliminates the control-layer latency associated with discrete pulse lookup overhead. By streamlining the physical control schedule, this methodology offers a promising pathway to execute operations faster, highlighting the potential for compound gadgets to increase the computational depth achievable within fundamental $T_2$ decoherence limits.

cs.AR 2026-07-02

Redundant arithmetic removes corrections from NTT hardware

by George Alexakis, Dimitrios Schoinianakis +1 more

High-Performance NTT Accelerators for PQC leveraging Unified Redundant Arithmetic and Fine-Tuned Microarchitecture

New representation eliminates conditional steps in Montgomery operations and folds scaling into butterfly units for faster FPGA results.

Post-quantum cryptography and privacy-preserving technologies are expected to play a central role in future secure communication systems. Lattice-based PQC schemes such as ML-KEM (CRYSTALS-Kyber) and ML-DSA (CRYSTALS-Dilithium) rely heavily on large-degree polynomial arithmetic, making the Number Theoretic Transform (NTT) a key computational primitive. Although existing hardware accelerators exploit parallelism and pipelining to support both NTT and INTT, their efficiency is often limited by the overhead of modular reduction and correction steps, inverse-transform scaling operations, and suboptimal FPGA implementations. This work addresses these limitations by proposing parallel iterative NTT/INTT accelerators based on optimized unified butterfly units. We introduce a novel redundant number representation that eliminates conditional corrections for both Montgomery modulo multiplication and combined subtract-multiply operations, and integrate inverse-transform scaling into existing arithmetic hardware to avoid dedicated scaling units. Furthermore, we design hierarchical Montgomery multipliers that map efficiently onto FPGA DSP resources, reducing hardware cost while enabling high operating frequencies. FPGA-based experimental results demonstrate higher clock frequencies, reduced execution times, and competitive resource utilization, supporting efficient NTT acceleration for PQC and related privacy-preserving applications.
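For context, the conditional correction that the abstract says is eliminated appears in the conventional Montgomery reduction sketched below. This is the standard textbook algorithm, not the authors' redundant-representation design; the modulus 3329 is the ML-KEM prime, while the choice $R = 2^{16}$ and the helper names are assumptions for illustration.

```python
Q = 3329                            # ML-KEM modulus
R_BITS = 16                         # assumed Montgomery radix R = 2^16
R = 1 << R_BITS
Q_NEG_INV = (-pow(Q, -1, R)) % R    # -Q^{-1} mod R (Python 3.8+)

def montgomery_reduce(t):
    """Return t * R^{-1} mod Q for 0 <= t < Q * R."""
    m = (t * Q_NEG_INV) & (R - 1)
    u = (t + m * Q) >> R_BITS       # exact: t + m*Q is a multiple of R
    if u >= Q:                      # conditional correction (u is below 2*Q)
        u -= Q
    return u

def to_mont(a):
    """Map a into the Montgomery domain: a * R mod Q."""
    return (a * R) % Q

def mont_mul(a_m, b_m):
    """Multiply two Montgomery-domain values, staying in the domain."""
    return montgomery_reduce(a_m * b_m)

a, b = 1234, 3000
product_mod_q = montgomery_reduce(mont_mul(to_mont(a), to_mont(b)))
```

The `if u >= Q` branch is the data-dependent step; in hardware it costs a comparator and a subtractor on the critical path, which is the overhead a correction-free representation would avoid.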

cs.AR 2026-07-01

FPGA accelerator speeds ViT layers up to 2.74x on edge boards

by Hubert Dymarkowski, Xingjian Fu +3 more

FlexViT: A Flexible FPGA-based Accelerator for Edge Vision Transformers

One reconfigurable GEMM engine plus depth-first tiling handles mixed fully-connected and convolutional layers without extra memory traffic.

Deploying Vision Transformer (ViT) models on edge platforms remains challenging due to their high computational demands and the architectural heterogeneity of modern hybrid ViT models, which incorporate both fully connected and convolutional layers. This heterogeneity leads to significant variation in tensor shapes, requiring flexible and efficient FPGA-based acceleration. In this paper, we present FlexViT, a reconfigurable FPGA accelerator for efficient ViT inference on resource-constrained edge devices. Built on the SECDA-TFLite framework, FlexViT employs a hardware-software co-design approach that maps both fully connected and convolutional layers onto a unified high-throughput INT8 GEMM engine using a runtime im2col transformation. To efficiently support diverse layer configurations, we propose a dual-mode dataflow that dynamically switches between input and weight reuse by reconfiguring the compute array at runtime. We further introduce a depth-first tiling strategy that completes accumulation in a single pass, eliminating off-chip partial-sum transfers and reducing memory bandwidth requirements. We implement FlexViT on a PYNQ-Z2 FPGA and evaluate it across a representative set of ViT models. FlexViT achieves up to 2.74x speedup on accelerator-executed layers, translating into up to 1.40x end-to-end speedup compared to CPU-only execution. The code is available at: https://github.com/gicLAB/FlexViT
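The im2col transformation that lets one GEMM engine serve both layer types can be sketched in plain Python as follows. The sketch assumes a single input channel, stride 1, and no padding, and the function names are illustrative; it does not reflect FlexViT's INT8 datapath or tiling.

```python
def im2col(image, k):
    """Unroll every k x k window of a 2-D image into one row of a matrix."""
    h, w = len(image), len(image[0])
    return [[image[r + i][c + j] for i in range(k) for j in range(k)]
            for r in range(h - k + 1) for c in range(w - k + 1)]

def matmul(a, b):
    """Plain matrix product, standing in for the GEMM engine."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def conv2d_via_gemm(image, kernels):
    """kernels: list of k x k filters. Returns one output column per filter."""
    k = len(kernels[0])
    cols = im2col(image, k)
    # Weight matrix: one row per window position (i, j), one column per filter.
    weights = [[f[i][j] for f in kernels] for i in range(k) for j in range(k)]
    return matmul(cols, weights)

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
diag = [[1, 0],
        [0, 1]]
out = conv2d_via_gemm(image, [diag])
```

Once the windows are unrolled, a convolution with several filters is a single matrix product, so a fully connected layer and a convolutional layer differ only in how their input matrix is built.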

quant-ph 2026-07-01

Buffer-relay fabric eliminates data-atom moves in neutral-atom chips

by Chen Huang, Jingbo Wang +6 more

Lazy-Move Compilation for Neutral-Atom Quantum Computers via a Buffer-Relay Fabric

Static backbone in dual-species arrays delivers 10x fidelity and 500x speed gains over prior compilers while cutting transport events to zer

Neutral atom quantum computing offers strong scalability and flexible qubit connectivity, but most existing compilation flows rely on reconfigurable atom arrays that physically shuttle qubit atoms during execution. Although this approach improves connectivity, it also introduces handoff errors, motional heating, and atom-loss risks that can degrade overall fidelity. We present BRIDGE, a Buffer-Relay Interconnect for Data-stable Gate Execution that co-designs a static, compiler-managed buffer-relay fabric with a lazy-move compiler that exploits it. BRIDGE targets an optimized, dual-species 2D interleaved atom array, using non-encoding ``buffer atoms'' to mediate long-range interactions in the fixed baseline and introducing limited data motion only for selected hotspots. By using calibrated heteronuclear and homonuclear Rydberg channels, BRIDGE realizes a static routing backbone in which data-buffer and buffer-buffer interactions are enabled while residual data-data crosstalk is suppressed. Across a 22-circuit matched benchmark suite re-estimated under a single shared error model, BRIDGE attains a geometric-mean $\sim$10$\times$ higher total fidelity than ZAP and $\sim$16$\times$ than Enola, together with $\sim$540$\times$ and $\sim$1000$\times$ lower circuit execution time, respectively, while reducing data-atom movement from thousands of transport events to zero.

cs.CR 2026-07-01

FPGA parallelism leaks ML-KEM keys despite higher-order masking

by Davis Ranney, Yashaswini I Makaram +2 more

Exploring Side-Channel Protections in Hardware Implementations of PQC ML-KEM Verification

Experiments recover full secret keys from masked verification on FPGAs via first-order leakage created by parallel processing.

As ML-KEM is adopted as a post-quantum cryptographic standard, resilience against physical side-channel attacks has become essential. Among the constituent steps, the decapsulation Fujisaki-Okamoto (FO) verification is particularly vulnerable to side-channel power and electromagnetic (EM) analysis. In this work, we examine the side-channel vulnerabilities of common FPGA-based implementations and compare them with those of microcontroller implementations. Three verification implementations, unprotected, hash-based (first-order), and higher-order masked, are evaluated for side-channel security on both a microcontroller and an FPGA. While FPGAs offer higher speed and parallelism, they often exhibit stronger side-channel leakage, especially in high-bandwidth configurations. The higher-order masked designs still leak information about the underlying data due to hardware-level effects and data-dependent processing. Our experiments show that their parallelized processing on FPGAs introduces sufficient first-order leakage for full secret-key recovery. These results underscore the persistent challenge of securing PQC algorithms in performance-constrained and parallelized hardware environments.

cs.AR 2026-07-01

In-situ memristive indexing achieves 4.7-7.8x higher throughput

by Bing Wu, Xueliang Wei +7 more

In-situ Indexing via Memristive Content-Addressable Memory

Ultra-large logical buckets and in-memory moving remove most collision-resolution and resizing costs in hash tables.

Processing-in-Memory (PIM) is a proven paradigm for overcoming the ``memory wall''. However, while data indexing is severely bottlenecked by this same wall, it remains unclear how indexing can effectively benefit from PIM's unique capabilities. We present PATH, an in-situ indexing architecture that bridges this gap by leveraging the massive parallelism and inherent data-movement of PIMs. Specifically, we first reformulate the fundamental indexing operations, namely Insert, Search, Update, and Delete, into highly parallel in-situ content-addressable memory operations executed directly within memory arrays. Taking hash indexes as a typical case, we elaborate how PATH breaks the inherent trade-off among memory accesses, load factor, and process latency in conventional hashing schemes. By adopting ultra-large logical buckets and in-memory moving, PATH virtually eliminates the cost of hash collision resolution and significantly reduces resizing overhead. Compared with state-of-the-art schemes, PATH achieves $4.7$-$7.8\times$ higher throughput, $>14.5\times$ lower tail latency, and $>61.4\%$ fewer memory accesses under insertions, laying a scalable foundation for next-generation data-centric computing.

cs.AR 2026-07-01

PEERS computes exact resistances on 1M-node graphs in 18.8 seconds

by Baiyu Chen, Lin Gan +2 more

PEERS: A Parallel and Exact Effective Resistance Solver via Implicit Inversion and Augmented Symbolic Analysis

Implicit Cholesky inversion lets the solver answer all-edge queries exactly in one parallel sweep and scale to 17M nodes in under an hour.

High-precision effective resistance computation is a cornerstone of Electronic Design Automation (EDA) sign-off, yet it remains a fundamental bottleneck in large-scale power grid analysis, spectral sparsification, and circuit reliability. Existing approaches face a prohibitive "precision-memory impasse": approximate methods lack the stringent accuracy required for high-stakes industrial sign-off, while exact methods either suffer from redundant query overheads or trigger $O(n^2)$ memory explosions. To resolve this, we propose PEERS, a Parallel and Exact Effective Resistance Solver powered by an implicit inverse computing model of the Cholesky factor. By integrating a state-inherited augmented depth-first search (DFS) with a dynamic query update mechanism, PEERS eliminates numerical redundancy and evaluates all-edge resistance queries in a single parallel sweep. We provide a rigorous Work-Span analysis, proving that for graphs satisfying an $O(n^\alpha)$ separator theorem, PEERS achieves a theoretically optimal parallel span of $O(n^\alpha)$ while strictly maintaining $O(nnz(L))$ space complexity. Numerical evaluations on industrial benchmarks demonstrate that PEERS achieves an average speedup of 83.3x over state-of-the-art parallel solvers under identical memory constraints. Notably, PEERS processes a 1-million-node industrial graph in just 18.8 seconds and scales to 17 million nodes in under an hour, providing the first computationally feasible path for exact all-edge resistance analysis in multi-million-gate designs.
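The quantity being computed can be made concrete with a small exact solver. The sketch below grounds one endpoint, injects a unit current at the other, and solves the reduced Laplacian system with dense Gauss-Jordan elimination over rationals. This is a deliberately naive baseline for tiny graphs, not the implicit Cholesky inversion that PEERS uses, and the function name and edge format are assumptions.

```python
from fractions import Fraction

def effective_resistance(n, edges, u, v):
    """edges: (a, b, conductance) triples on nodes 0..n-1; requires u != v.

    Ground node v, inject 1 A at u, and return the resulting potential at u.
    """
    keep = [i for i in range(n) if i != v]
    pos = {node: i for i, node in enumerate(keep)}
    m = len(keep)
    # Augmented system: reduced Laplacian (row and column v removed) | rhs.
    A = [[Fraction(0)] * (m + 1) for _ in range(m)]
    for a, b, g in edges:
        g = Fraction(g)
        for x, y in ((a, b), (b, a)):
            if x in pos:
                A[pos[x]][pos[x]] += g
                if y in pos:
                    A[pos[x]][pos[y]] -= g
    A[pos[u]][m] = Fraction(1)
    # Gauss-Jordan elimination; a connected graph gives a nonsingular system.
    for c in range(m):
        p = next(r for r in range(c, m) if A[r][c] != 0)
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [x / pivot for x in A[c]]
        for r in range(m):
            if r != c and A[r][c] != 0:
                f = A[r][c]
                A[r] = [x - f * y for x, y in zip(A[r], A[c])]
    return A[pos[u]][m]

triangle = [(0, 1, 1), (1, 2, 1), (0, 2, 1)]   # three unit resistors
r_edge = effective_resistance(3, triangle, 0, 1)
```

For the triangle, one 1-ohm resistor in parallel with a 2-ohm path gives 2/3 ohm. Repeating such a dense solve per edge is what becomes infeasible at millions of nodes, which is the redundancy the abstract targets.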

cs.AR 2026-07-01

Dynamic bit precision boosts FPGA CNN energy efficiency 82%

by Muhammad Usman, Malik Zohaib Nisar +2 more

MINT: Dynamic-Precision CNN Inference with MSDF Digit-Serial Arithmetic on FPGA

MINT selects 5-6 bit layers via greedy search for VGG-16 and ResNet-18 on Zynq-7020 with 2% accuracy tolerance

We present MINT, a dynamic-precision CNN inference accelerator based on left-to-right (LR) arithmetic. LR arithmetic computes in a most-significant-digit-first (MSDF) manner and exposes useful partial results early, so the computation can be terminated once the desired precision is reached. At its core is an MSDF serial-parallel inner-product unit, which uses a redundant signed-digit representation to compute each convolution window. A budget-constrained greedy search profiles all convolution layers from INT2 to INT7 and selects the lowest precision per layer while constraining total accuracy loss to within 2\% of the INT8 baseline for VGG-16 and ResNet-18 networks. The design is synthesized on a Xilinx Zynq-7020 at $200~\mathrm{MHz}$, and uses 5.64 average bits for VGG-16 and 6.04 for ResNet-18, while achieving 19.86 GOPS and 29.51 GOPS/W on VGG-16, and 18.86 GOPS and 26.40 GOPS/W on ResNet-18. This corresponds to 32.6\% and 26.0\% higher throughput and 82.10\% and 62.90\% higher energy efficiency than INT8, with accuracy drops of only 1.81\% and 1.96\% relative to the INT8 baseline. Compared with representative prior FPGA CNN accelerators considered in this study, MINT delivers the highest energy efficiency among the listed VGG-16 and ResNet-18 designs on the Zynq-7020 platform.
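Early termination in most-significant-digit-first arithmetic can be illustrated with a simplified inner product. The sketch below streams plain unsigned binary fraction digits of the activations, most significant first, while the weights stay parallel. The use of ordinary binary digits (not the redundant signed-digit form the abstract mentions) and all function names are simplifying assumptions.

```python
def to_digits(x, n):
    """First n binary fraction digits of x in [0, 1), most significant first."""
    digits = []
    for _ in range(n):
        x *= 2
        d = int(x)
        digits.append(d)
        x -= d
    return digits

def msdf_inner_product(weights, activations, n_digits, stop_after):
    """Consume one activation digit per step; return the sum after stop_after steps."""
    streams = [to_digits(a, n_digits) for a in activations]
    acc = 0.0
    for j in range(stop_after):
        # Digit j of every activation arrives together and is weighted in parallel.
        column = sum(w * s[j] for w, s in zip(weights, streams))
        acc += column * 2.0 ** -(j + 1)
    return acc

weights = [3, -2, 5]                       # illustrative values
activations = [0.8125, 0.3125, 0.5625]
exact = sum(w * a for w, a in zip(weights, activations))
partials = [msdf_inner_product(weights, activations, 8, k) for k in range(1, 9)]
```

After $k$ digits the partial sum differs from the exact result by at most $2^{-k} \sum_i |w_i|$, so stopping early trades a bounded error for fewer cycles, which is the lever a per-layer precision search can pull.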

cs.AR 2026-07-01

MSDF adders cut FPGA resources 2.5x for ultrasound beamforming

by Muhammad Usman, Shujaat Khan +1 more

Dynamic Ultrasound Beamforming Using Left-to-Right Arithmetic Adders on FPGA

Early termination yields diagnostic images, lowers power 23 percent, and raises frame rate 80 percent on small devices.

Adder trees are the computational backbone of delay-and-sum (DAS) ultrasound beamforming, where their implementation directly determines the energy, throughput, and area of a real-time imaging pipeline. Conventional parallel adder trees perform full-precision combinational reduction on every sample, leading to wide critical paths, high LUT consumption, and timing failures on small FPGA devices. This paper presents an alternative adder tree architecture based on left-to-right (LR), or most-significant-digit-first (MSDF), arithmetic. We implement the proposed and conventional adder trees on a Xilinx Zynq XC7Z010 FPGA and evaluate them for DAS beamforming of a 64-channel ultrasound dataset. The proposed design uses 2.5$\times$ fewer LUTs than the smallest conventional tree, successfully meets the timing constraint, and consumes 23\% less dynamic power than the most efficient conventional baseline. A key advantage of the proposed MSDF adder tree is that it can generate high-quality beamformed images without waiting for full-precision completion. This naturally enables dynamic precision at runtime with negligible control overhead, since precision selection is achieved simply by stopping the computation clock after the desired number of cycles. Such quality--energy scalability is fundamentally unavailable in conventional fixed-cycle adder trees. Iso-area replication enables up to 15 parallel instances on the XC7Z010, achieving 67 FPS, which is 80\% higher throughput than the best conventional design.

cs.AR 2026-07-01

Single-level spectral method matches multilevel hypergraph cuts at linear scale

by Rongjian Liang, Zhuo Feng +1 more

HySpecPro: Scalable Hypergraph Partitioning via Spectral Projection Optimization

HySpecPro avoids coarsening distortions by optimizing directly in bipartite Laplacian embeddings for large VLSI designs.

Modern VLSI designs comprise tens of billions of components, making scalable hypergraph partitioning critical for parallel and hierarchical optimization. Although multilevel partitioning remains the dominant paradigm, its coarsening stage can distort structural information, especially in hypergraphs with many high-degree hyperedges, leading to increased refinement overhead and limited scalability. Recent approaches incorporate spectral information to guide coarsening, but only in a heuristic manner, without directly optimizing the partitioning objectives. We introduce HySpecPro, a single-level hypergraph partitioner that performs end-to-end optimization in a spectral embedding space. HySpecPro constructs embeddings from a bipartite Laplacian and performs efficient projection-based search, supported by a fully GPU-accelerated implementation. Experiments show that HySpecPro delivers cut quality comparable to state-of-the-art multilevel methods while scaling linearly with the total hyperedge degree.


cs.AI 2026-06-30

Multi-agent system refactors to HLS with 6.51x speedup

by Yang Zou, Zijian Ding +2 more

AgRefactor: Self-Evolving Agentic Workflow for HLS Compatibility and Performance

Self-evolving memory lets AgRefactor beat prior tools on programs five to ten times longer while using under 20 percent extra resources.


High-Level Synthesis (HLS) provides a fast path from concepts to silicon, but converting real-world software into synthesizable HLS code remains challenging due to restrictive language support and the gap between software and hardware programming practices. Existing automated and LLM-based refactoring approaches partially address this problem, yet they often lack flexibility, struggle to scale, and incur high computational costs. We introduce AgRefactor, an LLM-based multi-agent workflow for refactoring software into HLS-compatible programs. AgRefactor incorporates a self-evolving memory system that accumulates and retrieves factual and strategic knowledge across tasks, improving robustness and efficiency on unseen programs. To reduce cost and enhance scalability, it integrates automated refactoring tools, enabling agents to balance LLM-driven rewrites with efficient tool-based transformations. On 9 out of 11 challenging real-world benchmarks, which are 5-10x longer than the most complex cases studied in prior work, AgRefactor outperforms or matches the state-of-the-art automated refactoring tool and a strong LLM-based baseline built on the same framework backbone. Further agentic performance optimization yields a 6.51x geometric mean speedup over the SoTA pragma tuning tool and a 1.20x speedup over optimized open-source designs with less than 20% extra resources. AgRefactor is fully-automated and open-sourced.


cs.AR 2026-06-30

SpikON cuts SNN training latency 32 percent and speeds it 7-27x on edge chips

by Peilin Chen, Xiaoxuan Yang

SpikON: A Dual-Parallel and Efficient Accelerator for Online Spiking Neural Networks Learning

Algorithm-hardware co-design adds learnable thresholds and cascade reuse to make online supervised spiking networks practical at the edge.


Spiking neural networks (SNNs) have emerged as a promising paradigm for energy-efficient brain-inspired computing. However, existing online unsupervised SNN learning suffers from low training accuracy and poor scalability. Although current online supervised learning algorithms perform well on large-scale datasets and networks, the non-hardware-friendly operations hinder efficient edge deployment. In this work, we propose SpikON, the first algorithm-hardware co-design framework for efficient and scalable end-to-end online supervised SNN learning. We first propose the learnable threshold through time and scaled weight centralization through time techniques to address the inefficiency of traditional algorithms. Moreover, to reduce latency and energy consumption, we introduce the novel training dataflow and cascade computation reuse scheme for SNNs that allows concurrent forward-backward computation and temporal reuse across timesteps. We further design the dedicated SNN accelerator with a dual-parallel engine and customized SIMD-based SNN core for efficient end-to-end online learning. Experiments show that the SpikON algorithm achieves 32.2% and 35.0% reductions in training latency and energy consumption over the baseline, without sacrificing accuracy. Moreover, the SpikON co-design achieves 7.2x (11.5x) and 26.8x (15.8x) training throughput (energy efficiency) compared with the edge Apple M4 GPU and TPU-like accelerator, respectively. The code is available at https://github.com/peilin-chen/SpikON.


quant-ph 2026-06-30

CryoZip cuts QEC syndrome data up to 48x at 4 K

by Guanchen Tao, Alexander Knapen +5 more

CryoZip: An Efficient Cryogenic Compressor for Quantum Error Correction Syndromes

Compressor paired with predecoder delivers over 14,000x bandwidth reduction and 42x energy savings across the cryogenic interface.


Scaling fault tolerant quantum computing is increasingly constrained by the limited bandwidth and power budget across the 4 K to room temperature (RT) interface. We present CryoZip, a cross stack cryogenic compression framework that cooperates with a lightweight cryogenic quantum error correction (QEC) predecoder to reduce 4 K to RT syndrome transmission under realistic, circuit level noise. CryoZip targets sparse syndrome vectors with a sliding window compression architecture sized under strict decoding latency constraints to maximize energy efficiency. We implement and evaluate the design in 22 nm FDSOI characterized at 4 K, using vector based power, performance, and area analysis to obtain realistic hardware data. CryoZip achieves up to 48x compression, 1.8x higher than state of the art compressors, across various QEC codes while delivering 4 to 26x energy savings. When paired with a QEC predecoder, it yields over 14,238x bandwidth reduction, while energy savings rise to 42x when accounting for realistic QEC interface overheads.


cs.AR 2026-06-30

COSM boosts PIM speed 2.8x on mobiles by using CPU idle slots

by Yilong Zhao, Fangxin Liu +5 more

COSM: A Cooperative Scheduling Framework for Concurrent PIM and CPU Execution on Mobile Devices

The framework adds a low-interference interface and idleness-aware scheduling to let PIM and CPU share memory with under 2% CPU impact.


The development of on-device large language models (LLMs) is driven by the need for privacy and fast response times. Energy-intensive data transfer on mobile devices makes Processing-in-Memory (PIM) an effective solution. Due to stringent DRAM cost constraints, limited physical footprint on circuit boards, and the interaction between applications and LLMs, it is imperative for the CPU and PIM to operate concurrently within a shared memory space. However, challenges such as bank conflicts and bus congestion can arise, potentially diminishing the performance and energy benefits of PIM. To address this challenge, we introduce COSM, a cooperative scheduling framework designed to facilitate the concurrent operation of PIM and CPU tasks on mobile platforms. Our key innovations include: 1) a low-interference PIM control interface that generates the maximum number of PIM commands without disrupting CPU memory accesses; 2) an idleness-aware scheduling method that integrates PIM commands into available idle time windows within the CPU's access sequence. COSM not only hides PIM execution latency from the CPU, but also overlaps PIM execution with data transfer. Experiments on concurrent execution of LLMs and mobile workloads, including mobile applications and compute-intensive kernels, demonstrate that COSM improves PIM throughput by up to 2.8x compared to the baseline scheduling method with less than 2.0% CPU performance loss.


eess.SY 2026-06-30

Harmonic-corrected MPCC lowers single-phase EV charger THD to 2.85%

by Changhong Li, Bharathkumar Hegde +2 more

Model Predictive Current Control with Harmonic Correction for Single-Phase AC-DC EV Charging

Duty cycle prediction plus real-time estimation suppresses low-order harmonics in OBC simulations for better grid current quality.


The increasing integration of Electric Vehicles (EVs) has imposed a growing harmonic challenge on the power grid. For AC/DC Power Factor Correction (PFC) in single-phase On-Board Chargers (OBCs), Model Predictive Current Control (MPCC) improves the current quality by predicting and tracking the inductor current. However, finite control set MPCC selects switching states, resulting in discrete control actions and a limited optimisation space. Moreover, the MPCC cost function based on instantaneous current tracking error has limited capability to compensate for low-order harmonic disturbances induced by dead time, control delay, and model parameter mismatch. This paper proposes a duty cycle predictive MPCC incorporating a real-time harmonic estimation reference. The proposed method dynamically estimates the low-order harmonic components of the input current and corrects the MPCC reference current, enabling continuous duty cycle control and targeted suppression of dominant low-order harmonics. Simulation results on a single-phase OBC demonstrate that the proposed duty cycle predictive MPCC reduces the steady-state current THD_i from 11.47% to 6.10% compared with the switching state predictive MPCC. With the harmonic reference, the THD_i is further reduced to 2.85%.
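For readers unfamiliar with the THD_i figures quoted here, the following minimal sketch computes total harmonic distortion from harmonic amplitudes using the common fundamental-referenced definition (root-sum-square of the higher harmonics divided by the fundamental). The abstract does not state which definition or how many harmonics the authors use, so this is an assumption, and the amplitudes below are made up for illustration.

```python
import math


def thd(harmonic_amplitudes):
    """Total harmonic distortion relative to the fundamental.

    `harmonic_amplitudes[0]` is the fundamental; the remaining entries are
    the 2nd, 3rd, ... harmonic amplitudes in the same unit.
    """
    fundamental, *higher = harmonic_amplitudes
    if fundamental <= 0:
        raise ValueError("fundamental amplitude must be positive")
    return math.sqrt(sum(a * a for a in higher)) / fundamental


# Hypothetical current spectrum: fundamental plus two low-order harmonics.
example = thd([1.0, 0.03, 0.04])
print(f"THD = {example * 100:.2f}%")
```

Under this definition, suppressing the dominant low-order harmonics shrinks the numerator directly, which is why a harmonic-corrected reference targets them.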


cs.AR 2026-06-30

One-shot pruning cuts FPGA neural net search cost by 20x

by Changhong Li, Biswajit Basu +1 more

RQP: Resource-Oriented Quantiser Pruning for Neural Networks on FPGAs

Method moves networks close to target resources in one step then refines with bidirectional scheduling, preserving accuracy trade-offs.


High granularity quantisation (HGQ) exploits weight-level quantisation and pruning to design resource-efficient neural network accelerators, achieving an attractive trade-off between accuracy and hardware utilisation. HGQ is particularly well suited to FPGA-based edge neural network applications. Standard HGQ workflow starts from a high-precision model and progressively reduces bit width, guided by gradient-based optimisation to outline the Pareto frontier. This monotonic and irreversible pruning process is computationally intensive and can overlook the optimal subnetwork for a given resource level. We propose a resource-oriented one-shot quantiser pruning method that brings the network directly close to the target search space, and then use bidirectional beta scheduling for fine-tuning to enable a more refined scan of the Pareto frontier. Validated on the jet substructure classification, JSC, task, our method reduces the search cost by up to 20.58x compared with monotonic resource reduction in standard HGQ workflows, while achieving a competitive Pareto frontier and final network configuration.


cs.AR 2026-06-30

22 nm SNN chip hits 0.375 pJ/SOP

by Rick Luiken, Manil Dev Gomony +1 more

Mega: A 22 nm Convolutional Spiking Neural Network Accelerator Achieving 0.375 pJ/SOP for Efficient Edge Vision

Parallel 3x3 units and unified memory let the accelerator match varying sparsity without extra overhead.


Convolutional Spiking Neural Networks (SNN) offer the potential for highly energy-efficient vision processing by exploiting sparse, event-driven computation. However, existing SNN accelerators underutilize the inherent parallelism of convolutional layers and lack the flexibility to accommodate varying memory demands and input sparsity across layers. This paper presents Mega, a digital architecture for convolutional SNNs that addresses these limitations through three key contributions: (1) highly parallel acceleration of 3x3 convolutions, (2) a unified data memory for spikes, neuron states, and weights, and (3) efficient spike map processing with low-overhead spike detection. Fabricated in GlobalFoundries 22 nm FDSOI technology, Mega achieves an energy efficiency of 0.375 pJ/SOP, improving the state of the art by 4x.


cs.AR 2026-06-30

Mixed GDDR and HBM hardware boosts LLM goodput 3.2x

by Zhixiang Wei, Yun Wang +3 more

HBM Is Not All You Need: Efficient Disaggregated LLM Serving across Memory-heterogeneous Accelerators

Phase-wise quantization and deferred dequantization pair cheaper GDDR prefill with HBM decode for higher goodput and lower cost.


LLM inference comprises a compute-bound prefill phase and a memory-bound decode phase, and recent systems disaggregate them onto separate hardware. Yet today's datacenter GPUs rely on costly HBM whose bandwidth sits almost entirely idle during prefill. LLM serving across memory-heterogeneous accelerators (MemHA) pairs GDDR-based accelerators for prefill with HBM-based GPUs for decode, promising lower cost without sacrificing performance. Pushed to its most economical form, MemHA serving is inherently cross-vendor, since the best-suited chip for each phase may come from a different vendor. This breaks two assumptions that single-vendor disaggregation takes for granted: a KV format both ends consume natively, and a shared software stack. We present HMA-Serve, a MemHA-centric disaggregated serving system pairing GDDR-based accelerators for prefill with HBM-based GPUs for decode efficiently. HMA-Serve achieves this through (1) phase-wise quantization, applying vendor-native low precision for high-throughput prefill while keeping decode in high-precision BF16, (2) a compute-transfer pipeline that overlaps each layer's KV cache transfer with later-layer prefill to reduce time-to-first-token (TTFT), and (3) deferred dequantization, shipping raw quantized bytes and reconstructing them lazily on the decode GPU to reduce network bandwidth and HBM usage. Across four Qwen3 models (4B-32B) and three production traces, HMA-Serve delivers up to 3.2x higher goodput than state-of-the-art memory-homogeneous methods and 4.8x higher goodput-per-dollar, with no measurable loss on generation-quality benchmarks.


cs.AR 2026-06-29

Idle AI chips run general tasks as neural approximations

by Yihan Wang, Huiru Yan +7 more

Harvesting AI Computation at the Edge via Generic Approximation

A scheduler places NAS-generated models onto edge AI engines only during gaps in primary neural-network work.


With the widespread adoption of AI in various IoT scenarios such as smart sensing and processing, AI chips have become a common component at the edge. These chips are typically specialized for structured neural network (NN) processing and are designed to meet peak workload demands. However, they are often underutilized and suffer from considerable computational waste due to temporal or spatial redundancy in processing. Conversely, general-purpose processing engines at the edge may struggle with compute-intensive tasks such as signal processing and complex numerical operations because of stringent resource constraints. To address this imbalance, we propose a framework that harvests unused AI computation resources using general-purpose approximation techniques. The core idea is to automatically convert traditional computing tasks into neural network models via a representative neural architecture search (NAS) method. These approximate versions of general-purpose tasks are then deployed on AI engines during their idle periods. Specifically, we introduce a runtime scheduler that offloads these tasks to AI chips without compromising the performance of primary AI workloads, thereby alleviating the burden on general-purpose processors. Experiments on a representative AIoT processor show that our proposed AI computation harvesting strategy delivers substantial performance improvements across a set of edge processing tasks.


eess.IV 2026-06-29

4-way split encoding hits 122 fps real-time for 8K V-PCC on Blackwell GPUs

by Kasidis Arunruangsirilert, Jiro Katto

Performance Analysis of Hardware-Accelerated 10-Bit 4:2:2 Encoding with Split-Frame Encoding for High-Fidelity V-PCC Streaming

Standard GPUs now support the 10-bit 4:2:2 demands of high-density volumetric streaming without custom chips.


Video-based Point Cloud Compression (V-PCC) encodes volumetric data by projecting 3D geometry and texture onto 2D video frames. To prevent spatial distortion and color bleeding during 3D reconstruction, this process requires 10-bit color depth and 4:2:2 chroma subsampling, rather than the standard 8-bit 4:2:0 format. Additionally, capturing high-density dynamic point clouds requires demanding encoding parameters, such as 8K resolution at framerates up to 120 fps. Historically, the lack of 4:2:2 chroma support in older GPU hardware encoders restricted real-time V-PCC to custom Application-Specific Integrated Circuits (ASICs). However, the recent introduction of NVIDIA's Blackwell GPU architecture, featuring on-chip hardware encoders with 10-bit 4:2:2 support, presents an opportunity to shift this workload to general-purpose hardware. This paper investigates the feasibility of such an approach. Using a commercially available Blackwell GPU equipped with four parallel on-die hardware encoders as a testbed, we evaluate the throughput, rate-distortion (RD) performance, and power consumption of 8K 10-bit 4:2:2 HEVC across various Split-Frame Encoding (SFE) configurations. Our results demonstrate that 4-way SFE achieves an encoding throughput of 122 fps, successfully meeting the strict real-time constraints of high-density V-PCC. Although the inability to exploit spatial redundancies across slice boundaries results in a BD-Rate penalty of up to 5%, the measured throughput and power efficiency establish standard, commercial off-the-shelf GPUs as a highly viable baseline for real-time volumetric video streaming.


cs.AR 2026-06-29

Chiplet systems gain up to 12.5x throughput by relocating compute contexts

by Arvin Delavari, Leonid Popryho +2 more

SHIFT: Dynamic Compute Relocation Framework for Communication-Aware Chiplet-Based Systems

SHIFT moves entire node state instead of data alone, cutting latency up to 76.8% and improving LLM energy-efficiency 1.8x in simulations.


The increasing communication complexity of large-scale heterogeneous systems has motivated runtime methodologies for communication-aware workload placement and routing optimization. These communication limitations are addressed in this paper by proposing SHIFT, a novel topology-agnostic approach that transfers compute node context and data to a more suitably positioned node, rather than only shifting data as in conventional networks-on-chip. The proposed strategy is evaluated on a chiplet-based architecture utilizing a fine-pitch integration platform featuring multiple bandwidth-domains for heterogeneous workloads. The proposed architecture employs multi-layered routing between functional or memory chiplets and utility chiplets, which serve as intelligent nodes for routing and compute relocation. Adaptive scheduling and routing utilize a modified shortest-path algorithm for large-scale systems, complemented by a lightweight ML-assisted policy that infers traffic conditions to improve adaptivity. To establish a performance baseline, the initial assessment uses random instruction vectors and data patterns to evaluate the fundamental capabilities of SHIFT. Simulation results exhibit successful relocations over total trials ranging from 75.2% to 97.9% across configurations, with average latency improvements of 16.4%-62.5% and a maximum of 76.8%. In addition, throughput is improved by up to 12.5x, power dissipation per unit area is reduced by ~8%, energy-per-bit is reduced by up to 58.3%, and performance is improved by 18%. To evaluate efficiency under high logic and data density, the framework was tested on standard LLM workloads. Results exhibit average improvements of 4.9x, 5.9x, and 1.8x in runtime, throughput, and energy-efficiency, respectively, surpassing state-of-the-art wafer-scale LLM services and demonstrating compatibility with large-scale platforms and applications.


cs.PF 2026-06-29

Kernel simulator predicts LLM latencies on new GPUs at 12.1% error

by Xiteng Yao, Taeho Kim +8 more

KernelSight-LM: A Kernel-Level LLM Inference Simulator

Cross-generation tier needs no target data and improves 1.8x over roofline for serving predictions.


As large language models (LLMs) move into production serving, practitioners must rapidly evaluate inference performance across diverse hardware, models, and serving parameters to meet cost and latency targets. However, the end-to-end behavior of LLMs couples serving-layer policies with low-level GPU kernel execution and rapidly evolving architectures, forcing slow, deployment-specific benchmarking that is hard to generalize. We present KernelSight-LM, a fine-grained inference simulator that models token-level execution and produces kernel-level latency breakdowns. It decomposes each serving step into a roofline kernel model with a learned efficiency term, a communication model, and a host-overhead model, composed through a discrete-event scheduler that also captures mechanisms like prefix caching and continuous batching. KernelSight-LM offers two prediction tiers that trade target-GPU data for accuracy. The cross-generation tier uses no target-GPU measurements, only hardware specifications and kernel microbenchmarks from previously profiled GPUs, and predicts per-kernel latency on an unseen GPU generation to 12.1% error, a 1.8x improvement over the roofline baseline (22.0%). A second target-measured tier adds one model-agnostic kernel-microbenchmark sweep on the target GPU, sharpening per-kernel error to 3.8%, a 7.3x improvement over a comparable baseline (27.7%). Both tiers require far less target-GPU data than the prior systems they extend. In our simulator, these predictions yield end-to-end median (p50) errors across six model families of 15.4%, 12.8%, and 3.0% (TTFT, TPOT, throughput) in the cross-generation tier and 14.3%, 6.2%, and 2.7% in the target-measured tier, matching dedicated profiling tools while collecting far less on-device data. Beyond prediction, its kernel-level bottleneck breakdowns support hardware/software co-design and capacity planning.
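A rough sketch of what a roofline kernel model with an efficiency term might look like, plus the arithmetic behind the quoted improvement factors. The functional form, function names, and hardware numbers are assumptions for illustration; the abstract does not specify how KernelSight-LM defines or learns its efficiency term.

```python
def roofline_latency(flops, bytes_moved, peak_flops, peak_bandwidth, efficiency=1.0):
    """Latency bound in seconds: the slower of compute and memory time,
    divided by an efficiency factor in (0, 1] (hypothetical form)."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    compute_time = flops / peak_flops
    memory_time = bytes_moved / peak_bandwidth
    return max(compute_time, memory_time) / efficiency


def error_improvement(baseline_error, new_error):
    """Ratio by which a prediction error shrinks relative to a baseline."""
    return baseline_error / new_error


# Hypothetical kernel on a hypothetical GPU: compute-bound at 80% efficiency.
latency = roofline_latency(2e12, 1e9, 1e14, 1e12, efficiency=0.8)

# Error figures taken from the abstract (percent).
cross_generation = error_improvement(22.0, 12.1)
target_measured = error_improvement(27.7, 3.8)
```

The two ratios round to the 1.8x and 7.3x improvements stated in the abstract.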


cs.AR 2026-06-29

Agent evolves hardware designs to 100% benchmark completion

by Cunxi Yu, Chenhui Deng +2 more

Agentic Hardware Design as Repository-Level Code Evolution

HORIZON turns Markdown harnesses into git-managed loops that finish every task in ChipBench, RTLLM, and Verilog-Eval without human input.


We present HORIZON, a self-evolving agent framework that treats hardware design as repository-level code evolution. A Markdown harness is compiled into a project pack containing domain knowledge, an executable evaluator, an acceptance predicate, and a git/runtime policy; a hands-free agent loop then evolves an isolated git worktree, using repository operations for state management, tracing, and replay. This extends prior work on repository-scale self-evolution from EDA software systems to hardware-design artifacts themselves. We evaluate our approach on ChipBench, RTLLM, Verilog-Eval, and nine CVDP categories, achieving 100% benchmark completion across all suites with a fully hands-free agentic loop. However, we do not claim that agentic AI for hardware design is solved: these benchmarks are controlled proxies for a much broader engineering problem in chip design. The paper's discussion section examines the limitations of the current study and highlights open research challenges.


cs.AI 2026-06-29

AI framework automates novel high-tech system design

by Luuk Oerlemans, Steven Westerhof +1 more

AI-Driven Synthesis for High-Tech System Design: Automating Innovation

Computational design synthesis applies deep learning to move from simulation optimisation to autonomous generation with little human oversig


This article addresses the combinatorial complexity inherent in modern high-tech system design by presenting automation-in-design (AiD) as a transformative paradigm. We propose computational design synthesis (CDS), a framework utilising deep learning and generative AI to automate the creation of novel systems. Two case studies (e-drive system design and spatial dimensioning problem) serve as proof-points for this approach. The AI-driven methods used in the case studies represent a fundamental shift in engineering, advancing from simulation-based optimisation towards autonomous design with minimal human supervision.


cs.CR 2026-06-29

Hardware benchmarks gain tamper-evident records with hash links

by Faruk Alpay, Baris Basaran

Self-Verifying Measurement Records: Hash-Linked Evidence Graphs for Hardware Benchmarking

Each quantity binds to its observation and verification via content hash in an append-only log for offline audit.


Performance numbers reported for hardware are accepted on trust: the reader cannot recompute them, the apparatus is gone, and the silicon itself can be silently wrong, with fleet studies reporting on the order of one core in a thousand returning incorrect arithmetic with no error raised. We make a reported hardware measurement a tamper-evident, independently checkable record. Every quantity in the text, a table, or a figure is bound, by its content hash, to the observation and the verification behind it; the whole is a hash-linked, append-only structure (a transparency log for measurement) that a verifier audits offline without trusting its producer. Matrix products are verified by a probabilistic identity (Freivalds) at O(k n^2) cost under a tolerance we derive from floating-point error analysis and calibrate to the device's own measured residual floor, so a wrong product is rejected with probability 1 - 2^(-k); quantities with no such identity carry an algebraic checksum and a measured reproducibility class. We then treat the check itself as a security object: a probe seed committed for offline reproducibility is an attack surface, and a probe-aware adversary can hide a corruption in the probe's null space, fooling even a quorum of bit-identical witnesses, while a Fiat-Shamir challenge derived from the claimed output closes this. Driving the device from an unprivileged tenant's reach, with a di/dt power virus and a thermal soak, neither moves the calibrated tolerance nor produces a silent error, placing the physical-fault threat at the rare defective part or the privileged attacker and marking the boundary at which the record must compose with a hardware root of trust. We demonstrate the construction across Blackwell and Hopper GPUs and report a residual-floor and reproducibility map by precision, size, and device.
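The Freivalds check mentioned in this abstract can be sketched in a few lines. This is a textbook version in pure Python with random 0/1 probe vectors, not the authors' implementation: the `tol` parameter is only a stand-in for the floating-point tolerance the paper derives, and the seed handling does not model the Fiat-Shamir challenge the abstract describes. Each round costs O(n^2) and a wrong product survives a round with probability at most 1/2.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def freivalds(a, b, c, rounds, tol=0.0, rng=None):
    """Probabilistically check that a @ b == c without forming a @ b.

    Returns False as soon as one probe disagrees by more than `tol`;
    a wrong `c` passes all `rounds` probes with probability <= 2**-rounds.
    """
    if rng is None:
        rng = random.Random()
    n = len(c[0])
    for _ in range(rounds):
        probe = [rng.randint(0, 1) for _ in range(n)]
        left = matvec(a, matvec(b, probe))   # a @ (b @ probe)
        right = matvec(c, probe)             # c @ probe
        if any(abs(x - y) > tol for x, y in zip(left, right)):
            return False
    return True


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
good = [[19, 22], [43, 50]]   # the true product
bad = [[19, 22], [43, 51]]    # one corrupted entry
```

With `rounds=k`, the cost is k matrix-vector passes, matching the O(k n^2) figure in the abstract.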


cs.AR 2026-06-29

NPU phase effects cut mobile VLM energy use 2.5x

by Aryama V Murthy, Yashas N Kotre +4 more

Phase Matters: Characterizing Heterogeneous Vision-Language Inference on a Mobile SoC

Prefill sees 1.64x speedup on Snapdragon while decode gets 1.18x, yielding cooler steady-state operation without throttling.


Recent phone-class mobile SoCs expose practical NPU execution paths for on-device vision-language model (VLM) inference, but developers still lack phase-level guidance for mapping VLM pipelines across heterogeneous backends. We present a hardware-in-the-loop characterization of VLM inference on the Qualcomm SM8750 (Snapdragon 8 Elite), covering phase throughput, cache-state effects, 100-run thermal stability, energy, heterogeneous CPU/NPU pipeline configurations, and visual-token-budget sensitivity. Using FastVLM-0.5B as an end-to-end case study, together with encoder-only measurements across four architecture families, we show that phase matters: NPU execution is highly phase-dependent, delivering 1.64x speedup for prefill but only 1.18x for decode, while vision encoders achieve 20-45x speedups over CPU. These gains translate into 10.47 degrees C lower steady-state temperature and 2.52x lower energy, avoiding thermal throttling in always-on settings. Finally, we show that a four-step graph rewrite enables previously unsupported encoders, such as Phi-3.5-V, to reach the QNN path with up to 22x speedup, providing a practical porting recipe for mobile VLM deployment.


cs.AR 2026-06-29

Analog KANs with pruning cut area by 55% and power by 50%

by Paula Carolina Lozano Duarte, Georgios Zervakis +2 more

Co-Optimization of Analog Kolmogorov-Arnold Networks for Low-Power Function Approximation in Flexible Electronics

Error-aware training and multi-level pruning enable efficient on-sensor function approximation in flexible electronics for biosignals and ca


Wearable devices and Internet of Things (IoT) sensors require on-sensor processing of biosignals and environmental data, including computationally demanding operations such as nonlinear activation functions for neural network inference, sensor calibration curves to map raw readings to physical units, and signal preprocessing functions like logarithmic compression and power operations for feature extraction. These functions exhibit significant complexity, often involving transcendental operations and multivariate dependencies that are costly to implement digitally. Analog function approximation provides a power-efficient alternative by performing these computations in the analog domain, thereby reducing the energy overhead associated with analog-to-digital conversion and subsequent digital processing. Flexible Electronics (FE) present a particularly attractive platform for wearable applications due to mechanical flexibility and low-cost fabrication, but impose strict constraints on circuit density and power consumption, making efficient analog implementations critical but challenging. This work introduces Analog Kolmogorov-Arnold Networks (AKANs), developed via hardware-software co-optimization, to approximate these complex multivariate functions accurately under hardware imperfections. Our method incorporates circuit-level error modeling during training and applies pruning at both software and hardware levels to reduce area and power. Validation across multiple benchmarks demonstrates that our proposed pruning methodology not only reduces hardware cost but can also improve approximation accuracy by regularizing spline parameters. Results show up to 55% area and 50% power savings, with average reductions of nearly 30% across datasets, highlighting AKANs as a robust and generalizable framework for low-power analog function approximation in FE.


cs.AR 2026-06-29

SEADA automates precision assignment for mixed DNN accelerators

by Leandro Fiorin, Marco Ronzani +1 more

SEADA: An efficient methodology for optimizing mixed-precision DNNs on multi-precision spatial architectures

Analytical models plus entropy selection enable fast design exploration without full simulations.


Mixed-precision computation has been introduced in deep neural networks (DNNs) as an effective approach to reduce latency, energy consumption, and memory footprint. However, efficiently mapping mixed-precision networks onto multi-precision spatial architectures poses several challenges. These include determining the appropriate precision for each layer, balancing layer-wise accuracy sensitivity to quantization against architectural heterogeneity and system-level constraints, and accurately estimating the system-level cost of heterogeneous precision assignments. This work presents SEADA, an efficient methodology designed to address these challenges. SEADA comprises: (i) a configurable system-level analytical cost model of a multi-precision spatial accelerator architecture; (ii) a fast mapping tool that identifies near-optimal mappings of DNN workloads onto the target integer accelerator; (iii) analytical models for floating-point layers to estimate the overall benefits of mixed-precision execution; and (iv) a per-layer precision selection methodology based on bit-level entropy, enabling efficient assignment across multiple numerical precisions. SEADA's efficiency provides designers with a robust framework for the design-space exploration of multi-precision architectures.
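One plausible reading of "bit-level entropy" is the Shannon entropy of each bit position across a layer's quantized values. The sketch below computes that quantity and counts positions that carry information; the threshold rule and both function names are hypothetical and are not taken from SEADA, whose actual selection procedure the abstract does not describe.

```python
import math


def bit_entropies(values, width):
    """Shannon entropy (in bits) of each bit position, LSB first,
    over a list of unsigned integers."""
    count = len(values)
    entropies = []
    for position in range(width):
        p_one = sum((v >> position) & 1 for v in values) / count
        if p_one in (0.0, 1.0):
            entropies.append(0.0)  # a constant bit carries no information
        else:
            p_zero = 1.0 - p_one
            entropies.append(-(p_one * math.log2(p_one) + p_zero * math.log2(p_zero)))
    return entropies


def informative_bits(values, width, threshold=0.1):
    """Hypothetical heuristic: count bit positions whose entropy exceeds `threshold`."""
    return sum(1 for h in bit_entropies(values, width) if h > threshold)


layer_values = [0, 1, 2, 3]   # toy quantized weights that only use two bits
```

A layer whose upper bit positions are nearly constant would, under this heuristic, be a candidate for a narrower format.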


cs.AR 2026-06-29

LLM judges disagree with humans on hardware schematic quality

by Dhruv Kulkarni, Sai Manoj Pudukotai Dinkarrao

MultModLM: A multi-modal benchmark for Large-Language Model based hardware schematic generation

A benchmark of 99 RTL modules shows models make visually plausible but often non-functional drawings, with LLM evaluators agreeing almost no


Large language models (LLMs) have recently found application in several fields, including hardware definition and synthesis. However, most works at the intersection of LLMs and hardware generation focus on text-based tasks, creating a gap for multi-modal LLMs for RTL design. In this work, we introduce MultModLM, a benchmark for evaluating LLMs on the task of generating hardware schematics from RTL (Register Transfer Level) descriptions. The dataset consists of 99 diverse RTL modules spanning arithmetic, control, and state-based designs. To address the challenges of non-unique schematic representations, we propose a multi-stage evaluation framework combining rubric-based scoring, self-evaluation, cross-model assessment, blind evaluation, and human validation to enable exhaustive evaluation. Through experiments on state-of-the-art LLMs, we observe that while models can generate visually interpretable schematics, their functional correctness remains constrained. Furthermore, we find that LLM-based evaluators exhibit near-zero agreement with human raters, revealing, as a key finding, that LLM-as-a-judge paradigms are unreliable in structurally precise domains. These findings suggest that reliable evaluation of multi-modal hardware outputs remains an open challenge, motivating the need for more robust and domain-aware evaluation methodologies, as well as tools for structural evaluation, so as to enable formal equivalence checkers.

cs.AR 2026-06-26

CHIA turns AI co-design flows into directed cyclic graphs

by Angela Cui, Ferran Hermida-Rivera +10 more

CHIA: An open-source framework for principled, agentic AI-driven hardware/software co-design research

Loops connect simulators, RTL tools, and agents with isolation and fault tolerance for runs across hundreds of heterogeneous systems.

Agentic artificial intelligence shows great promise for radically improving the pace of innovation in hardware/software co-design research across computer architecture, systems, compilers, and VLSI. Thus far, however, applications of AI in these contexts have generally been demonstrated in isolated settings on small-scale problems, due to the difficulty of designing and deploying complex AI-infused hardware and software development workflows. This paper introduces CHIA, an open-source hardware/software co-design framework for agile and principled research on the application of AI to co-design. CHIA treats the productive construction and scalable deployment of the co-design flow itself as a first-class objective. In CHIA, agentic AI-driven hardware and software design flows are expressed as CHIA loops: directed cyclic graphs whose nodes execute various system-on-chip design tools, microarchitectural simulators, software build systems, AI models, evolutionary coding agents, and more. The CHIA library provides node implementations for many popular tools, including Chipyard, gem5, ChampSim, FireSim, Hammer (thus several commercial ASIC CAD tools), Vivado, AlphaEvolve, AdaEvolve, and many others. CHIA also provides a broad set of features to conduct principled science around these flows. These include isolation between AI models and hardware tools, profiling mechanisms, fault-tolerant execution, and reliability at scale across hundreds of heterogeneous systems (CPUs, FPGAs, GPUs, etc., across public cloud/on-prem.). To showcase CHIA, we present five CHIA loops as case studies: (1) automatic RTL-to-gem5 simulator alignment, (2) LLM-driven implementation of microarchitectural features in RTL, (3) agentic, IPC-aware critical path optimization, (4) evolutionary architectural discovery, and (5) maintainer-friendly agentic GitHub issue fixing.

cs.ET 2026-06-26

Memristor array implements full RV32I set for low-power MCUs

by Liam Splittgerber, Fabian Seiler +1 more

An Instruction Set Architecture for IMPLY-based Memristive Processing-in-Array

IMPLY operations and new addressing let storage and computation share the same non-volatile crossbar.

The push towards expanded ultra-low-power edge computing necessitates hardware capable of operating under extremely strict energy constraints. Traditional Complementary Metal-Oxide-Semiconductor (CMOS) microcontrollers are fundamentally limited in this domain by the von Neumann bottleneck and by the static power leakage inherent to volatile memory. Memristive In-Memory Computing (IMC) offers a promising solution to these inefficiencies by unifying data storage and computation into a single non-volatile component. However, the State of the Art (SoA) predominantly focuses on accelerators designed to be a co-processor for data-intensive computation. This leaves the prospect of standalone, general-purpose IMC microcontroller architectures underexplored. This thesis proposes such an architecture tailored for ultra-low-power edge devices, alongside an instruction set closely derived from the RV32I standard. Using the IMPLY stateful logic paradigm, a complete implementation of the proposed instruction set is provided, and the novel addressing schema required to support computation in the memristive crossbar array is described as well. Then, the energy use and other circuit-level metrics of the proposed architecture are evaluated through simulation and compared against those of traditional microcontrollers. Finally, the functional viability of the design is demonstrated through an application case study, describing how the proposed design could be used in an intelligent environmental sensor node.

cs.AR 2026-06-26

SPM cuts CGRA memory traffic eightfold

by María José Belda, Lara Orlandic +4 more

Evaluating Architectural Trade-offs in CGRAs: The Impact of Scratchpad Memory and Heterogeneity on Compute-Intensive Kernels

Homogeneous designs also reduce area 4.4x-8.2x and deliver 5x speedup on matrix tasks versus heterogeneous setups at 700 MHz.

Modern edge computing applications, particularly high-throughput stream processing like Vision Transformers (ViTs), demand massive spatial parallelism and efficient data movement under tight power and area constraints. Coarse-Grained Reconfigurable Architectures (CGRAs) offer a promising paradigm to balance performance, flexibility, and energy efficiency. This paper analyzes the impact of two critical CGRA design choices: processing element heterogeneity and local data reuse support. We evaluate essential computational kernels (Fast Fourier Transform (FFT) and General Matrix Multiply (GEMM)) alongside an end-to-end seizure detection transformer workload across two distinct configurations: a baseline homogeneous architecture and a heterogeneous evolution integrating specialized functional units with a Scratchpad Memory (SPM). Our evaluation demonstrates that the SPM significantly optimizes data movement, reducing memory traffic eightfold compared to a memory-less design. While the heterogeneous architecture achieves superior energy efficiency for data-shuffling tasks, the homogeneous design minimizes area overhead by 4.4x to 8.2x relative to state-of-the-art CGRAs. Furthermore, it sustains a 700 MHz operating frequency, enabling up to a 5x execution speedup over the heterogeneous configuration during matrix computations. Ultimately, this work provides an architectural roadmap for selecting CGRA fabrics based on the arithmetic intensity, performance goals, and resource envelopes of edge-scale workloads.

cs.AR 2026-06-26

M4 Pro GPU leaves cache displacement fixed by one CPU pass

by Faruk Alpay, Baris Basaran

Residual GPU Cache State on Apple M4 Pro

Large GPU memory use slows the next CPU traversal, but a second pass removes most of the penalty, showing predictable residual effects.

Apple silicon exposes unified CPU-GPU memory, but the cache state left after a completed GPU command is not documented. This paper characterizes that phase boundary on a 14-core Apple M4 Pro. We validate the measurement pipeline against unmodified STREAM 5.10 and BabelStream 5.0, then adapt an 8192-byte system-level-cache occupancy pattern to a synchronized Metal experiment. A GPU kernel touches 0 to 512 MiB and finishes before a 16 MiB CPU probe begins. The first CPU traversal is slower after large GPU footprints, while a second traversal removes most of the cost, showing residual shared-cache displacement rather than simultaneous DRAM contention. A separate matched-block experiment measures GPU slowdown under high-priority CPU traffic and finds background QoS close to baseline. Root PMU measurements and public IOReport histograms provide hardware grounding: they distinguish L1D refill sectors from software cache-line size, expose page-offset-dependent conflict behavior, and separate performance-core, efficiency-core, and AGX demand. The results identify a reproducible post-GPU cache-displacement window on M4 Pro and quantify a simple one-pass software recovery mechanism.

eess.SP 2026-06-26

Deep learning designs Doherty PA with 48-54% back-off efficiency

by Han Zhou, Haojie Chang +2 more

Inverse Design of Compact and Wideband Inverted Doherty Power Amplifiers Using Deep Learning

CNN-GA creates pixelated combiner that merges load modulation, matching and combining over 1.9-2.5 GHz in one GaN layout

This paper presents a deep learning-assisted methodology for the inverse synthesis of a compact, wideband inverted Doherty power amplifier (PA). Convolutional neural networks (CNNs) and genetic algorithms (GAs) are jointly employed to generate pixelated Doherty combiner networks that integrate load modulation, impedance matching, power combining, and phase compensation into a single structure. As a proof of concept, we design and fabricate a GaN HEMT Doherty PA with a pixelated output combiner. The prototype achieves a measured peak drain efficiency of 51%-63% and a 6-dB back-off efficiency of 48%-54% over 1.9-2.5 GHz. Within the same frequency range, the measured output power is 44+/-0.3 dBm. Furthermore, with digital predistortion (DPD) applied, the prototype circuit demonstrates an adjacent channel leakage ratio (ACLR) better than -53.2 dBc.

cs.AR 2026-06-26

Dynamic scheduling lifts sparse matrix multiply by 1.95×

by Xinrui Wu, Hanyu Wang +2 more

SegFold: Accelerating Sparse GEMM with a Fine-Grained Dynamic Dataflow

SegFold's fine-grained reuse detection and work remapping outperform every static dataflow on matrices of varied density and size.

Generalized sparse matrix-matrix multiplication (SpGEMM) is critical in many domains. Existing CPUs and GPUs, as well as specialized accelerators, rely on static dataflows (e.g., inner product, outer product, Gustavson, etc.). Each static dataflow sacrifices some data reuse opportunities and imposes constraints on load balance. To address this inefficiency, we extend the typical SpGEMM dataflows by considering dynamism. Specifically, we add fine-grained dynamic scheduling to optimize reuse and reduce resource contention. We also develop dynamic remapping of partially completed work to improve load balance and parallelism. These ideas are formalized into a specific dataflow called Segment. To demonstrate Segment, we codesign a SpGEMM accelerator called SegFold. SegFold includes a memory controller that identifies fine-grained reuse opportunities in a local window of the stationary input array and exploits them through dynamic work assignment. It also incorporates a merge network that routes inputs to appropriate processing elements (PEs) for computation while dynamically remapping the work assigned to each PE to balance load. Across diverse densities and matrix sizes, SegFold achieves a geometric-mean $1.95\times$ speedup over state-of-the-art SpGEMM accelerators and $5.3\times$ over the best static dataflow configuration, demonstrating that adding dynamism to the dataflow design space unlocks reuse and load-balance gains that no static scheduling choice can achieve in isolation.
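For readers unfamiliar with the static dataflows the abstract names, the following is a minimal sketch of the row-wise (Gustavson) formulation. It is an illustration only, not SegFold's Segment dataflow; the dict-of-dicts matrix layout and the function name `spgemm_gustavson` are assumptions made for this example.

```python
from collections import defaultdict


def spgemm_gustavson(a_rows, b_rows):
    """Row-wise (Gustavson) sparse product C = A @ B.

    Both inputs map row index -> {column index: value} (assumed layout).
    """
    c_rows = {}
    for i, a_row in a_rows.items():
        acc = defaultdict(float)  # sparse accumulator for row i of C
        for k, a_ik in a_row.items():
            # Each nonzero A[i, k] scales row k of B into row i of C.
            for j, b_kj in b_rows.get(k, {}).items():
                acc[j] += a_ik * b_kj
        if acc:
            c_rows[i] = dict(acc)
    return c_rows


A = {0: {0: 1.0, 2: 2.0}, 1: {1: 3.0}}
B = {0: {1: 4.0}, 1: {0: 5.0}, 2: {1: 6.0}}
C = spgemm_gustavson(A, B)
print(C)
```

The loop order is fixed before execution: row `i` of A is always processed whole, so any reuse of B rows across different A rows depends on the fixed schedule rather than on what is currently buffered. That is the kind of constraint the abstract attributes to static dataflows.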

cs.AR 2026-06-26

LLM agents convert C to synthesizable HLS-C via four-stage verifier

by Zhe Zhao, Hongbing Lang +4 more

Evidence-Driven LLM Agent for C-to-Synthesizable-C Conversion and Verification

Mismatch localization chain and isolated evidence signals let the workflow finish the full pipeline where earlier models stop.

Software-compilable C programs routinely fail to complete the four-stage pipeline of a high-level synthesis (HLS) toolchain -- compilation, C simulation (CSim), synthesis, and C/RTL co-simulation (CoSim) -- because HLS accepts only a synthesizable subset of C (HLS-C). Yet most existing large language model (LLM) systems built for HLS code repair only cover the early pipeline stages and feed raw tool logs directly to the model, yielding brittle and hard-to-reproduce fixes. We formulate C-to-HLS-C conversion as a closed-loop generation-verification-diagnosis-repair problem on an HLS tool (Xilinx Vitis), contributing three components: an end-to-end workflow of cooperating agents closed by the four-stage verifier under strict evidence isolation; a Progressive Mismatch Localization Chain (PMLC) that localizes CSim/CoSim mismatches through log normalization, AST backward slicing, and dual-trace instrumentation; and a typed-query, two-stage evidence RAG backed by a self-evolving, family-routed repair-card pool. Experimental results show that the proposed workflow substantially outperforms all comparable state-of-the-art models.

cs.AR 2026-06-26

GRAINS runs genome graphs inside SSDs for up to 47.8x speedup

by Nika Mansouri Ghiasi, Harun Mustafa +9 more

GRAINS: Storage-Aware Algorithm-Architecture Co-Design Enabling High-Performance and Low-Cost Graph-Based Genome Analysis

Storage-aware batching and repurposed flash scheduling cut data movement that dominates large genomic graph analysis.

Graph-based representations of genome sequences have emerged as a powerful approach for representing massive genomic databases in an expressive and efficient way. Despite their benefits, analysis on large-scale genome graphs incurs significant data movement overhead from the storage system due to accessing large amounts of low-reuse data. Processing data directly inside the storage device can be a fundamental solution for mitigating this overhead. However, none of the existing tools for graph-based genome analysis can be efficiently used inside the storage system due to the limited internal hardware resources in modern SSDs. At the same time, prior storage-centric systems developed for (i) traditional, linear non-graph-based genome analysis or (ii) conventional, non-genomic graph analysis are not suitable for the unique data structures and access patterns of graph-based genome analysis. We propose GRAINS, the first system for analysis with large-scale genome graphs in storage. Through our detailed examination of typical analysis pipelines that operate on genome graphs, we perform storage-aware algorithm-architecture co-design to (i) make these pipelines more storage-friendly and (ii) further improve performance, energy-efficiency, and cost via in-storage and in-flash processing. GRAINS's co-design is based on three key aspects. First, we propose a new batching and execution flow, based on unique features of genome graphs. Second, via in-flash and in-storage processing, we avoid transferring low-reuse flash pages. Third, to leverage the full parallelism of flash dies, we design an effective, yet lightweight, scheduling technique, enabled by re-purposing the existing SSD structures. GRAINS provides 2.7x-47.8x speedup (4.4x-31.6x energy reduction) over the state-of-the-art software baselines, and 1.5x-17.0x speedup (3.1x-20.7x energy reduction) over a hardware-accelerated baseline.

cs.AR 2026-06-25

NEMS mechanisms add physical security to chip packaging

by Himanandhan Reddy Kottur, Pavanbabu Arjunamahanthi +4 more

Nanoelectromechanical Systems (NEMS) for Hardware Security in Advanced Packaging

They use mechanical variability for tamper detection and low-power authentication where digital methods are vulnerable.

As hardware security threats escalate across semiconductor manufacturing and advanced packaging, there is a growing need for novel physical mechanisms to counter sophisticated attacks such as tampering, counterfeiting, and supply chain infiltration. This paper presents Nanoelectromechanical Systems (NEMS) as an emerging class of hardware security primitives that enable physical assurance, tamper detection, and authentication at the device level. Leveraging mechanisms such as NEMS-based Physically Unclonable Functions (PUFs), shape memory materials, resonance-based fingerprints, and physical unlocking architectures, these systems offer enhanced resilience to reverse engineering, side-channel attacks, and environmental degradation. By harnessing mechanical unpredictability and fabrication-induced nanoscale variability, NEMS technologies introduce a physically robust and low-power alternative to conventional digital security methods. Their seamless integration into standard semiconductor workflows paves the way for scalable, verifiable, and secure solutions across defense, aerospace, critical infrastructure, and consumer electronics.

cs.DB 2026-06-25

CVM cost calibration recovers up to 48 percent performance

by Qihan Zhang, Mengyuan Li +1 more

Query Cost Model Calibration in Confidential Virtual Machines

By modeling data movement and translation overheads with simple proxies, the adjusted optimizer narrows the gap with standard VMs and someti

With the growing adoption of Confidential Computing, running databases in confidential virtual machines (CVMs) such as AMD SEV-SNP has become an attractive way to protect sensitive cloud data with minimal changes to legacy DBMSs. However, analytical queries in such CVMs often suffer substantial overhead, and prior database work has largely stopped at benchmarking these slowdowns rather than optimizing them. We show that this problem stems from a hardware-software mismatch: query optimizers still rely on KVM-oriented (non-encrypted VM) cost assumptions that no longer hold in CVMs. To address this, we propose a lightweight CVM-aware cost calibration. It models two dominant sources of optimizer-facing overhead: data movement and RMP-related translation using simple physical proxies already available to the optimizer. Experiments show that the calibration significantly narrows the KVM/CVM performance gap, recovering up to 48 percent performance and even outperforming the KVM baseline on some workloads.

cs.LG 2026-06-25

Auto-computes validated speed-of-light bounds from model code

by Qijing Huang, Sana Damani +10 more

SOLAR: AI-Powered Speed-of-Light Performance Analysis

LLM translates source to intermediate form for analytical computation of theoretical minimum execution times at varying detail levels.

How fast could a deep-learning model run on target hardware, and how far is today's implementation from that limit? These questions are central to software, hardware, and algorithm optimizations. Speed-of-Light (SOL) analysis answers them by computing a workload's theoretical minimum execution time on a given architecture. Yet deriving SOL bounds remains manual, error-prone, and disconnected from rapid model development. To close this gap, we introduce SOLAR, a framework that automatically derives validated SOL bounds from PyTorch and JAX source code. SOLAR leverages both generative and deterministic components in its flow: an LLM frontend translates any source programs into an executable Affine Loop IR, validated by output comparison; a deterministic flow lifts the IR into an einsum graph; and an analytical backend computes unfused, fused, and cache-aware SOL bounds. SOLAR provides comprehensive operator and language coverage, produces validated bounds with zero observed SOL violations, and offers multi-fidelity analysis that tightens bounds and surfaces optimization insights. We evaluate SOLAR across KernelBench, JAX/Flax models, and robotics workloads. These experiments demonstrate four use cases: headroom analysis at multiple fidelity levels, identifying optimization opportunities, cross-platform exploration, and inverse-roofline hardware provisioning.
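As a rough illustration of what an unfused speed-of-light bound looks like, the sketch below applies the standard roofline estimate (the slower of compute time and memory time) to a single matrix multiply. It is not SOLAR's implementation; the function names and the device figures (100 TFLOP/s, 1 TB/s) are hypothetical values chosen for the example.

```python
def sol_time_seconds(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Roofline-style lower bound: the slower of compute and memory time."""
    return max(flops / peak_flops, bytes_moved / peak_bandwidth)


def matmul_unfused_sol(m, n, k, bytes_per_elem, peak_flops, peak_bandwidth):
    """Bound for C[m, n] = A[m, k] @ B[k, n], each operand touched once."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return sol_time_seconds(flops, bytes_moved, peak_flops, peak_bandwidth)


# Hypothetical device: 100 TFLOP/s peak compute, 1 TB/s peak bandwidth.
PEAK_FLOPS = 100e12
PEAK_BW = 1e12

square = matmul_unfused_sol(1024, 1024, 1024, 2, PEAK_FLOPS, PEAK_BW)
skinny = matmul_unfused_sol(1, 1024, 1024, 2, PEAK_FLOPS, PEAK_BW)
print(f"square: {square:.3e} s, skinny: {skinny:.3e} s")
```

Under these assumed figures the square case is limited by compute and the matrix-vector case by memory traffic, which is why a bound of this kind needs both terms.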

cs.AR 2026-06-25

CVA6-RT achieves 12-cycle interrupt latency on open-source RV64 core

by Enrico Zelioli, Christopher Reinwardt +5 more

CVA6-RT: an Open-Source Time-Predictable RV64 Processor for Mixed-Criticality Systems

TLB locks, reconfigurable scratchpads and hardware context stacking bound timing variability for mixed-criticality use

This work presents CVA6-RT, a real-time micro-architectural extension of the CVA6 core that bounds worst-case latency and reduces the variability of task execution time. CVA6-RT implements the rv64gch ISA and features advanced support for real-time execution, including TLB partitioning and locking for predictable address translation, a dynamically reconfigurable scratchpad mode in the L1 caches for deterministic memory access, and low-latency interrupt handling via an enhanced interrupt controller combined with hardware-assisted context stacking. With real-time features enabled, CVA6-RT achieves an interrupt latency of 12 cycles, comparable to that of simpler Arm Cortex-M microcontrollers, and 10x lower than the baseline CVA6 core.

cs.AR 2026-06-25

Firmware thermal hints pre-position rails in 3.5D packages

by Chi Fei Chung (Dollarchip Technology Inc.), Nikolai Nedovodin (STARGA Inc.)

Toward Mitigating Process-Induced Performance Degradation in 3.5D Heterogeneous Packages via Pre-Silicon Firmware Co-Optimization

20-50 ms look-ahead yields R squared 0.9911 correlation and keeps spectral drift inside 21 percent of tolerance.

This paper presents a pre-silicon analysis of XRM-SSD V24/V7.0, a physics-aware predictive firmware scheduling layer for Intel's 3.5D heterogeneous integrated packages (Foveros Direct 3D + PowerVia + EMIB-T + UCIe + HBM5). Using detailed thermal-electrical co-simulation over a 90,000-step LLM inference dataset, we show that proactive workload-density-driven thermal hinting (20-50 ms look-ahead) enables pre-positioning of PowerVia voltage rails. Key results include a thermal-load correlation of R^2 = 0.9911, compensated CPO spectral drift below 0.36 nm (21% of TSMC tolerance budget), and HBM leakage current clamped below 1 MB/hr across all load states. Monte Carlo analysis (N=2,000 trials) confirms robustness under process variation. V7.0 extends the framework to multi-tile architectures with an N x N thermal coupling matrix and two-pole kernel. The approach demonstrates potential for 20-30% released compute and 65-68% EDA guard-band reduction. All metrics are engineering projections from pre-silicon characterization. Silicon validation on Intel 18A platforms is pending. This work highlights firmware-hardware co-optimization as an effective approach to mitigating physical limits in advanced 3.5D packaging.

cs.AR 2026-06-25

Open-source RISC-V flow produces working student microcontrollers in silicon

by Enrico Zelioli, Philippe Sauter +6 more

Croc: Training the Next Generation Chip Designers on Domain-Specific End-to-End Open Source Silicon

Croc platform supported 65 students across 33 projects, five tapeouts, and one characterized baseline chip with metrics matching closed-sour

The demand for domain-specific systems-on-chip (SoCs) in artificial intelligence, robotics, and automotive systems is increasing the need for engineers with hands-on expertise in very-large-scale integration (VLSI) design from architecture specification to fabricated silicon. Yet, most VLSI courses rely on restrictively licensed electronic design automation tools and process design kits (PDKs), as well as closed-source hardware designs. We present an end-to-end open-source domain-specific SoC design and fabrication flow built around Croc, a highly customizable RISC-V platform. Built from open-source SystemVerilog intellectual property blocks and integrated with an end-to-end open-source design flow in a 130nm open PDK, Croc enables tapeout projects supporting multiple domain customization options: instruction-set extensions, accelerator co-processors, and peripherals. In our first open-source course experience using Croc, 65 students completed 33 projects, 30 of which produced manufacturable layouts. Eighteen designs were selected as tapeout candidates, and five were fabricated. A first baseline chip has already been successfully characterized in silicon, demonstrating microcontroller-class functionality and implementation metrics comparable to those of products with similar functional complexity completed with closed-source toolchains and PDKs.

cs.AR 2026-06-25

Merged MSDF multiply-add cuts FPGA energy for U-Net by 9x

by Muhammad Usman, Yousef Sadegheih +1 more

Energy-Efficient CNN Acceleration with MSDF Digit-Serial Arithmetic on FPGA

Single unified pipeline replaces separate startup delays and reaches 15.14 GOPS/W versus 1.93 GOPS/W on CPU.

This paper presents an energy-efficient hardware acceleration of the convolutional layers in the U-Net architecture for image segmentation, implemented on FPGA. While digit-serial arithmetic, particularly most-significant-digit-first (MSDF) techniques, offers a compact hardware footprint, it suffers from initial latency before producing the first output digit. This delay accumulates in cascaded operations like multiplication followed by addition, where each unit introduces its own startup overhead. To overcome this, we propose a merged multiply-add (MMA) architecture that fuses these operations into a unified pipeline. Instead of incurring separate delays, the MMA introduces a single streamlined latency per iteration, shorter than the combined latency of conventional cascaded units, resulting in enhanced throughput and efficiency. The MMA units are designed to process spatial input depths in parallel, achieving significantly higher performance than both standalone MSDF-based and conventional designs. We evaluate the proposed design using U-Net as a target application. Despite operating at a lower frequency than a CPU, the FPGA-based accelerator achieves up to an order of magnitude higher energy efficiency, delivering up to $15.14$ GOPS/W compared to $1.93$ GOPS/W for CPU-based inference. The design also shows approximately $9\times$ reduction in energy consumption compared to MSDF-based FPGA implementations. These results highlight the efficacy of the merged arithmetic approach for resource-constrained, latency-sensitive edge applications in medical imaging and computer vision.
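The "up to an order of magnitude" efficiency claim can be checked against the two figures quoted in the abstract. The short calculation below uses only those reported numbers; the variable names are ours.

```python
# Energy-efficiency figures quoted in the abstract, in GOPS/W.
fpga_gops_per_watt = 15.14
cpu_gops_per_watt = 1.93

# Ratio of FPGA efficiency to CPU efficiency.
efficiency_ratio = fpga_gops_per_watt / cpu_gops_per_watt
print(f"FPGA vs CPU energy efficiency: {efficiency_ratio:.2f}x")
```

The ratio comes out a little under 8x, which is consistent with the abstract's "up to an order of magnitude" wording. The separate figure of approximately 9x is a different comparison: energy consumption against MSDF-based FPGA implementations.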

cs.AI 2026-06-25

Multi-agent engine evolves compression outperforming human designs

by Jiangwei Zhang, Wen Sun +8 more

Agentic evolution of physically constrained foundation models

Knowledge-graph guided search yields methods that cut memory use 75 percent while holding accuracy loss to 0.64 percent on a 235B model.

Artificial intelligence increasingly drives automated scientific discovery, yet contemporary generalist agents lack physical grounding, frequently hallucinating hardware-incompatible designs. Here, we present a physically grounded, multi-agent discovery engine that autonomously architects hardware-compliant computing systems. Anchored by an Evolutionary Knowledge Graph structuring past scientific innovations, the framework extracts an "algorithmic Chain-of-Thought" to transform blind stochastic search into directed structural evolution. Applied to the extreme testbed of foundation model deployment, the engine evolved two hardware-aware compression methodologies surpassing human-engineered heuristics: Q-Enhance mitigates long-context accuracy loss in dense models, and MoE-Salient-AQ outperforms state-of-the-art manual sparse Mixture-of-Experts designs by 3.7% at sub-3-bit regimes. Utilizing a bandwidth-efficient Sensitivity Profile, we successfully deployed a massive 235-billion-parameter model onto a constrained dual-A100 server, reducing memory requirements by 75% with a marginal 0.64% accuracy degradation. By transforming unconstrained combinatorial search into knowledge-driven autonomy, this establishes a scalable hardware-software co-design paradigm for machine-driven discovery within strict physical boundaries.

cs.AR 2026-06-25

CPU caches enable up to 13.9x faster LLM inference

by Wanning Zhang, Tongzhou Gu +3 more

Cache-Resident LLM Inference in GB-Scale Last-Level Caches

Decoupled weight and attention domains keep model weights on-chip in GB-scale last-level caches.

Large language model (LLM) inference is increasingly dominated by data movement across the memory hierarchy. Recent 3D-stacked cache technologies have enabled GB-scale last-level caches in modern server CPUs, making it possible to keep reusable model weights on chip and exploit cache bandwidth and latency. Achieving this regime is not straightforward: deeper pipelining for weight residency increases in-flight requests and KV-cache footprint, while cache-resident operators make operator-boundary synchronization a visible bottleneck. We present a cache-resident execution model for inference on hierarchical-memory clustered systems. The model separates weight-centric operators from attention and KV-cache management into dedicated resource domains, keeping reusable weights cache-resident while scaling KV capacity independently of pipeline depth. It also relaxes synchronization from operator boundaries to true sub-operator dependencies, reducing coordination overhead in the cache-resident regime. We instantiate this model on a multi-socket CPU cluster with a weight-attention decoupled architecture, locality-aware placement, and a specialized static runtime. The prototype substantially outperforms equally provisioned llama.cpp. On deployed Llama-3.2-3B and Llama-2-7B configurations, it achieves 2.04x-11.51x speedup on time-per-output-token (TPOT). Under a validated analytical model, it further reaches up to 13.9x TPOT speedup across model sizes, context lengths, and batch sizes. These results show that commodity CPUs with GB-scale last-level caches can support efficient LLM inference when execution is organized around cache residency, decoupled state management, and dependency-aware coordination.

cs.DC 2026-06-25

Networked FPGAs realize million-p-bit probabilistic computer

by Navid Anjum Aadit, Xiuqi Zhang +11 more

Programmable Probabilistic Computer with 1,000,000 p-bits

Performance matches a monolithic reference once boundary exchanges exceed a single frequency ratio, giving a concrete scaling rule.

Probabilistic computers built from p-bits have been proposed as hardware accelerators for sampling and optimizing Ising models, but existing systems have been confined to a single chip, capped by its capacity and memory bandwidth. Here we break this limit by networking FPGAs into a single Ising machine far larger than any one device could hold, realizing a programmable probabilistic computer with one million p-bits. The machine performs Gibbs sampling at over a trillion flips per second while keeping every coupling weight in local on-chip memory. During execution, devices exchange nothing but 1-bit boundary states. This architecture exposes a question fundamental to any distributed sampler: how frequently boundary information must be refreshed for a partitioned machine to behave as an unpartitioned one. Using three-dimensional Edwards-Anderson spin glasses, we show that the answer is set by a single timing ratio, eta = f_comm/f_p-bit, of the boundary-exchange frequency to the local p-bit update frequency. Above a topology-dependent threshold, the distributed machine matches a monolithic GPU reference. Below it, residual energy still decays as a power law but with a reduced exponent, turning parallelism into a quantifiable throughput-accuracy tradeoff. A theoretical cluster mean-field model reproduces the same behavior, showing that this tradeoff is a universal property of partitioned stochastic dynamics. These results provide a programmable million-p-bit platform, demonstrated across spin glasses, Max-Cut, and Boolean satisfiability, together with a quantitative design rule for scaling probabilistic computers beyond the single-chip limit.
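To make the two quantities in the abstract concrete, the sketch below shows a textbook sequential Gibbs sweep over ±1 spins and the timing ratio eta = f_comm/f_p-bit. It is a plain software illustration, not the authors' FPGA design; the function names, the adjacency-dict coupling layout, and the example frequencies are assumptions.

```python
import math
import random


def eta(f_comm, f_pbit):
    """Ratio of boundary-exchange frequency to local p-bit update frequency."""
    return f_comm / f_pbit


def gibbs_sweep(spins, couplings, beta, rng):
    """One sequential Gibbs sweep over +/-1 spins.

    couplings[i] maps neighbour index j -> weight J_ij (assumed layout).
    """
    for i in range(len(spins)):
        local_field = sum(w * spins[j] for j, w in couplings[i].items())
        # Conditional probability of spin i being +1 given its neighbours.
        p_up = 0.5 * (1.0 + math.tanh(beta * local_field))
        spins[i] = 1 if rng.random() < p_up else -1
    return spins


# Two ferromagnetically coupled spins at low temperature (large beta).
rng = random.Random(0)
spins = gibbs_sweep([1, -1], [{1: 1.0}, {0: 1.0}], beta=50.0, rng=rng)
ratio = eta(f_comm=1e6, f_pbit=1e8)  # hypothetical frequencies
print(spins, ratio)
```

In a partitioned machine each device would run sweeps like this on its own spins and refresh its view of neighbouring devices' boundary spins only at the exchange frequency, so a smaller `eta` means more local updates are made against stale boundary values.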

cs.AR 2026-06-25

LLM and knowledge graph create traceable safety assertions for faults

by Xuanyi Tan, Arjun Chaudhuri +2 more

SafeGen: LLM-Driven Assertion Generation and Fault Criticality Evaluation for Functional Safety

Links design specs to assertions via HyperKG for semantic fault grading with traceable reasoning in automotive chips.

With advances in autonomous driving and electric vehicle technologies, functional safety has become a critical requirement in automotive chip design. Traditional simulation-based fault analysis is often overly conservative at the module level and fails to accurately reflect fault criticality. This paper presents SafeGen, an LLM-driven, formal-verification-assisted framework for functional-safety-oriented fault criticality assessment. SafeGen leverages large language models (LLMs) and a document-level Hyper Knowledge Graph (HyperKG) that incorporates Failure Modes, Effects, and Diagnostic Analysis (FMEDA) guidelines to extract verifiable specifications from design and safety documents and evaluate their relevance to overall system safety. The HyperKG is further enriched with register-transfer-level (RTL) information to guide the generation of Functional Safety Assertions (FSAs) that are both semantically grounded and design-aware. Each assertion is linked to its corresponding specification, enabling traceable reasoning throughout the assessment process. A gate-to-RTL fault-mapping mechanism supporting both stuck-at and bridging faults, combined with formal property verification (FPV), enables semantic-level fault criticality grading based on specification-linked assertion violations. A digital-physical co-simulation platform for a field-oriented control (FOC) system is developed to validate SafeGen. Experimental results demonstrate that SafeGen generates higher-quality assertions than existing LLM-based assertion generation frameworks while providing greater semantic interpretability in fault criticality assessment compared with traditional simulation-based approaches.

cs.RO 2026-06-24

Double-spiral joint shapes directional stiffness for robot hands

by Haoyang Li, Yibo Wen +3 more

PDS Joint: A Parametric Double-Spiral Joint Tailored for Dexterous Hands

Parametric design with asymmetry ratio controls multi-mode behavior and MLP calibration cuts sensing error by 41.6 percent.

Compliant joints can embed safety and adaptability into dexterous hands, but achieving large-stroke anthropomorphic motion while maintaining joint-specific, direction-dependent stiffness and reliable proprioception remains challenging. This paper presents the PDS joint, a parametric double-spiral (PDS) compliant joint that enables systematic shaping of directional stiffness across multiple deformation modes, including flexion/extension, abduction/adduction, and pronation/supination. We instantiate the joint using Archimedean and logarithmic spiral templates for different hand joints and introduce an asymmetry ratio to tailor stiffness distributions for both grasp stability and hyperextension resistance. To make the joint practically usable under large deformation, we co-design embedded inductive proprioception and propose a learning-based calibration pipeline that maps raw inductive signals to joint states using ArUco-marker tracking. Experiments characterize the stiffness landscapes across geometric parameters and demonstrate a non-monotonic dependence of lateral support on asymmetry, indicating the importance of principled parameter tuning. For joint-state estimation in the most challenging abduction/adduction motion, a learned multilayer-perceptron (MLP) mapping reduces the error by 41.6% compared with conventional curve fitting. Finally, we integrate the proposed joints into an open-source dexterous hand as a demonstration platform, on which the hand grasps a set of nine everyday objects and performs safe, contact-rich human-involved interactions.

cs.DC 2026-06-24

Thermal balancing restores orbital AI training speed

by Shuyi Chen, Zhengchang Hua +2 more

Hot AI in Cold Space: Thermal-Crosstalk-Aware Scheduling for Sustainable Orbital AI Clusters

Migrating workloads to cooler nodes cuts throttling and extends hardware life to offset launch costs in dense space clusters.

Terrestrial AI training faces an unsustainable energy and water crisis, positioning Orbital Data Centers (ODCs) as a "zero operational carbon" alternative. However, the sub-$10\mu\text{s}$ communication latency required for synchronized scientific workloads, such as distributed Large Language Model (LLM) training, forces ODCs into extreme physical density, triggering a critical "Proximity-Thermal Paradox." As these high-density systems scale into Monolithic Structures or Proximity Swarms, they suffer from intense thermal-fluid crosstalk (heat traps in shared cooling loops) and thermal-radiative crosstalk (mutual heating that blocks deep-space cooling radiators). If left unmitigated, this persistent heat stagnation not only triggers severe thermal throttling that degrades training throughput, but also induces severe thermal fatigue, drastically shortening hardware lifespans and generating premature space e-waste. To make orbital AI truly sustainable, this position paper challenges traditional uniform load-sharing. We propose the Thermal-Aware Heterogeneity Thesis, which treats spatial cooling variances as a primary resource management dimension. Building on this, we introduce Thermal-Load Balancing (TLB), a software framework that dynamically migrates these intensive workloads to the coolest available units based on instantaneous fluid temperatures or absorbed radiation. Our analysis demonstrates that TLB resolves thermal bottlenecks to restore Model Flops Utilization (MFU), while simultaneously reducing physical thermal stress. Extending the operational lifespan of orbital hardware is crucial to amortize the massive embodied carbon of rocket launches, outlining a necessary pathway to scale orbital AI without accelerating e-waste.

cs.LG 2026-06-23

Scaling law predicts energy for multi-GPU transformer fine-tuning

by Mansour Zoubeirou a Mayaki

The Energy Consumption of Transformer Fine-Tuning: A Roofline-Inspired Scaling Model

Roofline-inspired model with hardware efficiency factor forecasts consumption across model sizes and parallel strategies.

Transformer-based models underpin modern natural language processing but incur rapidly growing computational and energy costs. As training scales in both model size and parallelism, accurately predicting energy consumption has become critical for sustainable and cost-aware system design. We present a framework for modeling the energy consumption of Transformer training on multiple GPUs. Using controlled architectural sweeps of BERT models, we relate measured energy to lightweight proxies for compute, memory traffic, and hardware efficiency. Inspired by roofline models, our approach incorporates a speedup-based hardware-efficiency factor that captures the effects of tensor parallelism and fully sharded data parallelism. We derive a scaling law model that accurately predicts training energy across heterogeneous configurations.

physics.ins-det 2026-06-23

Tiny single-layer ViT reduces detector data across hardware stages

by Abhilasha Dave, Weijian Zheng +4 more

HeteroViT: A Versatile Single-Layer Vision Transformer Concept, Co-Designed for Distributed Real-Time Data Reduction on Scientific Detectors

One backbone handles classification and rare-event detection while mapping to progressive reduction from sensor to processor.

Next-generation X-ray detectors generate data faster than any system can affordably store or process. LCLS-II, the upgraded Linac Coherent Light Source at SLAC, produces data on the order of terabytes per second, with raw-data transfer and storage projected to be prohibitively costly, even though much of the data is not scientifically useful. This concept paper focuses on two major points. The first is versatility: a deliberately tiny, single-layer Vision Transformer (ViT) is enough to serve distinct scientific quick-evaluation tasks. We demonstrate this on two very different problems: (a) a supervised hit/miss/maybe classification on the CSPAD dataset, made to resemble ePixUHR-like detector frames, and (b) a self-supervised latent space for rare-event detection in X-ray diffraction, together spanning two learning paradigms, two output types, and two detector modalities with one small backbone. The second is hardware co-design: because the ViT's blocks are structurally uniform, the model maps cleanly onto the heterogeneous hardware already present in the LCLS detector pipeline (ASIC -> FPGA -> GPU) under a simple rule (one ASIC is one token), so that the data is reduced progressively at each stage and a keep/discard decision is produced in real time at the edge. The two claims reinforce each other: versatility is precisely what justifies freezing the front-end in silicon, since a reusable front-end is only worth committing to hardware if it serves many tasks. We are explicit that this is a concept supported by early software analysis, not a hardware demonstration. The natural and primary next phase is the hardware implementation of this distributed pipeline. The decisive evidence still owed (an end-to-end latency budget, ASIC feasibility of the in-sensor embedding, and the false-negative behavior that matters for a data veto) defines that program. HeteroViT is our first step toward it.

cs.ET 2026-06-23

16-bit LFSR sets firing probability in open stochastic LIF neuron

by Poornima Kumaresan, Santhosh Sivasubramani

An Open-Source LFSR-Based Stochastic Leaky Integrate-and-Fire Neuron in SkyWater 130 nm: Design, Stochastic Characterisation, and Rate Coding

Eight-entry table and leaky integrator deliver monotonic rate coding and controlled randomness in 130 nm standard-cell CMOS.

Stochastic spiking neurons trade exact arithmetic for controlled randomness, lowering area and tolerating input noise, which suits event-driven edge hardware. We present a compact, configurable stochastic leaky integrate-and-fire neuron in standard-cell CMOS on the SkyWater 130 nm process, released openly. A 16-bit configurable-polynomial linear-feedback shift register drives an eight-entry programmable activation table that sets a Bernoulli firing probability, and a saturating 16-bit leaky integrator with a programmable threshold and a refractory period of zero to seven cycles produces the spike train. All parameters are set through a sixteen-register serial interface, and the neuron runs from parallel inputs or entirely from the register file. From a model checked bit-exact against the register-transfer code, the period is 65535 states for a maximal-length polynomial and 63 for the shipped default, the eight-bit comparison value is uniform over the full period, and the per-entry firing probability equals the table value divided by 256. We also characterise a property a system-level model would not expose: the comparator output is serially correlated at short lags, with a negative lobe near lag eight, because the compared byte shifts by one bit each cycle; subsampling every sixteen cycles restores whiteness. Rate-coding sweeps show monotonic control of the output rate by the input weight and the threshold, and the refractory period caps the rate at one spike per refractory-plus-one cycles. The neuron occupies about 10,600 square micrometres at 70 per cent utilisation on a single Tiny Tapeout tile, meets 50 MHz timing with positive margin, and passes eighteen directed cocotb tests at register-transfer and gate level. All results are pre-silicon, from simulation and the open flow. The neuron is an openly released companion to a four-block neuromorphic suite reported separately.
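The LFSR-plus-comparator mechanism in this abstract can be illustrated with a minimal Python sketch. It is not the released design: the tap set (16, 14, 13, 11) is a textbook maximal-length polynomial chosen here as an assumption, and comparing the low byte with `<` is likewise an assumed comparator convention. The names `lfsr16_step`, `lfsr_period`, and `fire_count` are hypothetical.

```python
def lfsr16_step(state: int) -> int:
    """Advance a 16-bit Fibonacci LFSR by one cycle.

    Taps 16, 14, 13, 11 (x^16 + x^14 + x^13 + x^11 + 1) are an assumed,
    textbook maximal-length choice; the paper's polynomial is configurable.
    """
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def lfsr_period(seed: int = 0xACE1) -> int:
    """Count cycles until the register returns to its (non-zero) seed."""
    state, steps = lfsr16_step(seed), 1
    while state != seed:
        state = lfsr16_step(state)
        steps += 1
    return steps


def fire_count(table_value: int, seed: int = 0xACE1) -> int:
    """Count cycles in one full period whose low byte is below table_value.

    Using the low byte and a '<' comparison is an assumption about the
    comparator, made only for illustration.
    """
    state, fires = seed, 0
    for _ in range(lfsr_period(seed)):
        if (state & 0xFF) < table_value:
            fires += 1
        state = lfsr16_step(state)
    return fires
```

With this polynomial the register visits every non-zero 16-bit state once, so a table value of 64 fires on 16383 of 65535 cycles, which is within rounding of 64/256. The sketch counts over a whole period and therefore says nothing about the short-lag serial correlation the abstract reports.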

cs.CR 2026-06-23

CUDA latency maps create certificates for cloud GPU identity

by Faruk Alpay, Taylan Alpay

Unprivileged Topology Certificates for Cloud GPU Attestation

Software-only measurements attest hardware class and location within 44 km using stable per-SM fingerprints verifiable without a GPU.

Cloud GPU tenants receive a model name and a region, but cannot directly inspect the physical accelerator that runs their job. We present a software-only attestation primitive for this setting. A CUDA probe measures an SM-by-memory-region latency matrix using physical SM labels and dependent global loads. A streaming reducer commits sufficient statistics, configuration, code hashes, network evidence, and a compressed raw data archive into a certificate that a verifier can check without a GPU. The certificate supports three claims. First, the per-SM latency map is a stable physical fingerprint. Over a six-hour full-load RTX 5090 run, its median temporal jitter is 0.09 cycles, while shape-only leave-one-out classification separates distinct Blackwell dies with 100.0% accuracy. Second, cache-bypassing HBM sweeps recover hardware-class topology across generations, including a unified Volta V100 memory domain, a two-way Hopper H200 L2 split, and a Blackwell B200 two-die NV-HBI package whose 74/74 SM partition carries a 30-cycle, 15.5 ns cross-die penalty. Third, public network landmarks bind the same certificate to a coarse location. In the B200 run, 169 RIPE Atlas probes place the server within 44 km of its claimed datacentre and reject all 11 decoy sites. Together, these measurements check cloud-GPU identity, class, and coarse location without privileged access or a vendor key.

cs.AR 2026-06-23

Golden models lift LLM Verilog repair rate from 54% to 86%

by Yihan Wang, Cheng Liu +5 more

VeriPilot: An LLM-Powered Verilog Debugging Framework

Internal variable alignment and CDFG tracing let the model locate root causes distant from test outputs.

Verilog debugging remains one of the most time-consuming stages in digital circuit design. Recent advances in Large Language Models (LLMs) have enabled automated debugging; however, most existing approaches rely solely on test outputs and compiler feedback in an end-to-end manner, limiting their effectiveness on complex bugs. A key challenge is that the root cause of an error may be far removed from its observable outputs, making it difficult for LLMs to trace long dependency chains in code. This challenge is further exacerbated in large codebases, where long context lengths hinder efficient reasoning. To address these limitations, we propose VeriPilot, an LLM-powered debugging framework that leverages golden reference models to enable fine-grained bug localization and repair. VeriPilot goes beyond output-level comparison by aligning internal variable semantics between the Verilog design and its corresponding golden model through LLM-based analysis. It then performs step-by-step signal tracing using Control-Data-Flow Graphs (CDFGs) derived from static analysis, identifying a minimal set of suspicious code regions along with their correct counterparts from the golden model. These structured insights are subsequently provided to the LLM to guide reasoning and automated code repair. Experimental results on the Comprehensive Verilog Design Problems (CVDP) benchmark from NVIDIA demonstrate that VeriPilot improves the repair success rate of GPT-4o from 54.3% to 85.71%, significantly enhancing both bug localization accuracy and repair effectiveness for complex Verilog designs. The source code and benchmark are publicly available on GitHub at https://github.com/YihanWn/VeriPilot.git.

cs.AR 2026-06-23

Balanced chunking cuts LLM prefill latency 76% on wafer chips

by Zichuan Wang, Huizheng Wang +7 more

MOCAP: Wafer-Scale-Chip-Oriented Memory-Orchestrated Chunked Pipelining Framework for Prefill-Only LLM Inference

KV cache reallocation and latency-aware partitioning remove imbalance from causal attention to raise throughput and supported length.

Large language models (LLMs) are increasingly used in prefill-only workloads, where end-to-end latency is dominated by the prefill phase. For long-context prefill, communication overhead grows with sequence length and quickly becomes a bottleneck on conventional GPU systems, making wafer-scale chips (WSCs) a promising substrate due to their high communication bandwidth and large aggregate compute and memory capacity. A natural way to accelerate prefill is to partition a long input sequence into multiple chunks and execute them in a finer-grained pipeline across devices. However, directly applying this idea to long-context prefill on WSCs remains challenging. First, causal dependency across chunks causes KV cache to accumulate unevenly across pipeline stages, creating severe memory imbalance and limiting the feasible sequence length. Second, later chunks require more attention computation because each chunk depends on preceding chunks, leading to chunk-level latency imbalance. To address these challenges, we present MOCAP, a memory-orchestrated chunked pipelining framework for prefill-only LLM inference on WSCs. MOCAP introduces Memory-Balanced KV Reallocation (MBKR) to alleviate memory imbalance by redistributing KV cache across pipeline stages, thereby extending the feasible sequence length. It further incorporates Latency-Balanced Chunk Partitioning (LBCP) to balance chunk execution cost under both attention-cost growth and KV reallocation overhead, improving pipeline efficiency. Experimental results show that, compared with GPipe, MOCAP achieves 76.4% lower end-to-end latency and 3.24x higher throughput on average. MOCAP also extends the maximum supported sequence length by up to 1.31x compared with Terapipe.
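The chunk-level latency imbalance described here can be made concrete with a small Python sketch. It uses a deliberately simplified cost model that is an assumption of this example, not MOCAP's: only causal-attention work is counted (token t attends to t + 1 keys), and the linear layers and KV reallocation overhead that LBCP also balances are ignored. The function names are hypothetical.

```python
import math


def attention_cost(start: int, end: int) -> int:
    """Causal-attention work for tokens [start, end): token t attends to t + 1 keys."""
    return (end * (end + 1) - start * (start + 1)) // 2


def balanced_boundaries(seq_len: int, n_chunks: int) -> list[int]:
    """Boundaries that roughly equalise attention_cost.

    Cumulative cost grows like end**2, so the k-th boundary sits near
    seq_len * sqrt(k / n_chunks). This is a simplification for illustration.
    """
    return [round(seq_len * math.sqrt(k / n_chunks)) for k in range(n_chunks + 1)]


def imbalance(boundaries: list[int]) -> float:
    """Ratio of the most expensive chunk to the cheapest one."""
    costs = [attention_cost(a, b) for a, b in zip(boundaries, boundaries[1:])]
    return max(costs) / min(costs)


uniform = [k * 8192 for k in range(9)]        # eight equal-length chunks of 64K tokens
balanced = balanced_boundaries(65536, 8)      # shorter chunks later in the sequence
print(imbalance(uniform), imbalance(balanced))
```

Under this toy model, equal-length chunks leave the last chunk with roughly fifteen times the attention work of the first, while the square-root spacing brings the ratio close to one by making later chunks shorter.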

cs.AR 2026-06-23

Clutch speeds vector-scalar comparisons in DRAM 2.9x over prior PuD

by Daichi Tokuda, Tatsuya Kubo +9 more

Clutch: High Performance Vector-Scalar Comparison using DRAM via Chunked Temporal Coding

Chunked temporal coding cuts command count enough to beat CPU and GPU by 12x on average in database and ML tasks.

Vector-scalar comparison is a fundamental computation primitive that compares each element in a vector against a single scalar value. It is widely used in various data-intensive workloads from databases to machine learning. Due to its low computational intensity, its execution tends to be memory-bound, limiting the utilization of compute resources. Processing-using-DRAM (PuD) is an emerging computing paradigm that performs massively parallel bitwise operations directly inside DRAM arrays, alleviating off-chip data movement. Existing PuD-based approaches require many DRAM commands because the comparison's algorithmic complexity grows with operand bit-width in the bit-serial execution model. This command overhead becomes the dominant bottleneck, limiting application-level speedup. We propose Clutch, a data representation and comparison algorithm that accelerates vector-scalar comparisons in PuD systems with high efficiency and scalability. Clutch first uses temporal coding, encoding each vector value as a sequence of leading ones, which enables lookup-based comparison against a scalar by accessing the corresponding DRAM row. To avoid the prohibitive memory footprint of lookup tables at high precision, Clutch partitions operands into multiple multi-bit chunks, compares chunks independently using compact lookup tables, and merges the per-chunk results with a PuD-efficient procedure. By adjusting the number of chunks, Clutch provides a flexible tradeoff between throughput and memory usage. Across predicate evaluation and decision tree inference, Clutch improves end-to-end application throughput and energy efficiency by an average of 12x and 69x over highly optimized CPU and GPU execution, and by 2.9x and 3.0x over the state-of-the-art bit-serial PuD implementation. We also present the first mapping of decision tree inference to PuD execution, extending PuD to a new application domain.
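The idea of temporal coding with chunked lookup can be sketched in plain Python. This is a software illustration under stated assumptions, not the in-DRAM procedure: bit vectors are Python lists rather than DRAM rows, the per-chunk merge is the textbook most-significant-first rule rather than Clutch's PuD-efficient merge, and `total_bits` must be a multiple of `chunk_bits`. All function names are hypothetical.

```python
def thermometer_rows(values, chunk_bits):
    """Store one chunk of every element as a run of leading ones, one row per bit.

    rows[j][i] is 1 exactly when values[i] > j, so reading row s answers
    `values > s` for the whole vector with a single row lookup.
    """
    levels = (1 << chunk_bits) - 1            # code length = largest chunk value
    return [[1 if v > j else 0 for v in values] for j in range(levels)]


def chunk_compare(rows, s, n):
    """Per-element (greater, equal) bit vectors for one chunk against scalar chunk s."""
    gt = rows[s] if s < len(rows) else [0] * n    # nothing exceeds the maximum
    ge = rows[s - 1] if s > 0 else [1] * n        # every value is >= 0
    eq = [g & (1 - h) for g, h in zip(ge, gt)]    # equal = ge AND NOT gt
    return gt, eq


def vector_gt_scalar(values, scalar, total_bits=8, chunk_bits=4):
    """Evaluate values[i] > scalar chunk by chunk, most significant chunk first."""
    n = len(values)
    mask = (1 << chunk_bits) - 1
    gt_acc, eq_acc = [0] * n, [1] * n
    for shift in range(total_bits - chunk_bits, -1, -chunk_bits):
        rows = thermometer_rows([(v >> shift) & mask for v in values], chunk_bits)
        gt, eq = chunk_compare(rows, (scalar >> shift) & mask, n)
        # A lower chunk only matters while all higher chunks were equal.
        gt_acc = [a | (e & g) for a, e, g in zip(gt_acc, eq_acc, gt)]
        eq_acc = [e & q for e, q in zip(eq_acc, eq)]
    return gt_acc
```

In this sketch an 8-bit operand split into two 4-bit chunks needs 2 x 15 = 30 code rows instead of the 255 an unchunked thermometer code would take, which illustrates the throughput-versus-memory tradeoff the abstract mentions.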

cs.AR 2026-06-22

Computer systems must be redesigned for biological data analysis

by Nika Mansouri Ghiasi, Konstantina Koliogeorgi +1 more

Architecture for Health Initiative (Arch4Health): Computational Challenges in Health-Related Applications and the Role of Computer Architecture in Addressing Them

High-throughput biotech data outpaces conventional hardware, requiring architecture changes to deliver efficient, private healthcare process

Recent biotechnological advances enable high-throughput, low-cost, and accurate biological data generation. This wealth of data enables unique opportunities for advancing healthcare. Despite these opportunities, efficiently analyzing large-scale biological data poses significant challenges for conventional computing systems. These systems often cannot keep up with the high-throughput rate at which data is generated, and they face additional constraints related to energy efficiency, scalability, privacy, and security. Therefore, to facilitate the wide adoption of recent advances in healthcare, there is a need to optimize the computing systems to enable high-performance, energy-efficient, low-cost, private, and secure analysis of biological data. We introduce the Architecture for Health (Arch4Health) initiative, which aims to (i) identify and analyze key computational challenges in current and future health- and life science-related applications and (ii) explore how computer architects and computing system designers can advance healthcare by addressing these challenges. In this short paper, we first present the motivations behind the Arch4Health initiative and, second, elaborate on its vision and goals, related topics, Arch4Health workshops, and future outlooks.

cs.ET 2026-06-22

Four IP blocks share one interface for neuromorphic sensing and learning

by Poornima Kumaresan, Santhosh Sivasubramani

Design and Development of a Neuromorphic Silicon Suite: PVT Sensing, Stochastic LIF Inference, On-Chip STDP Learning, and Crossbar Programming

PVT sensor, stochastic neuron, STDP controller and crossbar driver all use the same SPI register file in 130 nm CMOS.

Edge neuromorphic systems need compact, configurable hardware that combines probabilistic inference, local learning, and an interface to emerging analogue memory. We present four interface-compatible digital IP blocks implemented as standard-cell CMOS on the SkyWater 130 nm process: a process, voltage and temperature (PVT) sensor built from five selectable ring oscillators that also provides a jitter-based true-random-number generator and a frequency-bounds health monitor; a stochastic leaky integrate-and-fire (LIF) neuron with a configurable LFSR, a programmable activation table, and a refractory period; an on-chip spike-timing-dependent plasticity (STDP) controller with a programmable curve and reward-modulated, eligibility-trace, and anti-Hebbian modes; and a memristive-crossbar controller supporting forming, set, reset, read, and automated current-voltage sweep with current-compliance limiting and half-select biasing. All four blocks share a common serial peripheral interface (SPI) register file; the sensor also exposes a parallel readout. Each occupies a single tile at a 50 MHz target. The suite was verified with 99 cocotb tests at register-transfer and gate level (all passing) and taken through an open standard-cell flow, then submitted for tapeout via the Tiny Tapeout shared-silicon programme. Mapped to the open cell library, each block occupies a post-synthesis cell area of 9.3 to 10.6 thousand square micrometres, places at 61 to 70 per cent tile utilisation, meets the 50 MHz constraint with positive setup and hold margin after clock-tree synthesis, and draws an estimated 0.64 to 0.70 mW under a default switching-activity assumption. The contribution is a coherent, openly released set of building blocks unified by one register interface and one verification flow. All results are from simulation and the implementation flow; no fabricated silicon is reported.

cs.CR 2026-06-22

ColumnDisturb in DRAM blocked at 0.15% slowdown

by Andreas Kosmas Kakolyris, F. Nisa Bostanci +8 more

ColumnKeeper: Efficient Solutions to the ColumnDisturb Vulnerability in DRAM-based Systems

Counters on odd and even columns or random refreshes across subarrays prevent column bitflips with negligible cost.

Modern DRAM chips are vulnerable to read disturbance phenomena such as RowHammer and RowPress, which induce bitflips after accessing nearby rows a certain number of times (the read disturbance threshold). ColumnDisturb is a new, fundamentally different DRAM read disturbance phenomenon. Specifically, ColumnDisturb (i) disturbs DRAM columns instead of rows, and (ii) increases the number of affected DRAM cells from those in only a few neighboring rows to all cells across three consecutive DRAM subarrays. We propose ColumnKeeper, the first set of ColumnDisturb mitigations, in two variants: ColumnKeeper-D (CK-D), a deterministic mechanism, and ColumnKeeper-P (CK-P), a probabilistic one. CK-D exploits DRAM's open-bitline architecture to provide deterministic security guarantees at low performance and energy overheads: it uses two counters per subarray to track activations affecting the odd and even columns, and refreshes one row in a subarray when either counter reaches a predetermined threshold. CK-P instead refreshes one row in three consecutive subarrays upon a row activation in the middle subarray, with a predetermined probability, providing configurable security guarantees at low area overhead. Both mechanisms prevent ColumnDisturb bitflips at low performance, energy, and area overheads. At the current experimentally-demonstrated ColumnDisturb threshold (1M), CK-D and CK-P incur very low average single-core performance overheads of 0.15% and 0.36%, respectively. For near-future thresholds (128K), these rise to a still low average of 1.70% and 2.73%. Mitigating ColumnDisturb at low thresholds (e.g., 16K) remains possible by adopting smaller subarray sizes or enabling subarray-level parallelism. CK-D and CK-P require low area overheads of 0.1 mm^2 and 0.03 mm^2, respectively. ColumnKeeper is freely available at https://github.com/CMU-SAFARI/ColumnKeeper .
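The counter bookkeeping described for CK-D can be sketched as follows. This is only an illustration of "two counters per subarray, refresh one row on threshold": how an activation maps to odd or even columns is left to the caller, and the round-robin row choice and the counter reset after a refresh are assumptions of this sketch, not details taken from the paper. The class and method names are hypothetical.

```python
class CKDTracker:
    """Bookkeeping sketch for a ColumnKeeper-D-style counter scheme."""

    def __init__(self, n_subarrays: int, rows_per_subarray: int, threshold: int):
        self.threshold = threshold
        self.rows = rows_per_subarray
        # Two counters per subarray: index 0 = even columns, index 1 = odd columns.
        self.counters = [[0, 0] for _ in range(n_subarrays)]
        # Assumption: rows are refreshed round-robin, one per threshold hit.
        self.next_row = [0] * n_subarrays

    def record(self, subarray: int, parity: int):
        """Record one activation that disturbs `parity` columns of `subarray`.

        Returns the row index to refresh, or None while below the threshold.
        """
        self.counters[subarray][parity] += 1
        if self.counters[subarray][parity] < self.threshold:
            return None
        self.counters[subarray][parity] = 0       # assumption: reset after a refresh
        row = self.next_row[subarray]
        self.next_row[subarray] = (row + 1) % self.rows
        return row
```

A caller would invoke `record` once per affected subarray and column parity for each row activation and issue a refresh whenever a row index is returned.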

cs.AR 2026-06-22

Memristor crossbar supports multi-level analog weights for on-chip LLMs

by David Alejandro Trejo Pizzo

Multi-Level Resistive Synapses for On-Chip Neural Networks: A Physics-Based Design of a Memristive Crossbar Fabric with Quasi-Continuous Conductance States

Physics-derived conductance states enable in-memory inference and learning with projected efficiency gains orders of magnitude above CPUs fo

Building on resistive communication, this paper presents a physics-based design of an on-chip neural network with multi-level memristive synapses supporting a dense spectrum of conductance states. Derived from ionic transport physics, we develop a state-variable model and quantify storable sub-levels under thermal noise, drift, and quantized conductance. We assemble these devices into a 1T1R crossbar fabric, derive the linear algebra of analog vector-matrix multiplication (VMM) under wire resistance, and design a differential synapse for signed weights. A multilayer pipeline executes inference, backpropagation, and weight updates physically in the analog domain. We derive the in-situ outer-product learning rule, its discretization onto the conductance lattice, and the resulting quantization noise. We provide energy, area, capacity, and inter-tile models, showing this substrate is ideally suited for large language models (LLMs). Our design eliminates weight movement, surpassing binary ReRAM and traditional CMOS. We detail the material stack (HfO_2-based), the FEOL/BEOL CMOS foundry-integration flow, a self-contained SPICE model, the complete memristive-FPGA neuromorphic system, and an in-memory self-attention engine with current-mode translinear softmax. Finally, a ternary BitNet datapath shows projected per-token efficiency orders of magnitude better than advanced CPUs or GPUs. The result is a self-contained hardware-native blueprint for a high-density, analog, in-memory neural processor.

cs.AR 2026-06-22

L2 hit latency spans 222-339 cycles by L40 SM

by Faruk Alpay, Baris Basaran

Non-Uniform L2 Cache Latency Across the Streaming Multiprocessors of an NVIDIA L40

A single-launch probe maps 52 percent variation across 142 SMs, enabling placement-aware scheduling and device fingerprinting.

The NVIDIA L40 exposes a 96 MiB L2 cache usually modeled as one uniform pool with a single hit latency. We show this is wrong at the granularity a kernel sees: L2-hit latency depends strongly and reproducibly on which physical streaming multiprocessor (SM) issues the load. A turn-serialized, %smid-resolved probe maps the hit latency across all 142 SMs in one launch; it is not a constant near 279 cycles but spans 222-339 cycles (a 52% range), with per-repetition noise below 0.01 cycles. An additive model $L = \mu + a(\mathrm{sm}) + b(\mathrm{slice})$ explains $R^2 = 0.87$ (0.98 with one rank-1 term), and the SM term is two-fold symmetric (two halves of 72 SMs at correlation $r = 0.999$), following the AD102 GPC layout. Independent access patterns agree per SM at $r = 1.000$, so the effect is physical. The same probe on a Blackwell RTX 5090 shows it generalizes, while the per-die pattern is device-specific. Read as a fingerprint, a single user-level probe identifies the SM within a device at 92%, and two physically identical L40s are separated at 100% despite near-identical mean latency (per-SM map $r = 0.63$): a per-die hardware identity, not a clock artifact. This is a self-localization and fingerprinting primitive: a kernel reads its own placement and device, not a victim's, and extracts no secret data. The map is stable, unchanged after an hour at full utilization on both devices. As a consequence, distributing latency-bound work by the map cuts makespan by up to 11%. Single-thread capacity, line-tag, prefetch-modifier, and persisting-L2 results appear as controls. The artifact contains seeds, raw observations, the trained model, and regeneration scripts.
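The additive model $L = \mu + a(\mathrm{sm}) + b(\mathrm{slice})$ can be fitted to a complete latency grid with nothing more than row and column means. The Python sketch below shows that calculation and the resulting $R^2$. It assumes a balanced grid with one measurement per (SM, slice) pair, and the function names are hypothetical; it is not the artifact's trained model.

```python
def fit_additive(latency):
    """Fit L = mu + a[sm] + b[slice] to a complete SM-by-slice latency grid.

    For a balanced grid the least-squares fit reduces to means: mu is the
    grand mean, a and b are the centred row and column means.
    """
    n_sm, n_slice = len(latency), len(latency[0])
    mu = sum(map(sum, latency)) / (n_sm * n_slice)
    a = [sum(row) / n_slice - mu for row in latency]
    b = [sum(latency[i][j] for i in range(n_sm)) / n_sm - mu for j in range(n_slice)]
    return mu, a, b


def r_squared(latency, mu, a, b):
    """Fraction of latency variance explained by the additive terms."""
    ss_res = sum(
        (x - (mu + a[i] + b[j])) ** 2
        for i, row in enumerate(latency)
        for j, x in enumerate(row)
    )
    ss_tot = sum((x - mu) ** 2 for row in latency for x in row)
    return 1.0 - ss_res / ss_tot
```

A grid that is exactly additive gives $R^2 = 1$; any SM-slice interaction, such as the rank-1 term the abstract adds, shows up as a value below one.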

cs.LG 2026-06-22

Nonlinear connections cut nodes needed for smooth analogue control

by Ian T. Vidamour, Fernando Aguirre +14 more

Low-power analogue neural networks with trainable nonlinear connections for continuous control

Networks match multilayer perceptrons on classification but use far fewer parameters on robotic trajectories and power tracking, transferrin

Physical neural networks promise low-power machine learning by computing directly with analogue device physics, but most architectures force nonlinear device responses to act as scalar weights. Inspired by Kolmogorov-Arnold networks, we place trainable nonlinear functions on the connections, making each physical connection a learnable computational element. Realising these functions as analogue band-pass filters on field-programmable analogue arrays, we find that the benefit is task-dependent and follows from the smoothness of the physical basis: the networks represent smooth, continuously valued targets, including robotic kinematics, continuous control, and photovoltaic maximum-power-point tracking, with far fewer nodes and connections than multilayer perceptrons, but offer no parameter-efficiency advantage on classification-like decision boundaries. Trained networks transfer to hardware across approximately 35,000 connections with quantified fidelity, and a dedicated CMOS implementation is projected to operate at approximately 30 microwatts. A memristive realisation reproduces the same behaviour in simulation, indicating that the advantage comes from placing trainable nonlinearity on connections, rather than from a particular device.

cs.AR 2026-06-22

Coordination lifts NPU sparse matrix speed 1.26-7.78x

by Xin Ai, Zeyu Ling +5 more

NeutronSparse: Coordinating Heterogeneous Engines for Sparse Matrix Multiplication on NPUs

Adaptive workload balancing and tile reuse let Ascend 910B match or beat GPU libraries on SpMM.

Sparse matrix-matrix multiplication (SpMM) is a fundamental data operation for large-scale sparse data processing. With NPUs increasingly deployed in data centers for their performance and energy efficiency, accelerating SpMM on these platforms is a natural choice. However, high-performance SpMM on NPUs poses a data management challenge, as irregular sparsity demands efficient data organization and scheduling. On Ascend 910B, the official MindSpore implementation achieves only 36.3% of the performance of GPU-based sparse libraries such as cuSPARSE on NVIDIA A100. To this end, we conduct an in-depth architectural analysis of SpMM execution on NPUs versus GPUs and identify that the key performance bottleneck for SpMM on NPUs lies in the lack of efficient coordination across heterogeneous compute units under the tile-based execution model. Therefore, we propose NeutronSparse, a coordination-first SpMM framework for NPUs. NeutronSparse integrates two key techniques: (i) Sparsity-aware coordination of heterogeneous engines, which adaptively partitions and balances workloads between heterogeneous compute units to keep them busy, and (ii) Locality-aware tile orchestrating, which reorganizes and reuses data tiles to reduce redundant computation and memory movement overhead. Evaluations on Ascend 910B show that NeutronSparse achieves 1.26x-7.78x speedup over NPU baselines and 1.03x-3.07x speedup over leading GPU libraries on NVIDIA A100, revealing untapped potential of NPUs for sparse computation.

cs.AR 2026-06-22

Second write to DRAM row shifts read-disturbance threshold

by Haocong Luo, İsmail Emir Yüksel +5 more

DejaVu: Why You Should Write to Your DRAM Rows Twice, Carefully

Opposite data lowers the activation count for bitflips while repeated data raises it, shown across 112 DDR4 chips.

We provide the first experimental demonstration of DejaVu, a phenomenon where the data previously written to DRAM cells affects DRAM's vulnerability to read disturbance. Our experimental characterization using 112 COTS DDR4 DRAM chips from all three major manufacturers shows that, compared to the baseline where we initialize the victim row by writing to it only once, 1) overwriting it with the opposite data reduces ACmin, the minimum aggressor row activation count to induce a bitflip, and 2) writing the same data twice increases ACmin. We provide two hypotheses to explain DejaVu. First, we hypothesize that overwriting the victim row with opposite data values causes under-restoration of charge in DRAM cells. Second, we hypothesize that overwriting the victim row changes charge trap states in the active region, affecting read-disturbance-induced cell leakage current. We conduct controlled characterization to provide insight into these hypotheses. We further characterize the reliability of Processing-Using-DRAM (PUD) operations with DRAM rows initialized with DejaVu patterns. Our characterization of 32-row MAJ-3 operation shows that overwriting the DRAM rows used in the operation reduces the number of bitlines that fail to reliably perform MAJ-3 by 32.7% on average compared to the baseline where rows are written only once. Based on our observations, we describe two major implications of DejaVu. We show how DRAM testing and characterization methodologies should account for DejaVu to accurately characterize read disturbance vulnerability under fixed data patterns and rigorously study data-pattern effects without unintended interference from DejaVu. We also evaluate the performance overhead of read disturbance mitigation techniques when thresholds need to be lowered to be secure against DejaVu, showing a 6.3% overhead when reducing the threshold by 20%.

cs.AR 2026-06-22

Reverse engineering maps Apple Neural Engine internals

by Spencer H. Bryngelson

Apple Neural Engine: Architecture, Programming, and Performance

Datapath, compiler format, weight compression, and command protocol detailed across A11 to A18 and M1 to M5 chips.

The Apple Neural Engine (ANE) is the fixed-function matrix accelerator that has shipped in Apple systems-on-chip since the A11-class iPhone and iPad chips and the M1-class Mac chips, exposed to applications only through the Core ML model framework. This guide reports a reverse-engineered account of the engine, based on direct measurement on Apple silicon and static analysis of the private runtime, compiler, kernel driver, and firmware. It documents the datapath and the roofline that bound the engine's throughput and energy, the dispatch route that reaches it below Core ML, the compiler and on-disk program format, the weight-compression scheme, and the kernel driver, firmware, and command protocol beneath them. The account covers the A11 through A18 and M1 through M5 families, with per-chip target tables and an operation-by-device matrix; the direct measurements are on the M1 and M5. Claims are labeled as measured, decompile-derived, or predicted, and the methodology and open questions are recorded. The direct route is callable from ordinary user space but remains undocumented, unsupported, and version-fragile; it is intended for measurement, research, and on-device work, not for shipping software, where Core ML remains the supported path.

cs.CR 2026-06-22

Trusted interposer blocks chiplet coherence attacks by construction

by Charles Williams, Mohammed Nabeel +4 more

2.5D Root of Trust: Securing the Chiplet Ecosystem

Monitors embedded in the interposer enforce permissions on untrusted chiplets without any modifications to those dies.

The semiconductor industry is rapidly transitioning from monolithic systems-on-chip toward heterogeneous, multi-vendor 2.5D chiplet ecosystems integrated via silicon interposers. While this paradigm shift offers immense benefits in yield, cost, and time-to-market, it radically expands the attack surface. Integrating chiplets from untrusted foundries and design houses introduces vulnerabilities to hardware Trojans, IP piracy, and system-level communication exploits. Critically, chip-level security features and conventional Root of Trust (RoT) proposals are insufficient in this context: any component, including the interconnect fabric itself, may be sourced from an untrusted vendor. This perspective paper surveys state-of-the-art security strategies for interposer-based 2.5D integration, focusing on three threat categories: interconnect attacks (snooping, spoofing, and man-in-the-middle), cache coherence exploits including complex forging attacks, and microarchitectural side-channel threats. We examine design-time defenses via 2.5D split manufacturing and, more crucially, runtime defenses that establish an active interposer as a physically isolated 2.5D RoT. By embedding transaction monitors and coherence message checkers within the trusted interposer fabric, the system enforces memory access permissions by construction and neutralizes coherence-level attacks without modifying or securing the commodity chiplets. Finally, we review the EDA flows required to realize these defenses and show that they concurrently improve power and signal integrity while reducing overall system footprint.

quant-ph 2026-06-22

2D T-junction ion traps lower shuttling costs versus 1D

by J. Durandau, C. A. Brunet +4 more

Shuttling in Bidimensional Segmented Ion-Trap Quantum Processors with T-Junctions

Advantage grows with ion number when junction and linear move costs are equal.

Shuttle-based trapped-ion quantum processors typically employ a one-dimensional (1D) linear architecture to transport ion qubits between one or more laser interaction zones, where the quantum gates are implemented, and several qubit register storage segments. The two-dimensional (2D) quantum CCD architecture also employs T- or X-junctions for improved scaling and efficiency. Here, we explore the shuttling layer in the compilation of typical quantum-algorithm building blocks in such an architecture. To weigh the effort of linear and junction shuttling, we introduce an individual cost function for each operation. This allows us to compare the total cost of quantum circuit building blocks such as the QFT, Carry, Adder, Shift, and Comparator circuits. We study how these costs scale with increasing qubit number. At equal transport cost for junction and linear shuttling, we show that 2D architectures outperform the 1D linear trap, with the ratio improving as the number of ions increases. Finally, we discuss the use of cells, such that the entire processor is constructed from a 2D array of interconnected cells. The work aims to optimize quantum processor architectures through a co-design that fits the specific task and scales up in a shuttle-efficient way.
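As a rough illustration of weighting linear and junction shuttles separately, the sketch below sums per-operation costs into a total. The additive form, the class `ShuttleCosts`, the function `total_shuttle_cost`, and the operation counts are all assumptions made for illustration; the abstract does not specify the paper's actual cost functions or any counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShuttleCosts:
    """Hypothetical per-operation weights (arbitrary units)."""
    linear: float = 1.0
    junction: float = 1.0


def total_shuttle_cost(n_linear: int, n_junction: int, costs: ShuttleCosts) -> float:
    """Assumed additive model: each shuttle contributes its own weight."""
    return n_linear * costs.linear + n_junction * costs.junction


# Made-up operation counts for one compiled circuit block.
equal = ShuttleCosts(linear=1.0, junction=1.0)
pricey_junction = ShuttleCosts(linear=1.0, junction=4.0)

cost_equal = total_shuttle_cost(n_linear=12, n_junction=3, costs=equal)
cost_pricey = total_shuttle_cost(n_linear=12, n_junction=3, costs=pricey_junction)
```

Setting `junction` equal to `linear` corresponds to the "equal transport cost" regime mentioned in the abstract; raising the junction weight shows how sensitive a total is to that choice.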

cs.AR 2026-06-22

LLM agent finds better architectures with 100x fewer simulations

by Chenyu Wang, Jiahe Caroline Shi +5 more

AgentDSE: Reasoning-Augmented Architectural Design Space Exploration

General-purpose coding agent reasons through constraints and bottlenecks to match expert results across multiple domains.

Traditional architectural design space exploration (DSE) is highly inefficient, typically requiring tens of thousands of simulator evaluations across various optimization methods. This inefficiency arises because conventional methods treat the simulator as a black-box oracle. In contrast, human architects effectively guide exploration by reasoning through physical constraints, performance bottlenecks, data reuse, and workload structures. To bridge this gap, we introduce AgentDSE, a simulator-in-the-loop methodology driven by a general-purpose large language model (LLM) coding agent. AgentDSE automates this architectural-reasoning loop without requiring model fine-tuning, precomputed design databases, or domain-specific optimizer code. Across deep neural network (DNN) accelerator mapping, hardware/software co-design, and CPU cache-hierarchy optimization, AgentDSE achieves competitive or better design quality with up to two orders of magnitude fewer evaluations. AgentDSE also produces inspectable traces that surface architectural hypotheses, performance cliffs, implicit priors, and simulator artifacts, making every search decision traceable rather than buried in optimizer state.

cond-mat.mes-hall 2026-06-22

Transcapacitor cuts logic energy by 100 times via capacitance modulation

by Amrita Mathuriya, Roza Kotlyar +10 more

Solid-state transcapacitor, a new gain element for logic, memory and interconnects

Gate stress on polar channels replaces current flow, removing Boltzmann limits and enabling dense memory at lower voltage.

Today's transistors dictate the voltage and charge scales for both logic and memory. While AI systems are recognized to be limited by memory energy, the dominant share of the energy is expended in the intrachip interconnects, whose voltage and charge scales are set by transistors. The energy scaling challenges of transistors can be attributed to the need to simultaneously achieve high current density and high current/impedance modulation, together with the inability to lower voltages. Hence, a new logic element that lowers the voltage and charge needs is a priority, not only for lowering logic power but also memory access power. Here, we propose a novel 3-terminal logic element for low-energy computing, a solid-state transcapacitor (TCAP). A TCAP is a solid-state displacement current modulator realized by a gate that controls the charge-voltage relationship of the channel. Unlike transistors, TCAPs eliminate the dissipative transport current, are not bound by the Boltzmann current modulation limit, and operate with displacement currents limited only by the polarization response and contact resistance. Hence, TCAP circuits may simultaneously overcome the voltage, current density, and current modulation limits of CMOS. We describe a solid-state TCAP using a piezoelectric transcapacitor in which a gate-controlled stressor modulates the capacitance of a polar channel via electromechanical coupling. This device achieves inversion and gain, essential for logic, and is functionally equivalent to a 1T-1C memory cell, enabling dense memory. Using voltage scaling, capacitive energy recovery, and the high polarization densities of polar materials, TCAP-based logic offers a pathway to 100-fold lower energy consumption with a delay comparable to ultimately scaled CMOS devices. This approach provides a potential new pathway for low-energy computing beyond the limits of transistors using electromechanics and multiferroics.

cs.AR 2026-06-22

Quantized rows cut analog layout gap by up to 68.5%

by Endalk Y. Gebru, Ramprasath S. +2 more

Row-Based Layout Synthesis for Analog Circuits Using Height-Quantized Primitives

Synthesis maps blocks into fixed-height rows, matching custom performance and lowering area up to 24.1%.

Restrictive design rules and strong layout-dependent effects have tightened the coupling between physical layout decisions and electrical performance in advanced process nodes, such as FinFET, making analog and mixed-signal (AMS) layout automation increasingly difficult. This paper presents a quantized row-height layout synthesis methodology for AMS circuits, an approach that has previously been shown to reduce the simulation-to-silicon gap. The proposed flow optimizes a row-height fabric from circuit requirements and layout constraints while mapping analog building blocks into quantized-height rows. Results on multiple testcases demonstrate that the proposed flow synthesizes layouts whose postlayout performance is comparable to that of less-constrained custom baseline designs. Our quantized-height designs reduce the schematic-to-postlayout performance gap by up to 68.5% and achieve lower area for most of our testcases, with a maximum area reduction of 24.1%.
