cs.PF — Pith

cs.PF 2026-07-03

Busy fractions identify MAP parameters for QBD queues

by Chen Li, Junjun Zheng +2 more

Markovian Arrival Process Parameter Estimation of Quasi-birth-death Queueing Systems with Utilization Data

An EM algorithm derives the necessary expected statistics from utilization intervals alone, allowing parameter estimation without event-leve

Parameter estimation for queueing systems is commonly performed using inter-arrival times, waiting times, or queue-length observations. However, such detailed observations are often unavailable in practical computer systems, where utilization data, such as CPU utilization, is much easier to collect. Utilization data provides only the fraction of time during which the system is busy within each monitoring interval, while the exact arrivals, services, phase transitions, and system states in unobservable periods remain hidden. This paper proposes an expectation-maximization (EM) algorithm for estimating the parameters of Markovian arrival process (MAP)-driven quasi-birth-death (QBD) queueing systems from utilization data. The proposed method formulates the underlying queueing dynamics as a QBD process and derives the expected sufficient statistics for sojourn times, phase transitions, arrivals, and services over both observable and unobservable intervals. These expectations are then used to iteratively update the MAP and service parameters under the maximum likelihood framework. In addition, Akaike's information criterion is introduced to select the appropriate number of MAP phases and mitigate overfitting. The proposed framework enables MAP-based queueing parameter estimation from incomplete utilization observations and provides a practical modeling approach for systems where detailed event-level measurements are difficult to obtain.

cs.SE 2026-07-02

Stochastic model estimates microservice availability from traces

by Anatoly A. Krasnovsky, Anna Maslovskaya

Stochastic Connectivity as the Foundation of a Runtime Model for Microservice Availability Analysis

Monte Carlo on reconstructed graphs and probability measures replaces repeated fault-injection tests

Microservice availability is commonly assessed by fault injection and chaos experiments, but such experiments are costly, operationally risky, and difficult to repeat for every architectural change. Distributed tracing and deployment metadata provide cheaper evidence, yet they usually remain descriptive: they show which services interacted, not what endpoint-level availability property follows. This paper proposes a formal runtime availability model based on stochastic connectivity for resilience-oriented analysis of microservice endpoints. It treats endpoint availability under explicit fault scenarios as a measurable facet of microservice resilience, combining a typed service-dependency graph, a replication map, a probability measure over node and edge states, and request-specific success predicates. Its semantics separates computational failures of service replicas from communication failures of logical dependencies, showing that replication cannot compensate for bottleneck dependencies. The model can be reconstructed from traces and deployment artifacts, parameterized for architectural what-if analysis, and analyzed by Monte Carlo simulation before or alongside fault injection. We define the model, its trace-to-model construction, elementary semantic properties, and a synthetic adequacy study. The study matches closed-form oracle cases within sampling error and exposes boundaries caused by edge bottlenecks, correlated failures, missing traces, and time-dependent failures.
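
The Monte Carlo analysis described above can be sketched roughly as follows. This is a minimal illustration and not the authors' model: the function names, the acyclic dependency map `deps`, the independence of failures, and the "every dependency must succeed" predicate are all assumptions made for the example.

```python
import random

def sample_success(deps, replicas, p_replica_up, p_edge_up, entry, rng):
    """Draw one random world and report whether a request to `entry` succeeds."""
    # Computational failures: a service is up if at least one replica is up.
    service_up = {
        s: any(rng.random() < p_replica_up for _ in range(replicas[s]))
        for s in deps
    }
    # Communication failures: each logical dependency edge fails independently.
    edge_up = {(s, d): rng.random() < p_edge_up for s in deps for d in deps[s]}

    def ok(s):
        # Assumed success predicate: the service and all its dependencies work.
        return service_up[s] and all(edge_up[(s, d)] and ok(d) for d in deps[s])

    return ok(entry)

def estimate_availability(deps, replicas, p_replica_up, p_edge_up, entry,
                          samples=20000, seed=0):
    """Estimate endpoint availability as the fraction of successful samples."""
    rng = random.Random(seed)
    hits = sum(
        sample_success(deps, replicas, p_replica_up, p_edge_up, entry, rng)
        for _ in range(samples)
    )
    return hits / samples

# Hypothetical two-service chain: "api" depends on "db".
deps = {"api": ["db"], "db": []}
single = estimate_availability(deps, {"api": 1, "db": 1}, 0.9, 0.9, "api")
triple = estimate_availability(deps, {"api": 3, "db": 3}, 0.9, 0.9, "api")
```

Under these assumptions the closed form for the chain is P(api up) x P(edge up) x P(db up), so tripling the replicas raises availability from about 0.729 to about 0.898 but cannot exceed the 0.9 edge reliability, which is the bottleneck effect the abstract mentions.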

cs.CV 2026-07-02

Plain ViT leads segmentation throughput at every resolution

by Tobias Christian Nauen, Anosh Billimoria +4 more

LUMA: Benchmarking Segmentation via a Lightweight Universal Mask Adapter

Fixed LUMA head shows pretraining objectives matter more than architecture across 20 backbones on ADE20K and Cityscapes.

Comparing transformer backbones for image segmentation is confounded: each is paired with a different decoder, recipe, and pretraining, so reported differences rarely reflect the backbone itself. We introduce the Lightweight Universal Mask Adapter (LUMA), a lightweight, backbone-agnostic mask-transformer head that treats any backbone as a black-box feature extractor, letting a set of queries read from its features through cheap cross-attention. LUMA matches the accuracy of EoMT, the state-of-the-art efficient ViT-segmenter, at lower cost, while attaching unchanged to isotropic, hierarchical, convolutional, and mixture-of-experts backbones alike. Holding this head fixed, we benchmark 20 backbones, 11 pretraining schemes, and a range of resolutions on ADE20K and Cityscapes under one modern recipe. We find that "efficient" token mixers fail to deliver efficiency even at the high resolutions that motivate them, with plain ViT holding the throughput Pareto-front at every resolution. Additionally, the pretraining objective, rather than the architecture (the lever the field has tuned hardest), governs segmentation quality.

cs.CL 2026-07-02

BaseRT achieves 1.56x higher LLM decode speed on Apple Silicon

by Prabod Rathnayaka, Fabian Waschkowski +1 more

BaseRT: Best-in-Class LLM Inference on Apple Silicon via Native Metal

Native Metal kernels and direct memory access outperform general frameworks across model sizes and quantizations.

We present BaseRT, a native Metal inference runtime for large language models (LLMs) on Apple Silicon, and report the highest inference throughput on this hardware to date. Existing runtimes, including llama.cpp and MLX-based frameworks, incur overhead from abstractions not designed for Metal's execution model or Apple Silicon's unified memory topology. By building natively on Metal with chip-specific kernel fusion, unified memory-aware optimisation, and custom dispatch logic, BaseRT recovers performance that framework-based approaches leave on the table. BaseRT supports a wide range of model families across eight quantisation formats (Q2 to FP16) on all Apple M-series devices. In this paper, we evaluate the Qwen3, Llama 3.2, and Gemma 4 families at Q4 and Q8 quantisation on M3 and M4 Pro devices. BaseRT achieves up to 1.56x higher decode throughput than llama.cpp and up to 1.35x higher than MLX, with substantially larger margins on prefill for mixture-of-experts models, delivering consistent best-in-class throughput from sub-1B to 30B parameter models. These results establish Apple Silicon as a more capable inference platform than previously reported, with direct implications for the emerging edge inference paradigm: as privacy requirements, latency constraints, and cloud cost pressures drive inference toward on-device deployment, performance-optimised local runtimes are a critical enabling layer for this transition. BaseRT is publicly available at https://github.com/basecompute/baseRT

cs.SE 2026-07-01

LLVM -O3 pipeline regresses in 6.6-9.7% of pass steps

by Federico Bruzzone, Walter Cazzola

A Multi-Dimensional, Per-Pass Empirical Study of the LLVM Optimization Pipeline

Per-prefix measurements on 30 kernels show most gains arrive late and the final config loses on size-speedup for 29 kernels.

Quantifying the marginal impact of individual optimization passes underpins phase ordering, pass selection, optimization design, and analysis of pass/hardware interactions. In LLVM -- the standard backend for C/C++, Rust, and ML stacks via MLIR -- interactions among optimization passes, measurement noise, and pipeline scale make this difficult. We present a systematic, empirical study of the LLVM -O3 optimization pipeline. We decompose the pipeline into cumulative per-pass prefixes. We then measure execution time, compile time, binary size, hardware counters, and RAPL energy across 84,750 measurements covering 113 cumulative prefixes of the -O3 pipeline evaluated on 30 PolyBench/C kernels under rigorous noise mitigation. On these compute-bound affine kernels, the pipeline is non-monotone (6.6-9.7% of transitions regress) and strongly back-loaded (the median non-regressing kernel needs 84.8% of the pipeline for 80% of its speedup). Most gains are driven by a small Pareto-dominant core of passes, while the final -O3 configuration is Pareto-dominated on (size, speedup) for 29 of 30 kernels. We further show that IR instruction count is an unreliable predictor of runtime, that runtime-targeted passes are de facto energy-targeted (30-60% savings), and that the search-free idealized-additive upper bound on losses due to phase interference is 46.35%. These findings enable more informed pass pruning, cost-model calibration, and autotuning.
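
The phrase "Pareto-dominated on (size, speedup)" can be made concrete with a small check. The sketch below is illustrative only: the helper names and the sample numbers are hypothetical and are not taken from the study's data.

```python
def dominates(a, b):
    """True if point a is at least as good as b on both axes and better on one.

    Points are (binary_size, speedup): smaller size and larger speedup are better.
    """
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better

def is_pareto_dominated(final, prefixes):
    """True if some pipeline prefix dominates the final configuration."""
    return any(dominates(p, final) for p in prefixes)

# Hypothetical measurements for one kernel: (binary size in KiB, speedup over -O0).
prefix_points = [(100, 1.5), (90, 1.8), (120, 2.0)]
final_o3 = (130, 1.9)  # larger and slower than the (120, 2.0) prefix
```

With these made-up numbers, `is_pareto_dominated(final_o3, prefix_points)` is true because an earlier prefix is both smaller and faster, which is the situation the abstract reports for 29 of 30 kernels.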

cs.CC 2026-06-30

Access cost grows as fourth root of data size

by Chen Ding

The Fourth-Root Complexity of Data Movement

Abstract memory hierarchy shows per-access costs scale as N to the 1/4 for common apps, distinguishing power-law from exponential miss ratio

Time complexity typically assumes $O(1)$ cost per data access. This paper presents an analysis based on an abstract memory hierarchy. For a common class of applications, it shows that the data-access cost scales with the fourth root of data size, that is, as data size $N$ increases, the cost of each access increases at the rate of $N^{1/4}$. While the analysis does not predict performance, it predicts scalability. Specifically, the paper provides a precise analysis that shows the constant-factor difference between cases where the miss ratio follows a power law versus an exponential decay.
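
To get a feel for what fourth-root scaling implies, the following sketch computes the relative per-access cost when the data size grows. It assumes only the proportionality stated in the abstract (cost per access proportional to N^(1/4)); the function name is hypothetical and the constant factor is deliberately left out.

```python
def access_cost_ratio(n_old, n_new):
    """Relative per-access cost after growing data size from n_old to n_new.

    Assumes cost per access is proportional to N ** (1/4), so the unknown
    constant factor cancels in the ratio.
    """
    if n_old <= 0 or n_new <= 0:
        raise ValueError("data sizes must be positive")
    return (n_new / n_old) ** 0.25

ratio_16x = access_cost_ratio(1_000_000, 16_000_000)
```

Under this assumption, growing the data 16-fold only doubles the per-access cost, and a 10,000-fold increase raises it tenfold.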

cs.LG 2026-06-30

Coding agent workloads show long loops and imperfect caches

by Kan Zhu, Mathew Jacob +5 more

TraceLab: Characterizing Coding Agent Workloads for LLM Serving

Trace of 4300 sessions finds patterns that suggest specific serving optimizations for agentic LLMs

Coding agents are rapidly becoming a major application of agentic LLMs, but serving them efficiently remains challenging. Progress on this challenge requires understanding real workload patterns, yet the data needed for such analysis is largely absent. Existing public traces and benchmarks do not capture real, day-to-day coding-agent usage across multiple agents and model families for serving-system analysis. To help fill this gap, we collect and release a trace of roughly 4,300 coding-agent sessions, containing about 350,000 LLM steps and 430,000 tool calls from our own day-to-day use of Claude Code and Codex. Our analysis shows that coding-agent workloads feature long autonomous loops, long contexts with short outputs, diverse and heavy-tailed tool calls, and high but imperfect prefix cache hit rates. These findings point to concrete opportunities for optimizing serving, including lower-overhead tool calling, append-length-aware prefill, semantic-aware tool-latency prediction, and improved KV-cache management around human-paced gaps. We release the dataset, trace collection pipeline, and analysis code at https://github.com/uw-syfi/TraceLab.git; the project website is https://tracelab.cs.washington.edu.

cs.DC 2026-06-30

CFG tool from traces tests HPC I/O changes without code edits

by Zhaobin Zhu, Chen Wang +2 more

FBench: A Flexible Benchmark for CFG-Based What-If Exploration of HPC I/O Patterns

Reproduces real workload behavior and shows collective I/O can cut bandwidth 30x on Lustre.

The I/O performance of large-scale HPC applications depends on a complex interplay of access patterns, middleware optimizations, and file system configurations. To systematically explore these effects without repeatedly rerunning full applications, we introduce FBench, a flexible and code-transparent benchmarking tool for what-if analysis and I/O performance exploration. FBench leverages context-free grammars (CFGs) derived from Recorder traces to either generate simplified global configuration files for benchmark execution or replay I/O patterns on-the-fly without additional preprocessing. It supports both POSIX and MPI-IO interfaces and allows users to inject optimization hints via JSON configuration files, enabling rapid experimentation with I/O settings without code changes. Our evaluation shows that FBench accurately reproduces I/O behavior for both synthetic and real workloads, capturing access patterns and performance trends across diverse optimizations and file system settings. For IOR and HACC-IO, FBench closely matches scaling behavior and sensitivity to Lustre striping parameters. For FLASH Sedov, it reveals that collective I/O on Lustre can yield up to 30x lower write bandwidth than independent I/O, largely independent of striping, and that switching to a burst buffer file system increases non-collective write bandwidth by about 1.5x without additional tuning. The evaluation with LAMMPS shows that FBench can significantly reduce the time required for what-if analyses and, with simple tuning, enable improvements of up to 8x.

cs.DC 2026-06-29

HGST hard drives fail at 41% the rate of Seagate drives

by Christoph Siemroth, Yeomyung Park

Are There Manufacturer Differences in Hard-Drive Reliability?

Backblaze analysis controls for age, capacity, temperature and form factor to reveal manufacturer differences in HDD failure rates.

Based on the Backblaze hard disk drive (HDD) dataset, we analyze whether the four major HDD manufacturers represented in the dataset -- HGST, Seagate, Toshiba, Western Digital (WD) -- show differences in short- to medium-term HDD failure rates. Using two different duration regression models, we find -- holding constant drive age, capacity, form-factor, and drive temperature -- that Toshiba's failure rate is slightly above Seagate's. HGST HDD failure rates are the lowest, about 41% of Seagate's. WD HDD failure rates are significantly above HGST's, but still only about 52% of Seagate's. We also document the effects of age, capacity, temperature and drive location on failure rates.

cs.DC 2026-06-29

Workload and size decide best concurrent linked list

by Zeeshan Mohammed Rangrej

Five Ways to Build a Concurrent Linked List: From Coarse-Grain Locking to Lock-Free Algorithms

Benchmarks of five designs show coarse-grain and lazy win on small read-heavy lists while lock-free competes on large ranges and high thread

Linked lists are one of the most basic data structures in computer science. But when many threads try to use the same linked list at the same time, things get complicated. In this paper, we look at five different ways to make a linked list work correctly and efficiently with multiple threads running at once. We start with the simplest approach -- one big lock for the whole list -- and step by step improve it, ending with a lock-free design that uses no locks at all. We implemented all five versions in C++ and measured how fast each one is across different workloads (read-heavy, balanced, and write-heavy) and different list sizes. Our results show that the right choice of algorithm depends heavily on how the list is used: the coarse-grain and lazy lists win under read-heavy workloads with small key ranges, while the lock-free list becomes competitive when key ranges are large and more threads are running. Fine-grain locking, despite its theoretical appeal, pays a heavy cost from per-node lock overhead and consistently performs the worst in our tests.
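
The simplest of the five designs, one big lock for the whole list, can be sketched as below. The paper's implementations are in C++; this Python version is only an illustration of the idea, and the class and method names are assumptions rather than the authors' code.

```python
import threading

class _Node:
    __slots__ = ("key", "next")

    def __init__(self, key, next=None):
        self.key = key
        self.next = next

class CoarseList:
    """Sorted set on a singly linked list, guarded by a single global lock."""

    def __init__(self):
        self._head = _Node(float("-inf"))  # sentinel, never removed
        self._lock = threading.Lock()

    def _find(self, key):
        # Walk to the first node whose key is >= key; caller must hold the lock.
        pred, curr = self._head, self._head.next
        while curr is not None and curr.key < key:
            pred, curr = curr, curr.next
        return pred, curr

    def add(self, key):
        with self._lock:
            pred, curr = self._find(key)
            if curr is not None and curr.key == key:
                return False  # already present
            pred.next = _Node(key, curr)
            return True

    def remove(self, key):
        with self._lock:
            pred, curr = self._find(key)
            if curr is None or curr.key != key:
                return False  # not present
            pred.next = curr.next
            return True

    def contains(self, key):
        with self._lock:
            _, curr = self._find(key)
            return curr is not None and curr.key == key
```

Every operation serializes on the same lock, which keeps the code easy to reason about; the finer-grained and lock-free designs discussed in the abstract trade this simplicity for more parallelism.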

cs.PF 2026-06-29

Kernel simulator predicts LLM latencies on new GPUs at 12.1% error

by Xiteng Yao, Taeho Kim +8 more

KernelSight-LM: A Kernel-Level LLM Inference Simulator

Cross-generation tier needs no target data and improves 1.8x over roofline for serving predictions.

As large language models (LLMs) move into production serving, practitioners must rapidly evaluate inference performance across diverse hardware, models, and serving parameters to meet cost and latency targets. However, the end-to-end behavior of LLMs couples serving-layer policies with low-level GPU kernel execution and rapidly evolving architectures, forcing slow, deployment-specific benchmarking that is hard to generalize. We present KernelSight-LM, a fine-grained inference simulator that models token-level execution and produces kernel-level latency breakdowns. It decomposes each serving step into a roofline kernel model with a learned efficiency term, a communication model, and a host-overhead model, composed through a discrete-event scheduler that also captures mechanisms like prefix caching and continuous batching. KernelSight-LM offers two prediction tiers that trade target-GPU data for accuracy. The cross-generation tier uses no target-GPU measurements, only hardware specifications and kernel microbenchmarks from previously profiled GPUs, and predicts per-kernel latency on an unseen GPU generation to 12.1% error, a 1.8x improvement over the roofline baseline (22.0%). A second target-measured tier adds one model-agnostic kernel-microbenchmark sweep on the target GPU, sharpening per-kernel error to 3.8%, a 7.3x improvement over a comparable baseline (27.7%). Both tiers require far less target-GPU data than the prior systems they extend. In our simulator, these predictions yield end-to-end median (p50) errors across six model families of 15.4%, 12.8%, and 3.0% (TTFT, TPOT, throughput) in the cross-generation tier and 14.3%, 6.2%, and 2.7% in the target-measured tier, matching dedicated profiling tools while collecting far less on-device data. Beyond prediction, its kernel-level bottleneck breakdowns support hardware/software co-design and capacity planning.
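
A roofline kernel model with an efficiency term might look roughly like the sketch below. The abstract does not say how the learned efficiency enters the model, so dividing the roofline bound by an efficiency in (0, 1] is an assumption here, as are the function names and the example hardware numbers.

```python
def roofline_latency_s(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Lower-bound kernel latency: the slower of compute time and memory time."""
    compute_s = flops / peak_flops_per_s
    memory_s = bytes_moved / peak_bytes_per_s
    return max(compute_s, memory_s)

def predicted_latency_s(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s,
                        efficiency):
    """Roofline bound scaled by an efficiency factor in (0, 1].

    In the paper the efficiency is learned; here it is just a given number.
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    bound = roofline_latency_s(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s)
    return bound / efficiency

# Hypothetical kernel on a hypothetical GPU: 1 TFLOP of work, 1 GB of traffic,
# 100 TFLOP/s peak compute, 1 TB/s peak memory bandwidth.
ideal = roofline_latency_s(1e12, 1e9, 1e14, 1e12)
realistic = predicted_latency_s(1e12, 1e9, 1e14, 1e12, efficiency=0.5)
```

In this made-up example the kernel is compute-bound (10 ms of compute against 1 ms of memory traffic), and an efficiency of 0.5 doubles the predicted latency to 20 ms.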

physics.plasm-ph 2026-06-29

BIT1 extension scales PIC MC simulations to 800 GPUs with resilience

by Jeremy J. Williams, Stefan Costea +14 more

High-Performance Resilient Multi-GPU Hybrid Particle-in-Cell Monte Carlo Simulations at Scale

Hybrid MPI+OpenMP framework adds load balancing and ADIOS2 checkpointing for uniform and non-uniform loads on Frontier, MN5, and LUMI-G.

The increasing demand for high-performance computing in plasma physics has driven scalable and resilient simulation methods capable of efficiently exploiting modern multi-GPU architectures. This work extends a portable hybrid MPI+OpenMP implementation of BIT1, focusing on high-performance resilience for accelerated Particle-in-Cell (PIC) Monte Carlo (MC) simulations under both uniform and non-uniform load conditions. Scalable particle load balancing and robust checkpoint/restart mechanisms across Nvidia and AMD accelerators are integrated with standardized I/O using openPMD and ADIOS2. This leverages BP4 for high-performance file-based checkpointing and SST for in-memory data streaming, enabling efficient data movement, resilient large-scale execution, seamless continuation from existing checkpoints, and effective handling of computational and I/O workloads. Advanced HPC profiling and tracing tools, including Nvidia Nsight Systems and AMD ROC-Profiler with Perfetto, provide detailed insights into computation, communication, and system-level behavior for optimization. Performance results on Frontier (OLCF-5), MN5, and LUMI-G demonstrate strong and weak scaling up to 800 GPUs, validating the framework for large-scale PIC MC simulations, while in-situ analysis and visualization using scalable I/O further enhance scientific insight without interrupting multi-GPU execution on current and future exascale systems.

cs.DB 2026-06-29

Single transaction spans multiple storage pools in DiStash

by Yiming Gao, Hieu Nguyen +2 more

DiStash: A Disaggregated Multi-Stash Transactional Key-Value Store

DiStash coordinates reads and writes on KV copies across DRAM, SSD, and HDD in one atomic step, avoiding separate operations that create inc

A stash is a storage medium such as Dynamic Random Access Memory (DRAM), Solid State Disk (SSD), Hard Disk Drive (HDD), or Non-Volatile Memory (NVM). This paper presents a disaggregated transactional key-value (KV) store, DiStash, that governs KVs across pools of stash types. It enables an application to use a single transaction to read and write different copies of one or more key-value pairs across the different pools of stashes. It simplifies the application logic by preventing (a) undesirable race conditions that may cause copies of data across different stash pools to reflect different values and/or (b) failures that may result in loss of key-value pairs. A configuration of DiStash may use a pool of stashes as either ephemeral or durable storage. The application dictates whether the contents of its participating stashes are inclusive (replicated) or exclusive (tiered). We implement DiStash by extending FoundationDB. We quantify the tradeoffs of its design decisions using microbenchmarks and eBay's production workload. We open source our implementation at https://github.com/ebay-USC/DiStash.

cs.PF 2026-06-29

Mixed precision trims 30 percent from simulation time and energy

by Gülçin Gedik, Robert Schöne +1 more

Mixed-Precision For Energy Efficient Computations

Reactor and hydrodynamics benchmarks keep accuracy while cutting both metrics.

As simulations grow more realistic, the pursuit of higher accuracy results in extended computation times and substantial power consumption. This study explores mixed-precision computing as a promising strategy to address these challenges, leveraging computer arithmetic tools to optimize performance. Using Reactor Simulator and LULESH benchmarks as case studies, we evaluated the potential of mixed-precision strategies to reduce both time-to-solution and energy-to-solution. For Reactor Simulator, we achieved a 30% reduction in both metrics without compromising accuracy. Similarly, for LULESH, results demonstrated up to a 30% improvement in time-to-solution and a 25% reduction in energy-to-solution.

cs.PF 2026-06-26

Cascaded routing keeps 97-99% LLM accuracy at lower cost

by Yasmin Moslem, Magdalena Kacmajor +12 more

Cluster, Route, Escalate: Cascaded Framework for Cost-Aware LLM Serving

Clustering assigns queries to cheap models and quality checks escalate hard cases, cutting time per output token.

Efficient deployment of large language models (LLMs) in production forces a trade-off between accuracy and cost. Operators often default to a single model that is either expensive for easy queries or insufficient for hard ones. To address this challenge, we propose a two-stage cascaded solution. Stage 1 clusters incoming queries and assigns each cluster to its most cost-effective model. The cost budget for this routing process is set by an interpretable hyperparameter, tuned offline. Stage 2 adds a quality estimation (QE) cascade; when an output from Stage 1 is judged low-quality, the query is escalated to a stronger model. This ensures only hard or low-confidence cases reach the expensive models. On the test datasets, the cascaded system retains 97-99% of the strongest model's accuracy while reducing Time Per Output Token (TPOT). It requires only task-correctness labels and adapts to changes in the model pool without manual reconfiguration.
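
The two-stage flow described above (route by cluster, then escalate on low estimated quality) can be sketched as follows. Everything concrete here is a stand-in: the clustering rule, the quality estimator, the threshold, and the toy "models" are hypothetical and only show the control flow.

```python
def route(query, cluster_of, model_for_cluster, models, quality, threshold,
          strong_model):
    """Return (model_name, answer) for a query using a two-stage cascade."""
    # Stage 1: send the query to the model assigned to its cluster.
    name = model_for_cluster[cluster_of(query)]
    answer = models[name](query)
    # Stage 2: escalate to the strong model if the output looks low-quality.
    if name != strong_model and quality(query, answer) < threshold:
        name = strong_model
        answer = models[name](query)
    return name, answer

# Toy stand-ins for real components.
models = {
    "small": lambda q: q.upper(),
    "large": lambda q: q.upper() + "!",
}
model_for_cluster = {"short": "small", "long": "large"}

def cluster_of(query):
    return "short" if len(query) < 10 else "long"

def quality(query, answer):
    # Pretend quality estimator: only queries containing "easy" score well.
    return 0.9 if "easy" in query else 0.2

easy = route("easy", cluster_of, model_for_cluster, models, quality, 0.5, "large")
hard = route("hard", cluster_of, model_for_cluster, models, quality, 0.5, "large")
```

In this toy setup the "easy" query stays on the small model, while the "hard" query is answered twice and the strong model's answer is returned, so the expensive model is only paid for when the quality check fails.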

cs.PF 2026-06-26

Zoning outweighs battery size for ESV fleet profit

by Peng Lin, Cheng Hua +2 more

On-Demand Service Zone Design for Energy-Constrained Spatial Queueing Systems

Energy-constrained hypercube analysis shows zone design must come before battery upgrades, with larger batteries sometimes lowering readines

Electric service vehicles (ESVs), such as mobile chargers and drone-based service units, are becoming an important operational resource for on-demand service systems. Unlike conventional spatial servers, ESV operations are shaped by battery limits and recharging needs, which affect dispatch feasibility and spatial deployment decisions. We develop an energy-constrained hypercube spatial queueing model that embeds battery-state dynamics into the classical hypercube framework and uses a semi-Markov representation to estimate steady-state performance. We then formulate a joint location--zoning problem for station placement and service zone design. The resulting large-scale mixed-integer nonlinear program admits a set partitioning reformulation whose column coefficients are not available in closed form. We therefore develop a Branch-Price-and-Evaluation framework for set partitioning problems with externally computable column coefficients: upper-bounding surrogates guide pricing, and iterative exact evaluation updates the coefficients of active columns. Computational results show that explicit energy modeling significantly reduces false service promises and yields more credible planning decisions. They also reveal a load-dependent reversal in zoning: pooling is preferable under light demand, whereas tighter zoning becomes more profitable as demand increases. Over the tested range, profitability is driven more by zoning than by battery improvement, suggesting that managers should get service zone design right before investing in battery upgrades; this caution is reinforced by the counterintuitive finding that larger batteries may delay replenishment and reduce fleet readiness under sparse demand. These findings show that energy feasibility is not merely a matter of battery-capacity expansion, but a design dimension that shapes service-zone configuration.

cs.PL 2026-06-26

Compiler automates approximation tuning for hyperdimensional computing

by Xavier Routh, Abdul Rafae Noor +5 more

Compiler-Driven Approximation Tuning for Hyperdimensional Computing

ApproxHDC searches the space of possible approximations to deliver performance gains on CPUs, GPUs, and memory accelerators with little accu

As Moore's law reaches its physical and economic limits, domain-specific approaches are increasingly employed to accelerate machine learning workloads. Hyperdimensional Computing (HDC) represents one such emerging paradigm, offering an alternative to conventional deep learning techniques. Rooted in cognitive models of computation, HDC is designed bottom-up with hardware efficiency as a first-class objective. HDC workloads map naturally to heterogeneous hardware platforms, including CPUs, GPUs, and FPGAs, as well as emerging in-memory computing technologies such as Resistive RAM (ReRAM) and Phase-Change Memory (PCM). HDC algorithms are intrinsically tolerant to noise and approximation, enabling substantial performance gains with minimal accuracy loss. In this work, we introduce ApproxHDC, a framework for automated identification and application of domain-specific approximations in HDC workloads. ApproxHDC extends the HPVM-HDC compiler infrastructure to enable retargetable compilation across diverse hardware backends, including CPUs, GPUs, and simulated ReRAM and PCM-based accelerators. The space of possible approximations is exponentially large; ApproxHDC employs efficient search and analysis to navigate it and identify high-impact configurations spanning both software and hardware levels.

cs.IR 2026-06-25

TileMaxSim scores 82M documents per second at 80% HBM bandwidth

by Ashutosh Sharma

TileMaxSim: IO-Aware GPU MaxSim Scoring with Dimension Tiling and Fused Product Quantization

Tiling and fused quantization let MaxSim read each embedding once, cutting ColBERT scoring latency by 98% on H100 GPUs.

Multi-vector retrieval models such as ColBERT achieve state-of-the-art accuracy through fine-grained token-level MaxSim scoring, yet existing GPU implementations leave most hardware performance unused. We give a roofline analysis of MaxSim on modern GPUs and identify a severe bandwidth gap: naive implementations reach only 5-18% of peak HBM bandwidth because they materialize the Nq x Nd similarity matrix, wasting memory traffic on data that is consumed once and discarded. We present TileMaxSim, a family of IO-aware Triton kernels that close this gap via (1) multi-query SRAM tiling that streams document embeddings through shared memory while accumulating per-query-token maxima in registers, reading each embedding from HBM exactly once; (2) dimension tiling that partitions the embedding dimension into 128-wide chunks, enabling scoring for d > 128 embeddings that overflow shared memory; and (3) fused product-quantization scoring via shared-memory lookup tables, cutting HBM I/O by up to ~31x. On NVIDIA H100 GPUs, TileMaxSim reaches 80.2% of peak HBM bandwidth and scores 82M documents/second (71.6M/s on real MS MARCO passages), a 220x speedup over loop-based scoring, 6.5x over fused PyTorch, 6.6-8.5x over torch.compile, and 469x the scoring throughput of WARP's CPU engine on the same node. TileMaxSim preserves exact retrieval quality: on MS MARCO and three BEIR benchmarks, rankings match reference MaxSim. As a drop-in replacement in ColBERTv2/PLAID, it cuts scoring latency at 100K candidates from 268 ms to 1.2 ms (98% lower end-to-end latency). We further show constant throughput from 100K to 500K documents, data-parallel multi-GPU sharding, robustness across dimensions 64-768, and FP16/BF16/FP32 support. Concurrent work independently develops an IO-aware fused MaxSim kernel; we differ in dimension tiling for d > 128 and fused product-quantization scoring.
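
The difference between materializing the Nq x Nd similarity matrix and streaming document embeddings while keeping running maxima can be shown with a small CPU-side sketch. This is plain Python for illustration, not the Triton kernels from the paper; the function names and the `tile` parameter are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim_naive(query_tokens, doc_tokens):
    """Build the full similarity matrix, then take row maxima and sum them."""
    sim = [[dot(q, d) for d in doc_tokens] for q in query_tokens]
    return sum(max(row) for row in sim)

def maxsim_streaming(query_tokens, doc_tokens, tile=2):
    """Stream document tokens tile by tile, keeping one running max per query token.

    Each document embedding is visited exactly once and the similarity
    matrix is never stored.
    """
    best = [float("-inf")] * len(query_tokens)
    for start in range(0, len(doc_tokens), tile):
        for d in doc_tokens[start:start + tile]:
            for i, q in enumerate(query_tokens):
                s = dot(q, d)
                if s > best[i]:
                    best[i] = s
    return sum(best)

# Tiny made-up example with 2 query tokens and 3 document tokens.
Q = [[1, 0], [0, 1]]
D = [[1, 2], [3, -1], [0, 5]]
```

Both functions return the same score for the same inputs; the streaming version only needs memory proportional to the number of query tokens, which is the property that lets a GPU kernel keep the maxima in registers.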

cs.LG 2026-06-25

Auto-computes validated speed-of-light bounds from model code

by Qijing Huang, Sana Damani +10 more

SOLAR: AI-Powered Speed-of-Light Performance Analysis

LLM translates source to intermediate form for analytical computation of theoretical minimum execution times at varying detail levels.

How fast could a deep-learning model run on target hardware, and how far is today's implementation from that limit? These questions are central to software, hardware, and algorithm optimizations. Speed-of-Light (SOL) analysis answers them by computing a workload's theoretical minimum execution time on a given architecture. Yet deriving SOL bounds remains manual, error-prone, and disconnected from rapid model development. To close this gap, we introduce SOLAR, a framework that automatically derives validated SOL bounds from PyTorch and JAX source code. SOLAR leverages both generative and deterministic components in its flow: an LLM frontend translates any source programs into an executable Affine Loop IR, validated by output comparison; a deterministic flow lifts the IR into an einsum graph; and an analytical backend computes unfused, fused, and cache-aware SOL bounds. SOLAR provides comprehensive operator and language coverage, produces validated bounds with zero observed SOL violations, and offers multi-fidelity analysis that tightens bounds and surfaces optimization insights. We evaluate SOLAR across KernelBench, JAX/Flax models, and robotics workloads. These experiments demonstrate four use cases: headroom analysis at multiple fidelity levels, identifying optimization opportunities, cross-platform exploration, and inverse-roofline hardware provisioning.

cs.PL 2026-06-25

SMT verifies tensor transformations without rewrite rules

by Akash Kothari, Shaowei Zhu +2 more

Axon: A Synthesizing Superoptimizer for Tensor Programs

Axon synthesizes kernels for AI accelerators by propagating operators and checking equivalence over unbounded tensor domains.

Writing high-performance kernels for AI accelerators requires deep expertise in tiling, instruction selection, data layout, and operator fusion, placing a significant burden on programmers. In this paper, we focus on tile-based AI accelerator programs and present Axon, a synthesizing superoptimizer for tensor programs: it uses program synthesis to automatically generate target instructions from semantic specifications, and explores semantically equivalent program variants to select the best-performing kernel empirically. Axon discovers algebraic transformations by propagating operators through computation graphs and uses SMT over unbounded tensors to guarantee that all transformations preserve semantics without requiring hand-crafted rewrite rules. It then lowers tensor operations to target ISA instructions, explores tiling configurations constrained by hardware descriptions, and fuses operators and instructions to minimize memory traffic.

cs.DC 2026-06-25

Fused kernels reach 83% of INT8 peak for emulated high-precision GEMM

by Denghui Lu, Alexander Maeder +2 more

EmuGEMM: Fused Tensor Core Kernels for Precision Emulation in Matrix Multiplication

By keeping Ozaki intermediates on-chip, EmuGEMM beats cuBLAS TF32 by up to 1.7x on Hopper and Blackwell at matching accuracy.

Modern GPUs devote an increasing silicon budget to low-precision matrix-multiplication units, widening the precision-throughput gap for scientific computing workloads. Ozaki Schemes I and II offer an alternative by reconstructing high-precision general matrix multiplication (GEMM) from low-precision operations, yet existing implementations leave substantial performance untapped. In particular, intermediate results are repeatedly materialized in global memory, making data movement the dominant bottleneck. We present EmuGEMM, fused integer Tensor Core kernels for NVIDIA Hopper and Blackwell GPUs that eliminate redundant memory round-trips in both Ozaki schemes. Using Scheme I, EmuGEMM sustains up to 1,639 Top/s on Hopper (83% of INT8 peak) and 3,654 Top/s on Blackwell (81%). For large matrices, EmuGEMM surpasses cuBLAS TF32 throughput by up to 1.4x on Hopper and 1.7x on Blackwell, at comparable accuracy. Using Scheme II, EmuGEMM extends to complex arithmetic and outperforms cuBLAS ZGEMM by up to 2.3x on Hopper and 5.5x on Blackwell.

cs.PF 2026-06-25

Direct AMX kernel tops Accelerate by 1.17x on M1 LLM prefill GEMMs

by Deyvik Bhan

Above the Inner Loop: Exceeding Accelerate at LLM Prefill GEMM on the M1 AMX

Bit-exact fp32 path using panel threading and weight pre-packing wins all twelve shapes at S=128 and lifts llama.cpp throughput 1.44x.

On Apple Silicon the fp32 GEMMs dominating LLM prefill are dispatched by Accelerate to a matrix coprocessor (AMX) on the M1-M3. We ask where a hand-written kernel's throughput over Accelerate comes from on the M1 AMX, and reach a structural conclusion: not a faster inner loop. By microbenchmark the inner loop is load-issue bound -- once any operand load interleaves with the FMA32 stream, single-thread throughput falls to a 610-to-680 GFLOPS band, under half the load-free rate. The gain comes from two deployment-level levers Accelerate underuses: fine multi-thread panels filling the M1's second on-chip AMX block (winning the K >= N shapes), and pre-packing the constant weight at load (winning the N > K shapes). A bit-exact direct-AMX kernel using both is the fastest bit-exact fp32 GEMM path we find on the M1: it exceeds all three Accelerate fp32 paths (cblas_sgemm, BNNSMatMul, and the BNNS Graph compiler) at all twelve LLM prefill GEMMs at S = 128 (GPT-2 to Llama-7B scale), leading the fastest, BNNS Graph, by 1.17x -- and by 1.09x at the three shapes where it too holds fp32 -- with geometric means of 1.58x over BNNSMatMul and about 2.0x over cblas_sgemm. Every output is bit-identical to Accelerate, whereas BNNS Graph is bit-exact at only three of twelve shapes, the rest at reduced precision (error up to 1.4e-3). Dropped into llama.cpp for its cblas_sgemm prefill matmuls, it raises measured full-forward throughput from 291 to 420 tokens/s (1.44x, bit-identical) at 128-token prefill -- end-to-end, not a GEMM-only ratio. The contribution is this shape-resolved M1-AMX characterization (microbenchmark, two-block aggregate, per-core occupancy probe), leaving fine-panel scheduling and pre-packing as the only two levers above an inner loop at the hardware limit; mis-tuning the single column-panel width costs nearly 2x.

cs.DC 2026-06-24

AI clusters adjust power use to grid conditions on demand

by Chris Williams, Philip Colangelo +16 more

Power-Flexible AI Data Centers: A New Paradigm for Grid-Responsive Compute

130 kW real-world tests show rapid reductions and load shifting while keeping priority jobs on track

The rapid expansion of artificial intelligence (AI) infrastructure is driving unprecedented growth in electricity demand from data centers. Traditional power-system planning treats large computing facilities as inflexible peak loads, leading to costly infrastructure upgrades and long delays in grid interconnection. Recent work has shown that AI clusters can reduce electricity consumption during peak demand through software-based workload orchestration. This article explores how modern GPU-based AI data centers can operate as grid-interactive assets that respond dynamically to power system conditions. We describe an architecture integrating grid signals, workload scheduling, and power telemetry for fine-grained cluster power control. Experimental results from a real-world deployment on a 130 kW GPU cluster demonstrate multiple forms of flexibility, including rapid load reduction, sustained curtailment, and carbon-aware operation while preserving service levels for priority jobs. We further demonstrate performance-aware load shifting across geographically distributed clusters, enabling workloads to migrate toward regions with lower grid stress. Together, these capabilities transform AI infrastructure from static electricity consumers into flexible resources that support grid reliability, accelerate interconnection, and improve computing sustainability.

math.PR 2026-06-24

Shifted empirical samples give asymptotically optimal Gittins scheduling

by Nicolas Gast, Bruno Gaujal +1 more

Scheduling jobs with unknown size distribution in a M/G/1 queue: the shifted empirical Gittins

For M/G/1 queues with unknown bounded job sizes, n samples yield indices whose policy matches optimal response time in the large-n limit.

In this paper we consider an M/G/1 queue for which we want to minimize the expected response time. We show how to compute indices from $n$ samples of the job size distribution such that the corresponding index policy is asymptotically optimal as $n$ grows. This construction is based on a discretization of the bounded support of the job size distribution and a shift of the samples to their nearest discrete point to the right. We show that the Gittins index of the empirical distribution of these shifted samples is close to the Gittins index of the original distribution. This translates to the asymptotic optimality of the corresponding index policy for minimizing the expected response time. Numerical comparisons with other approaches further confirm the efficiency of our approach.
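The shifting step described above can be sketched in a few lines. This is only an illustration of the "move each sample to the nearest grid point to the right" idea under an assumed uniform grid on `[0, upper]` with `k` cells; the function names are hypothetical, and the Gittins index computation itself is not reproduced here.

```python
import math
from collections import Counter

def shift_right(samples, upper, k):
    """Move each sample to the nearest grid point at or above it.

    The grid splits [0, upper] into k equal cells (an assumed discretization).
    """
    step = upper / k
    return [min(k, math.ceil(x / step)) * step for x in samples]

def empirical_pmf(points):
    """Empirical distribution of the shifted samples, sorted by value."""
    n = len(points)
    return {p: c / n for p, c in sorted(Counter(points).items())}

shifted = shift_right([0.1, 0.3, 0.5, 0.9], upper=1.0, k=4)
print(shifted)                 # [0.25, 0.5, 0.5, 1.0]
print(empirical_pmf(shifted))  # {0.25: 0.25, 0.5: 0.5, 1.0: 0.25}
```

Shifting to the right never underestimates a job's size, and the resulting distribution has at most `k + 1` atoms regardless of the number of samples.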

cs.CL 2026-06-24

LLM prompts beat NER on Brazilian product attribute extraction

by Murilo Gazzola, Hugo Gobato Souto +5 more

AI-PAVE-Br: Leveraging Large Language Models for Enhanced Product Attribute Value Extraction through a Golden Set Approach

AI-PAVE-Br and the new Golden Set dataset raise accuracy for Portuguese e-commerce catalogs.

The explosive growth and complexity of product data within the dynamic Brazilian e-commerce landscape demand robust and specialized methods for structured information extraction. Traditional approaches to Product Attribute Value Extraction (PAVE) often struggle with the linguistic nuances and sheer diversity of product descriptions in Portuguese. To address this critical gap, this paper introduces two major contributions. First, we present AI-PAVE-Br, a specialized system engineered with Large Language Models (LLMs) to perform high-accuracy PAVE specifically for Brazilian e-commerce catalogs. Second, to facilitate reproducible research and provide a definitive benchmark, we introduce and share the Golden Set, a new, meticulously curated, and manually annotated dataset for PAVE in Portuguese. We detail the creation process and structure (Entity, Category, Subcategories) of this high-quality reference set. Our experiments conclusively show that AI-PAVE-Br, leveraging targeted prompt engineering, dramatically outperforms conventional Named Entity Recognition (NER) baselines. This work not only delivers a superior, scalable solution for a major non-English market but also enriches the NLP community with a valuable, publicly available resource for future PAVE research.

cs.DC 2026-06-24

Weight-KV disaggregation cuts P99 TBT 10.4x for cold MoE

by Zhuoren Ye, Tianyu Wo +5 more

CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation

Two separate GPU pools let cold models share KV capacity on aggregate demand instead of reserving per-model peaks.

Emerging LLM services increasingly host many sparse MoE models, yet most models receive sparse requests and remain cold. This creates a GPU memory problem: model weights are stable and model-determined, while KV-cache is transient and demand-determined. Because cold models rarely reach peak KV-cache demand at the same time, reserving worst-case KV capacity per model wastes memory; a shared KV-cache pool can instead provision aggregate active demand. However, KV-cache sharing is not sufficient when weights and KV-cache remain in a monolithic GPU memory pool. Static weights compete with dynamic KV-cache, and KV-head-limited attention under cold, low-concurrency traffic exposes only a fraction of replicated KV capacity, leading to low GPU memory utilization and weak long-context support. We present CrossPool, a serving engine for cold MoE models that separates FFN weights and KV-cache into two GPU memory pools: a weights pool that consolidates FFN weights across cold models, and a KV-cache pool that dynamically serves active requests while keeping attention local to KV-cache. CrossPool combines a KV-cache planner and virtualizer, a layer-wise pipeline scheduler that hides hidden-state transfers, and persistent kernels with control lowering to reduce CPU-GPU control overhead. With efficient GPU memory pooling, CrossPool underpins bursty long-context requests and outperforms the state-of-the-art kvcached-based multi-LLM serving system, reducing P99 TBT by up to 10.4x.
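The memory argument for a shared KV-cache pool can be checked with simple arithmetic: reserving each model's own peak costs the sum of peaks, while a shared pool only needs the peak of the summed demand. The demand traces and model names below are made up for illustration and do not come from the paper.

```python
def per_model_reservation(traces):
    """Sum of each model's own peak KV demand (worst case reserved per model)."""
    return sum(max(trace) for trace in traces.values())

def shared_pool_requirement(traces):
    """Peak of the aggregate KV demand across all models at each time step."""
    return max(sum(step) for step in zip(*traces.values()))

# Hypothetical KV-cache demand in GB over four time steps for three cold models.
traces = {
    "model_a": [8, 0, 0, 1],
    "model_b": [0, 6, 0, 0],
    "model_c": [1, 0, 7, 0],
}
print(per_model_reservation(traces))    # 21
print(shared_pool_requirement(traces))  # 9
```

The gap between the two numbers grows as peaks become less correlated, which is the regime the abstract describes for cold models.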

cs.AI 2026-06-24

DigenRL delivers 1.56-2.1x throughput for diffusion RL

by Sijie Wang, Zhengyu Qing +7 more

Accelerating Disaggregated RL for Visual Generative LLMs with Diffusion-Based Parallelism and Trainer-Assisted Generation

Generation-axis pipelining and trainer-assisted rollout cut bubbles in disaggregated setups for visual generative models.

Reinforcement learning (RL) has become a dominant post-training paradigm, driving the emergence of high-performance RL systems such as veRL for autoregressive large language models (LLMs). In parallel, diffusion-oriented RL algorithms, e.g., DanceGRPO and FlowGRPO, have rapidly expanded the scope of RL from language reasoning to diffusion-based visual and flow-based generation. However, efficient RL systems for diffusion generative LLMs remain underexplored. Existing implementations, e.g., veRL-Omni, still rely on colocated execution, which simplifies synchronization but couples rollout and training resources, limits heterogeneous deployment, and constrains independent scaling. To this end, we introduce DigenRL, a disaggregated RL framework for diffusion-based generative LLMs that supports flexible resource allocation, accommodates heterogeneous GPUs, and facilitates efficient task scheduling. To maximally reduce the execution bubbles in the disaggregated architecture, we propose: 1) a generation-axis pipeline (GAP) and time-step parallelism (TSP) in the diffusion architecture to enable finer-grained pipelining between rollout and training; 2) an elastic trainer-assisted generation (TAG) approach to enable the trainer GPU resources to dynamically assist in executing rollout generations; and 3) a tightly one-step constrained asynchronous strategy to further utilize the tail bubble in the pipeline. Extensive experiments are conducted on three hardware testbeds with 16-32 GPUs using HunyuanVideo-13B, Wan2.1-14B, FLUX.1-12B, and QwenImage-20B generative models. Experimental results show that DigenRL achieves 1.56-2.10x throughput improvements over state-of-the-art diffusion RL systems, veRL-Omni and GenRL.

cs.DC 2026-06-23

VM-GPU bridge cuts confidential LLM throughput 13-27%

by Hang Yin, Kevin Wang

The Serialized Bridge: Understanding and Recovering LLM Serving Performance under Blackwell GPU Confidential Computing

Serialized secure data movement, not GPU compute, explains the loss on Blackwell platforms under TDX.

GPU Confidential Computing (GPU-CC) now preserves GPU-local performance: on NVIDIA B300, BF16 matmul runs at 0.998x of non-confidential performance. Yet LLM serving under Intel TDX plus GPU-CC still loses 13-27% of throughput, and KV-cache restore latency can more than double. This paper studies that gap on two Blackwell platforms, RTX Pro 6000 and B300 HGX, and identifies its dominant cause: the confidential VM-GPU bridge, not GPU compute. We find that GPU-CC turns host/device movement into a serialized, high-setup-cost channel. Secure copies do not gain CUDA-stream concurrency within a context, asynchronous transfers block at the runtime boundary, and small crossings pay a fixed toll. This violates the assumptions of modern inference runtimes, where DMA is expected to be cheap, concurrent, and asynchronous. In vLLM dense decode, the gap closes around 44x-slower small alloc-and-copy operations; targeted patches reject alternative explanations. A scheduling flag recovers 57% of the gap, while a worker-thread drain recovers up to 92% in qualified high-concurrency runs. The same bridge model explains a +131% KV-restore penalty and a 34x model-load slowdown. Blackwell also changes the confidential tenancy unit. We qualify confidential multi-GPU NVSwitch tenants on B300, including 510 GB/s NVLink P2P inside a CVM and concurrent isolated tenants, and identify the remaining fabric-attestation gap for production confidential AI platforms.

cs.PF 2026-06-23

Master core LMS regulator cuts memory slowdowns over Memguard

by Sudarshan Srinivasan, Deepak Gangadharan +1 more

LMS-AR: LMS Prediction-based Adaptive Regulator for Memory Bandwidth in Multicore Systems

Prediction from outside the regulated cores lowers contention effects on SPEC benchmarks by enforcing per-core bandwidth allocations.

Memory bandwidth contention in multi-core systems severely impacts application performance and quality-of-service (QoS) guarantees. Regulating the shared memory bandwidth mitigates the memory performance uncertainty, thereby making it a manageable resource and improving the trustworthiness of multi-core systems. In this work we propose a memory bandwidth regulation mechanism, LMS-AR (LMS Prediction-based Adaptive Regulator), within a Linux kernel module to distribute the memory bandwidth as a resource among the CPU cores. We describe a design in which both monitoring and regulation are enforced from outside by a master core, which is not a dedicated controller for regulation. This allows for plugging in computationally heavy prediction and regulation algorithms without interfering with the regulated core. An adaptive filtering technique was employed for prediction of per-core bandwidth requirement. We conducted several experiments with SPEC CPU 2017 benchmarks distributed across multiple cores. Our proposed approach demonstrated significant improvement over Memguard with respect to slowdown ratios caused by memory contention. Our solution is hosted publicly at https://github.com/ss22ongithub/LMSAdaptiveRegulator.
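As a rough illustration of LMS-style prediction, the sketch below runs a one-step-ahead adaptive filter over a bandwidth trace. It uses the normalized LMS variant with an assumed tap count and step size; the actual filter configuration in LMS-AR is not stated in the abstract, and the function name is hypothetical.

```python
def lms_predict(series, taps=4, mu=0.5, eps=1e-9):
    """One-step-ahead prediction with a normalized LMS adaptive filter.

    Returns one prediction for each sample from index `taps` onward.
    """
    weights = [0.0] * taps
    predictions = []
    for t in range(taps, len(series)):
        window = series[t - taps:t][::-1]  # most recent sample first
        y_hat = sum(w * x for w, x in zip(weights, window))
        predictions.append(y_hat)
        error = series[t] - y_hat
        # Normalizing by the window energy keeps the update stable
        # regardless of the magnitude of the bandwidth samples.
        norm = sum(x * x for x in window) + eps
        weights = [w + mu * error * x / norm for w, x in zip(weights, window)]
    return predictions

# Hypothetical per-core bandwidth samples (MB/s) at a steady rate.
trace = [100.0] * 50
print(lms_predict(trace)[-1])
```

In a regulator of this kind, the predicted demand for the next period would presumably set each core's bandwidth budget; that mapping is outside the scope of this sketch.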

cs.DC 2026-06-23

Struct splitting trims SPH GPU packing time by 20-40%

by Mladen Ivkovic, Abouzied M.A.Nasar +5 more

Memory Layouts for GPU-Data Transfer Buffering in SPH

Access-pattern decomposition of particle data lowers total offloading overhead by 12-25% as transfers dominate runtime.

The rise in GPU compute speed has outpaced improvements in host-to-device memory transfer speeds, despite the advent of shared-memory superchips. Consequently, memory transfer times now constitute an increasingly large fraction of total time-to-solution, compelling developers to compress GPU kernel input and output data into compact, minimal formats prior to GPU-offloading. This complements existing work on GPU- and compute-friendly data arrangements. We study a Smoothed Particle Hydrodynamics solver and propose memory layout strategies for host-side particle data that are particularly well-suited to GPU-offloading. Specifically, we advocate splitting classic array-of-struct data structures into a split array-of-struct arrangement, in which each logical struct decomposes into substructs determined by kernel read/write access patterns and attribute types. Splitting a monolithic particle struct into several bespoke, finer-grained structs can reduce the time required to pack data to and from buffers by ~20% - 40%, lowering total time spent on GPU-offloading by ~12% - 25%.

cs.DC 2026-06-23

GPU offload delivers up to 15x energy efficiency on flagship codes

by Salvatore Cielo, Elmira Birang +6 more

Node-Level Performance and Energy Characterization of Flagship Science Applications on SuperMUC-NG Phase 2

Single-node tests on SuperMUC-NG Phase 2 find 4-12x throughput gains for molecular dynamics and astrophysics workloads, but gains shrink wit

We present a systematic performance and energy-efficiency characterization of five flagship scientific workloads on SuperMUC-NG Phase 2, the 28 PetaFLOPs system at the Leibniz Supercomputing Center (LRZ) equipped with Intel Xeon Platinum 8480+ and Intel Data Center GPU Max 1550 (Ponte Vecchio, PVC) accelerators. The selected codes span molecular dynamics (gromacs, lammps), astrophysics and cosmology (OpenGadget3, AthenaK), and finite-element PDE solvers from the dealii-X Center of Excellence. For each code we measure throughput and energy efficiency expressed as compute-elements per wall-clock second (or per Joule of consumed energy) on a single compute node, comparing CPU-only (SPR) against combined CPU+GPU (SPR+PVC) configurations where available. Energy measurements rely on lightweight code instrumentation with p3em, or the Energy Aware Runtime (EAR) present on the system. Our results show that GPU offload yields 4-12x higher throughput and up to 15x better energy efficiency compared to CPU-only execution, with lammps and AthenaK benefiting most. However, both throughput and energy gains are sensitive to problem granularity: insufficient work per GPU tile erodes the accelerator advantage, as clearly observed in AthenaK at small mesh-block sizes. The power-budget utilization is systematically lower for CPUs than it is for GPUs, indicating that even at peak useful-work rate, most applications running on CPUs leave a significant fraction of the node's thermal envelope unused.

cs.AI 2026-06-23

Bloom filter counters supply certainty to ML models

by Yuval Banoun, Daniel Sadoc Menasche +1 more

Learning Filters with Certainty

Counting Bloom Filters keep numbers instead of bits; those numbers improve accuracy when passed to combined machine learning systems.

Hash-based data structures such as Bloom filters are widely used in network systems for tasks including caching, anomaly detection, and machine learning pipelines. They typically provide binary indications of whether an element belongs to a set of interest, e.g., the contents of a cache. When uncertainty arises due to hash collisions, a positive indication is returned to avoid false negatives. We argue that the certainty associated with such indications can itself be useful information. This work focuses on Counting Bloom Filters (CBFs), a Bloom-filter variant that maintains counters rather than bits. Besides supporting insertions and deletions, these counters provide additional information that can be used to estimate the certainty of positive membership indications. We show how this certainty signal can be exploited in architectures that combine Bloom Filters with machine learning (ML) models.
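For readers unfamiliar with Counting Bloom Filters, a minimal sketch follows. The salted SHA-256 indexing and the `min_count` accessor are illustrative choices; in particular, exposing the smallest counter as a raw feature for a downstream model is an assumption about how a certainty signal might be surfaced, not the estimator proposed in the paper.

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.counters = [0] * size

    def _indexes(self, item):
        # Derive k indexes by salting a single hash function (illustrative).
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for idx in self._indexes(item):
            self.counters[idx] += 1

    def remove(self, item):
        # Only call this for items that were previously added.
        for idx in self._indexes(item):
            if self.counters[idx] > 0:
                self.counters[idx] -= 1

    def min_count(self, item):
        """Smallest counter touched by the item; 0 means definitely absent."""
        return min(self.counters[idx] for idx in self._indexes(item))

    def __contains__(self, item):
        return self.min_count(item) > 0

cbf = CountingBloomFilter()
cbf.add("flow-42")
cbf.add("flow-42")
print("flow-42" in cbf, cbf.min_count("flow-42"))
```

A plain Bloom filter would only return the boolean; the counter values are the extra information that the abstract argues can be passed along to an ML model.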

cs.PF 2026-06-22

Enriched prompts lift edge LLM accuracy to 89% on outdoor sensor data

by Aygün Varol, Katarzyna Kołodziej +6 more

Enabling Cloud-Level Accuracy in Edge AI through IoT Data Preprocessing

Preprocessing raw readings into text descriptions lets local models handle air quality and comfort queries at 0.22s latency

Large language models (LLMs) offer a natural-language interface for interpreting Internet of Things (IoT) sensor data in smart environments; however, cloud deployment introduces latency, privacy, and connectivity concerns. Local LLMs can reduce these limitations, but compact edge-deployable models often show weaker numerical reasoning when raw sensor readings are provided directly. This paper investigates whether prompt-side preprocessing can improve the accuracy-latency trade-off of local LLMs for environmental monitoring. We propose a structured prompt construction framework that transforms raw air-quality and thermal-comfort measurements into progressively enriched textual representations: raw sensor values, threshold-aware descriptions, and compact environmental summary flags. The approach is evaluated using indoor Raspberry Pi/BME680 datasets from Tampere University and outdoor air-quality datasets from Helsinki, Katowice, and Warsaw. We construct a binary LLM query dataset covering air quality, thermal comfort, and joint environmental conditions, and evaluate five local and five cloud LLMs across three prompt variants and two inference modes, with and without chain-of-thought prompting. Results show that prompt enrichment substantially improves local-model accuracy. In No-CoT mode, local accuracy increases from 50.9% to 81.7% indoors and from 63.7% to 89.3% outdoors from the raw to the most enriched prompt. Local No-CoT inference is the fastest configuration, with mean latency close to 0.22 s, while CoT substantially increases inference time. These findings suggest that lightweight prompt-side preprocessing can narrow the local-cloud performance gap and support low-latency IoT analytics in smart environments.

cs.DB 2026-06-22

Decode throughput stays constant across bit widths

by Madhulatha Mandarapu, Sandeep Kunkunuru

When Is a Columnar Scan Bandwidth-Bound? A Decode-Throughput Law and Its Cross-Hardware Validation

A one-parameter law predicts the bandwidth fraction columnar scans achieve on x86 and Apple silicon

A columnar scan that decompresses, filters, and aggregates should be limited only by memory bandwidth (the roofline floor T >= BytesRead/beta), yet real kernels are often compute-bound and leave bandwidth idle. We give a predictive answer to when a scan is bandwidth-bound. Across encodings, predicate selectivities, and two very different machines, a decoder's value throughput T_dec (values decoded per second) is essentially independent of bit-width b: it is set by the decode layout/strategy, not by how many bits each value occupies. Hence the achieved bandwidth fraction obeys a one-parameter law, f = min(1, T_dec * b / (8*beta)), with the compute-to-bandwidth ridge at b* = 8*beta/T_dec. Fitting one T_dec per strategy reproduces measured bandwidth fractions with median error 0.027 on x86/AVX2 and 0.003 on a held-out Apple M4/NEON machine, and the ridge b* shifts correctly with each machine's bandwidth. Inserting FastLanes' reported decode throughput into the law reproduces its "decode is free at three bits" headline as the large-T_dec limit, unifying our portable decoder and hand-tuned state of the art in one curve. We add two crossovers, validated on both machines: branch-free predicate evaluation beats branchy in a mid-selectivity band (the sigma(1-sigma) misprediction parabola), and zone-map skipping is clustering-gated rather than selectivity-gated. We release the micro-benchmark, the correctness oracle, and a one-command reproduction. This is a baseline and a model, not a faster kernel: our portable C decoders reach ~2 values/cycle, far below hand-tuned SOTA, and the law holds precisely because it is parameterized by the measured T_dec.
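The one-parameter law stated above is easy to evaluate directly. The sketch below encodes f = min(1, T_dec * b / (8*beta)) and the ridge b* = 8*beta/T_dec exactly as written in the abstract; the function names and the example throughput and bandwidth values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def bandwidth_fraction(t_dec, bits, beta):
    """f = min(1, T_dec * b / (8 * beta)).

    t_dec: decoder throughput in values per second
    bits:  encoded bit-width b of each value
    beta:  memory bandwidth in bytes per second
    """
    return min(1.0, t_dec * bits / (8.0 * beta))

def ridge_bits(t_dec, beta):
    """b* = 8 * beta / T_dec; at or above this width the scan is bandwidth-bound."""
    return 8.0 * beta / t_dec

# Hypothetical machine: 20 GB/s bandwidth, decoder at 20 billion values/s.
T_DEC, BETA = 2e10, 2e10
print(ridge_bits(T_DEC, BETA))              # 8.0
print(bandwidth_fraction(T_DEC, 4, BETA))   # 0.5
print(bandwidth_fraction(T_DEC, 16, BETA))  # 1.0
```

Because T_dec is treated as independent of b, narrower encodings move fewer bytes per decoded value and therefore use a smaller share of the available bandwidth, which is why compute-bound behavior shows up at small bit-widths.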

cs.AR 2026-06-22

Reverse engineering maps Apple Neural Engine internals

by Spencer H. Bryngelson

Apple Neural Engine: Architecture, Programming, and Performance

Datapath, compiler format, weight compression, and command protocol detailed across A11 to A18 and M1 to M5 chips.

The Apple Neural Engine (ANE) is the fixed-function matrix accelerator that has shipped in Apple systems-on-chip since the A11-class iPhone and iPad chips and the M1-class Mac chips, exposed to applications only through the Core ML model framework. This guide reports a reverse-engineered account of the engine, based on direct measurement on Apple silicon and static analysis of the private runtime, compiler, kernel driver, and firmware. It documents the datapath and the roofline that bound the engine's throughput and energy, the dispatch route that reaches it below Core ML, the compiler and on-disk program format, the weight-compression scheme, and the kernel driver, firmware, and command protocol beneath them. The account covers the A11 through A18 and M1 through M5 families, with per-chip target tables and an operation-by-device matrix; the direct measurements are on the M1 and M5. Claims are labeled as measured, decompile-derived, or predicted, and the methodology and open questions are recorded. The direct route is callable from ordinary user space but remains undocumented, unsupported, and version-fragile; it is intended for measurement, research, and on-device work, not for shipping software, where Core ML remains the supported path.

cs.LG 2026-06-22

Recorded traffic calibration cuts ML serving error to 2-6%

by Amr S. Abdelfattah, Nakul Tirumalai +5 more

Load Testing for Machine Learning Model Serving Systems at Scale

Adaptive load testing framework in 14 case studies reduces under-provisioning incidents and improves GPU use.

Machine learning (ML) model serving has become a dominant consumer of GPU infrastructure, yet capacity planning in these systems remains largely ad hoc. Under-provisioning leads to service-level objective (SLO) violations and production incidents, while over-provisioning results in substantial resource waste. This paper presents \sys, an industrial load testing framework for ML serving systems that systematically estimates serving capacity through an adaptive, feedback-driven search strategy. The approach leverages real-time performance signals, incorporating dampening, spike tolerance, and convergence detection to efficiently identify maximum sustainable throughput under SLO constraints. We evaluate \sys through a longitudinal analysis of 14 industrial case studies spanning four ML architecture classes: recommendation, ranking, vision, and NLP. This study demonstrates that systematic load testing leads to substantial improvements in GPU resource efficiency and operational reliability. Prior to adopting \sys, a significant fraction of model launches were under-provisioned, resulting in recurring incidents; these issues were substantially reduced after deployment. Our results show that ML-specific design decisions are critical to accurate capacity estimation: workload calibration using recorded traffic reduces estimation error from approximately 30% to 2-6%, while proper warmup handling yields a 22.2% improvement in accuracy. Further analysis reveals key factors influencing prediction error, including model size and co-location effects. This paper distills six lessons and derives architectural guidelines for ML load testing, offering actionable insights for building reliable and efficient ML serving systems.

cs.DC 2026-06-22

Shared-memory pattern hits 54.7B market events per second

by Shakya Jayakody, Prarthinie Jayakody

KineticSim: A Lightweight, High-Performance Execution Engine for Real-Time Market Simulators

Persistent state in thread blocks and cooperative clearing cut critical path from linear to log-plus-ceil and remove per-step global writes.

Simulating financial markets at scale with multi-agent (Agent-Based) models is critical for market design, regulatory stress-testing, and reinforcement learning, but traditional CPU simulators are bottlenecked by sequential processing while vectorized GPU frameworks suffer from kernel-launch overhead and redundant global-memory round-trips. We formalize, analyze, and evaluate a reusable parallel design pattern: persistent, state-carrying clearing for iterative multi-agent reductions. By caching mutable simulation state in thread-block shared memory across step boundaries, aggregating agent actions via shared-memory atomics, and resolving the clearing function cooperatively, the pattern reduces the per-step critical-path depth from Theta(L+A) for sequential clearing (L price-grid ticks, A agents) to Theta(log L + ceil(A/L)) and makes global-memory traffic independent of the step count. We implement this in KineticSim, a lightweight GPU execution engine that simulates massive ensembles of limit-order books in parallel, reaching a peak throughput of over 54.7 billion agent-events per second. On a fixed workload it delivers speedups of 3406x over CPU (NumPy), 27.8x over PyTorch GPU, 42.8x over JAX GPU, and 8.4x over a naive custom CUDA baseline, while using roughly an order of magnitude less GPU memory than PyTorch. Across 53 configurations the two custom CUDA engines produce bitwise-identical order books, and aggregate statistics match the CPU reference to within 0.1%. The pattern generalizes to other iterative multi-agent workloads requiring state-persistent, block-localized reductions.
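To get a feel for the stated critical-path reduction, the two asymptotic expressions can be evaluated with all hidden constants set to one and a base-2 logarithm. Both simplifications are assumptions made here for illustration, so the outputs indicate orders of magnitude rather than measured step counts.

```python
import math

def sequential_depth(ticks, agents):
    """Theta(L + A) with unit constants: one pass over agents and price ticks."""
    return ticks + agents

def cooperative_depth(ticks, agents):
    """Theta(log L + ceil(A / L)) with unit constants and a base-2 logarithm."""
    return math.ceil(math.log2(ticks)) + math.ceil(agents / ticks)

# Hypothetical configuration: 1024 price-grid ticks and 100,000 agents.
L_TICKS, A_AGENTS = 1024, 100_000
print(sequential_depth(L_TICKS, A_AGENTS))   # 101024
print(cooperative_depth(L_TICKS, A_AGENTS))  # 108
```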

cs.PF 2026-06-22

MoE models run slower than dense ones on edge hardware

by Alfarizy Alfarizy, Hung Truong Thanh Nguyen +3 more

Does Mixture-of-Experts Actually Help Inference on Consumer and Edge Hardware? An Empirical Study

Benchmarks find total parameter count, not active parameters, sets inference cost when memory bandwidth is the limit.

Mixture-of-Experts (MoE) language models are often described as ideal for resource-constrained inference. Each token activates only a small subset of experts, so the per-token compute cost, in floating-point operations (FLOPs), resembles that of a much smaller dense model. Whether that FLOP advantage survives in practice is far less clear. We ask whether MoE models actually run faster and cheaper than comparable dense models on consumer-grade and edge hardware. We benchmark OLMoE-1B-7B (1.3 B active of 6.9 B total) against three dense baselines on an Apple M2 Pro and an NVIDIA Jetson Orin Nano 8 GB through llama.cpp, measuring throughput, memory, and on-device energy. The answer is device-dependent: OLMoE's active-parameter advantage is only partly realised on the laptop (~10% behind the same-active Llama-3.2-1B) and erodes on the edge device (~31% behind, at 2.1x the energy per token, with peak memory at the 8 GB ceiling). Patching llama.cpp to time the decode graph node-by-node shows routing accounts for under 9% of MoE-block compute on the cleaner edge backend, so the gap reflects total-parameter memory footprint, expert dispatch, and KV-cache pressure rather than routing. The implication is that on bandwidth-bound edge hardware, inference cost tracks total parameters, not active ones, and sparse activation does not buy back what the device is constrained on. These findings are bounded to one MoE model at this parameter scale and two devices, and we release the full measurement harness and per-run data.

cs.LG 2026-06-19

4-bit KV cache cuts agent TTFT by 3.47x in late rounds

by Inesh Chakrabarti, David Limpus +10 more

UltraQuant: 4-bit KV Caching for Context-Heavy Agents

Yields 2.3x overall first-token speedup and 1.63x higher output rate versus FP8 on multi-turn long-context tasks

Context-heavy agents place unusual pressure on the key-value (KV) cache: long prefixes are reused across many short turns, while concurrency determines whether the serving system can keep GPUs utilized. We study 4-bit KV-cache compression for this setting, using TurboQuant-style rotation and codebook quantization as a quality anchor and vLLM FP8 KV caching as the deployment anchor. We report three contributions. First, we frame 4-bit KV caching around multi-round agent workloads where task quality, cache residency, and serving throughput must be measured jointly. Second, we describe the practical design choices needed to make the 4-bit path robust, including asymmetric K/V treatment, Walsh-Hadamard rotation, QJL removal, and block-scale variants. Third, we present serving optimizations on AMD GPUs, including optimized decode-attention kernels and UltraQuant, an FP4 approximation path that uses FP8 queries, FP4 KV tensors, UE8M0 group scales, and native scaled-MFMA support on CDNA4. On a long-context, multi-turn agentic workload, UltraQuant cuts P50 time-to-first-token by 3.47x in the cache-pressured late rounds (2.3x across all rounds) and raises output throughput by 1.63x over the FP8 KV baseline.

cs.PF 2026-06-19

FP16 rounding leaves SparseStack sketch quality unchanged

by Aryaman Jeendgar, Clément Flint +1 more

Randomized Sketching is Robust to Low-Precision Rounding on GPUs

Across incoherent, coherent and adversarial inputs the sketch distribution, not the quantization rule, determines embedding accuracy.

Randomized sketching is a core primitive in randomized numerical linear algebra. On modern hardware architectures, in particular on GPUs, the performance of sparse sketches is limited by memory traffic and atomic accumulation rather than floating-point throughput. This makes sketching a natural target for mixed precision, provided that low-precision accumulation does not degrade the embedding quality. We study mixed-precision GPU implementations of sparse oblivious subspace embeddings, focusing on a SparseStack generalization of the GPU CountSketch kernel of Higgins et al. SparseStack improves embedding quality relative to CountSketch on coherent inputs, but its additional nonzeros per column increase atomic-update contention and reduce throughput. We therefore implement FP16 SparseStack variants using deterministic round-to-nearest, exact stochastic rounding, and dithered rounding, and compare them with FP32 SparseStack, CountSketch, mixed-precision CountSketch, and FlashSketch. Our main empirical finding is that, for the tested regimes, SparseStack embedding quality is insensitive to the FP16 rounding rule. Deterministic, stochastic, and dithered rounding FP16 SparseStack produce nearly identical subspace distortion and sketch-and-solve least-squares accuracy across incoherent, coherent, and adversarial test problems. The dominant accuracy factor is the sketch distribution rather than the quantization rule: SparseStack variants substantially improve distortion on coherent inputs, while all methods behave similarly on incoherent inputs. Since deterministic rounding has the lowest overhead, it provides the best performance--accuracy tradeoff among the FP16 SparseStack variants.

cs.DB 2026-06-17

Greedy log flush matches best tuned timer above load threshold

by Madhulatha Mandarapu, Sandeep Kunkunuru

Group Commit Self-Clocks: Why Tuning Is Unnecessary Above a Device-Set Load Threshold

Closed-loop client behavior makes the optimal wait time fall below flush cost so tuning adds no value

Group commit amortizes the fixed cost of a durable log flush across many committing transactions; the release rule - a timer, a batch size, or an adaptive policy - is a classic tuning knob. The textbook theory is open-loop: for Poisson arrivals the optimal timer is the EOQ square-root rule, and the wait-or-flush decision is ski-rental 2-competitive. We ask when that tuning is worth its machinery, and show that in closed-loop OLTP it usually is not. Real commit arrivals are closed-loop: a client issues its next transaction only after its last commits, so the arrival rate is induced by the policy's own latency. Modeling this as a closed queueing network, the parameter-free greedy-pipelined policy (flush the instant the device is free) self-clocks to a computable fixed point and is within about 0.1% of the best oracle-tuned timer at every load. The square-root rule prescribes waiting $T^\star=\sqrt{2F_0/\lambda}$, but $T^\star<F_0$ exactly when $\lambda>\lambda^\star=2/F_0$; above this device-set load threshold the timer collapses onto greedy and tuning is vacuous. The clean theory only bites below $\lambda^\star$ and in the open-loop world, where a parameter-free ski policy still beats a fixed tuned timer under rate shifts. We instantiate $\lambda^\star$ with measured fsync distributions on two AWS storage classes (EBS gp3 versus instance NVMe, a $25\times$ range), and confirm on PostgreSQL that commit_delay=0 is competitive with any tuned value. The contribution is a characterization that explains deployed practice; we add no new logger.
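The threshold quoted in the abstract above follows from one line of algebra: $T^\star=\sqrt{2F_0/\lambda}<F_0$ rearranges to $\lambda>2/F_0$. A minimal Python sketch of that arithmetic is given here, assuming $F_0$ is the fixed flush cost in seconds and $\lambda$ the commit arrival rate per second. The function names and the 1 ms flush cost are hypothetical illustrations, not values or code from the paper.

```python
import math


def sqrt_rule_timer(flush_cost: float, arrival_rate: float) -> float:
    """Open-loop square-root timer T* = sqrt(2 * F0 / lambda), as quoted in the abstract."""
    if flush_cost <= 0 or arrival_rate <= 0:
        raise ValueError("flush_cost and arrival_rate must be positive")
    return math.sqrt(2.0 * flush_cost / arrival_rate)


def load_threshold(flush_cost: float) -> float:
    """Device-set load threshold lambda* = 2 / F0, as quoted in the abstract."""
    if flush_cost <= 0:
        raise ValueError("flush_cost must be positive")
    return 2.0 / flush_cost


def timer_collapses_to_greedy(flush_cost: float, arrival_rate: float) -> bool:
    """True when the prescribed wait T* is shorter than one flush (T* < F0)."""
    return sqrt_rule_timer(flush_cost, arrival_rate) < flush_cost


if __name__ == "__main__":
    f0 = 1e-3  # hypothetical 1 ms flush cost, chosen only for illustration
    print(f"threshold: {load_threshold(f0):.0f} commits/s")
    for rate in (500.0, 1000.0, 4000.0, 8000.0):
        t_star = sqrt_rule_timer(f0, rate)
        print(f"rate={rate:6.0f}/s  T*={t_star * 1e3:.3f} ms  "
              f"collapses={timer_collapses_to_greedy(f0, rate)}")
```

The sketch only evaluates the open-loop formulas; it does not model the closed-loop fixed point that the paper analyzes.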

quant-ph 2026-06-17

Protocol sets optimal activation times for quantum network links

by Vinay Kumar, Claudio Cicconetti +2 more

Optimal Calibration of Quantum Network Links

Analytical method for linear repeater chains meets any end-to-end fidelity target by balancing each link's uptime against calibration downti

The reliable distribution of entanglement is essential for the effective operation of quantum networks. Due to fundamental differences between quantum and classical communication systems, it is necessary to develop specialised algorithms and protocols that also account for quantum-specific constraints. In this work, we focus on the issue of recalibration. As suggested by recent experimental studies, the process of local entanglement generation in a quantum link degrades over time due to environmental changes that have to be estimated and compensated via a calibration operation, during which the link is not available. Therefore, in such a quantum network, every link alternates between an activation period, during which it operates normally, and a calibration period, during which it cannot participate in the end-to-end entanglement distribution, thereby creating a trade-off between link quality (the fidelity of generated pairs, which decays during activation) and availability (the fraction of time the link is usable, which calibration reduces). We develop analytically a protocol for optimally assigning activation periods to each link in linear quantum repeater chains, subject to any general end-to-end fidelity requirements and local initial fidelity thresholds. Building on this foundation, we extend to general quantum networks, where multiple paths may cross at common links, proposing a heuristic approach evaluated in simulations and compared with a benchmark, numerical approach, and theoretical bounds.

cs.PF 2026-06-16

Consistent contrast estimators fix benchmarking in stateful systems

by Gábor Melis

The Right Call for Software Benchmarking: Consistent Decisions in Stateful Environments

Program-specific biases cancel in simple experiment designs, enabling correct identification of the fastest program without modeling dynamic

In the perpetual pursuit of performance, modern computing systems rely ever more on stateful mechanisms to accommodate the dynamics of workloads and physical environments, bolstering efficiency but confounding benchmarking and thereby the optimization of software. Indeed, by their nature, adaptive mechanisms introduce temporal dependencies between measurements and render naive estimators of individual program performance biased. Observing that rectifying such biases necessitates speculative assumptions about system dynamics, we call for prioritizing performance differentials over absolute measures and formalize software benchmarking as the decision problem of identifying the fastest program, for which relative knowledge suffices. To this end, we propose simple experiment designs admitting consistent estimators of contrasts, whereby program-specific biases cancel under tenable assumptions. These designs asymptotically yield the correct decision and afford a robust methodology for finite-budget benchmarking in stateful environments, bearing broad implications for the development of performance-sensitive software.

astro-ph.IM 2026-06-16

GPU OpenGadget3 matches CPU results in all tests

by A. Ragagnin, G. S. Karademir +7 more

OpenGadget3 GPU solver tests

Gravity, hydro, cluster and galaxy runs agree within small-scale noise while delivering 2-3x chip-to-chip speedup.

We present an in-depth evaluation of the scalability and accuracy of the GPU porting of the N-body code for hydrodynamic cosmological simulations OpenGadget3. While technical details of our GPU porting were presented in Ragagnin et al. (2020), in this work we focus on assessing the accuracy of the ported modules: the short range gravity integrator, the different components of the hydrodynamic solver, and the conjugate gradient solver for thermal conduction. We ran several tests that gradually increase the number of physical modules included: a gravity-only cosmological simulation; a hydrodynamical shock tube test; a non-radiative zoom-in simulation of a galaxy cluster in a cosmological box; and a full-physics zoom-in simulation of a galaxy in a cosmological box. Comparing the results obtained with the GPU implementation to those from the classical CPU version, we find excellent agreement across all tests, with small differences on very small scales. For the individual physical modules, we find a GPU chip-to-chip speedup of $\approx3-5$. For more complex cosmological and hydrodynamical setups, where a large number of physical processes and overheads contribute to the total workload, the observed total chip-to-chip speedup (with the same number of nodes and CPUs per node) is $\approx2-3$. We ran our tests on four different supercomputers: Leonardo Booster (CINECA), MareNostrum-V (BSC), SuperMUC-NG2 (LRZ), and the CIP cluster of the Faculty of Physics at the Ludwig-Maximilians-Universität (LMU).

cs.CR 2026-06-16

Fractional Verkle Trees cut state root time to 91 microseconds

by Ekleen Kaur, Everton Fraga

Fractional Verkle Trees: A Hypertree Decomposition and Verified Proof Serialization Architecture for High-Performance Blockchain State Accumulators

Hypertree split into sub-accumulators enables parallel updates and removes 4.85 PB of yearly network traffic across 6,000 nodes.

Modern blockchain state management faces a critical scalability bottleneck: maintaining cryptographic commitments over hundreds of millions of entries becomes computationally prohibitive. Ethereum's transition to Verkle Trees, polynomial commitment accumulators that reduce proof sizes from O(width * depth) to O(depth) via constant-size IPA vector commitments, is a critical step toward stateless operation. Yet current implementations exhibit pathological characteristics that burden home validators. We identify four inefficiencies in the reference go-verkle implementation \cite{kaur2025goverkle, kaur2025goethereum}: (1) phantom node creation during non-existent account deletion; (2) 64-byte database keys triggering excessive LSM-tree compaction; (3) redundant memory copying in proof deserialization; (4) a Proof of Absence wire format incompatibility causing non-deterministic serialization. We present Fractional Verkle Trees (FVT), a hypertree decomposition partitioning global state into N independent sub-accumulators coordinated by a Merkle commitment tree, achieving improved cache locality, zero-lock-contention goroutine-parallel commitment computation, and faster root recomputation (91 $\mu$s vs $\sim$500 ms). We address each inefficiency via existence checks, 32-byte SHA256 node references, zero-copy reference-counted buffers, and HashMap-based lexicographic deduplication. Benchmarks on Apple M1 Pro show 57% heap allocation reduction (566,760 to 242,004 bytes per 10K proofs), parallel insertion at 2,433 ns/op, and network-wide elimination of 4.85 PB/year across 6,000 full nodes, advancing the Ethereum stateless roadmap.

cs.PF 2026-06-16

Memory clock state cuts edge-inference misses from 28% to 1.3%

by Jaehoon Kang

Edge-Inference Governors Need Memory-Clock State

Blind GPU-only models miss 25-28% of cycles at tight deadlines; EMC tables select the energy-minimal feasible clock under 2% QoS budget.

Frequency-aware latency estimators let deadline-aware DVFS governors schedule edge ML inference by modeling latency over CPU and GPU clocks, but they cannot observe the memory clock (EMC) -- a missing deployment state that decides whether a governor meets its deadlines and at what energy. We show this with a deployed, measured governor on a Jetson Orin NX: an EMC-blind GPU-only fit misses 25-28% of cycles at tight deadlines, whereas an EMC-aware refit holds misses to at most 1.3% under a 2% QoS miss budget by selecting a budget-feasible clock -- the energy-minimal one for periodic vision (calibrated module-rail power). The failure generalizes across three workload classes -- MobileNetV2, a ViT transformer, and Qwen2.5 LLM token decode (where saturated decode makes the aware policy lower-energy than the infeasible blind choice): a CPUxGPU estimator sends the deployed governor to an infeasible operating point, and only an EMC-aware model identifies the feasible side of the energy frontier. The effect is real and outside the CPUxGPU state abstraction: across two Orin SKUs sharing the same lockable EMC points it shifts median latency by up to ~45%, replicates on both, and survives a fused TensorRT fp16 engine. CPUxGPU models do not absorb it: per-lockable-point EMC tables are needed, a scoped inversion shows monotone assumptions can pick the wrong direction, and clustered misses make aggregate QoS rates understate deployment risk. We release the harness; this complements, not rebuts, the state of the art within its CPUxGPU scope.

cs.AR 2026-06-12

Adaptive routing cuts price of anarchy 3.1x in GPU inference

by Athos Georgiou (NCA)

The Price of Anarchy in Disaggregated Inference

Saturation raises selfish costs between prefill and decode pools; controller mitigates at 13% throughput cost

Disaggregated inference architectures physically separate prefill and decode phases onto distinct GPU pools, creating competing "agents" that share a fixed hardware budget. We provide, to our knowledge, the first formal game-theoretic analysis of this architecture, using NVIDIA Dynamo as a concrete case study. We model disaggregated serving as three coupled games: a two-player resource game between prefill and decode pools, a selfish caching game over the hierarchical KV cache, and a congestion game with positive externalities for request routing. We empirically validate the latter two; the P/D resource game is treated analytically (Section 9.2). We characterize how GPU saturation induces regime transitions that shift the game's payoff structure: below saturation, selfish behavior has bounded Price of Anarchy (PoA); at saturation, superlinear latency and cache externalities drive our empirical estimator PoA-hat (defined in Section 6.4) upward. Based on this analysis, we design an adaptive controller that detects saturation transitions in real time and adjusts routing parameters accordingly, shifting from cache-affinity exploitation to load-balanced congestion avoidance. We instantiate our framework on a 3-node NVIDIA B200 cluster running Dynamo with two models, Nemotron-4-340B (TP=8, full-node workers with cross-InfiniBand KV transfers) and Llama-3.1-70B (TP=4), and find the same three-regime PoA-hat structure with the same first post-knee grid point (C=128) on both models. Adaptive routing shifts each model to a better operating point. Our strongest result is on the 70B 1P/5D topology, where PoA-hat drops 3.1x (66.4 to 21.5) in the saturated phase at a 13% throughput cost. On the 70B 1P/2D, PoA-hat drops 2.2x and TTFT P99 drops 7.6x (see Section 8.5).

cs.PF 2026-06-12

FIFO packet delay bounded by virtual delay

by Yuming Jiang

Beyond Virtual Delay: Improving Packet Delay Bound in Network Calculus

Maximum packet delay never exceeds maximum virtual delay, so a new bound derived from the curves is strictly better for leaky-bucket and rat

In network calculus, a fundamental result is the classical delay bound given by the horizontal deviation between the arrival and service curves. While widely used, the classical bound is derived from the notion of virtual delay. For a FIFO system, in this work, we first show that the maximum packet delay is always upper-bounded by the maximum virtual delay, revealing inherent conservatism when applying the virtual-delay-based bound to packet delay. Motivated by this insight, we revisit packet delay analysis and derive a new packet delay bound that requires no assumptions beyond the arrival and service curves. Specializing the new bound to a system with leaky-bucket arrival curve and rate-latency service curve shows strict improvement over the classical bound, which is further demonstrated through a case study in time-sensitive networking (TSN).
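For readers unfamiliar with the classical bound mentioned above, the following Python sketch computes the horizontal deviation for the leaky-bucket and rate-latency case. It assumes the standard textbook forms, an arrival curve $\alpha(t)=\sigma+\rho t$ and a service curve $\beta(t)=R\,\max(0,t-T)$ with $\rho\le R$, for which the classical bound is $T+\sigma/R$. The function names and sample numbers are hypothetical, and the paper's improved packet delay bound is not reproduced here because the abstract does not state it.

```python
def classical_delay_bound(sigma: float, rho: float, rate: float, latency: float) -> float:
    """Closed-form horizontal deviation T + sigma / R for leaky-bucket arrivals
    (burst sigma, rate rho) and a rate-latency server (rate R, latency T)."""
    if rate <= 0:
        raise ValueError("service rate must be positive")
    if rho > rate:
        raise ValueError("bound is infinite when the arrival rate exceeds the service rate")
    return latency + sigma / rate


def horizontal_deviation_numeric(sigma: float, rho: float, rate: float, latency: float,
                                 horizon: float = 10.0, steps: int = 1000) -> float:
    """Grid approximation of sup_t inf{d >= 0 : alpha(t) <= beta(t + d)}.

    For each t, the smallest d with R * (t + d - T) >= alpha(t) is
    T + alpha(t) / R - t, clipped at zero. The burst is treated as already
    present at t = 0 (right limit of the arrival curve).
    """
    worst = 0.0
    for i in range(steps + 1):
        t = horizon * i / steps
        arrived = sigma + rho * t
        delay = max(0.0, latency + arrived / rate - t)
        worst = max(worst, delay)
    return worst


if __name__ == "__main__":
    # Hypothetical parameters: burst 10 units, arrival rate 2, service rate 5, latency 1.
    print(classical_delay_bound(10.0, 2.0, 5.0, 1.0))
    print(horizontal_deviation_numeric(10.0, 2.0, 5.0, 1.0))
```

Both functions should agree for stable parameters, since the deviation is largest at the instant the burst arrives.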

cs.DC 2026-06-12

Elastic GPU groups raise DiT throughput up to 6 times

by Xinwei Qiang, Yifan Hu +7 more

GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving

A runtime reassigns GPUs to running diffusion requests and forms new communication groups in microseconds, cutting latency 95 percent and SL

Diffusion Transformers (DiTs) have become the dominant architecture for image and video generation, creating growing demand for efficient DiT serving. Existing systems assign each request a fixed parallel configuration throughout its lifetime. However, DiT workloads exhibit substantial heterogeneity across requests, execution stages, and system conditions, making static parallelism inefficient and often leading to poor GPU utilization and degraded service quality. This paper argues that DiT serving should treat GPU parallelism as a first-class schedulable resource. We present GF-DiT, a policy-programmable runtime for elastic DiT serving that dynamically adapts the parallelism of running requests according to workload demands and service objectives. GF-DiT introduces an asynchronous execution abstraction that decomposes requests into independently schedulable trajectory tasks and enables online GPU reallocation. To make elastic parallelism practical, GF-DiT further proposes group-free collectives, a lightweight communication abstraction that supports low-overhead online formation and reconfiguration of arbitrary execution groups. We implement GF-DiT in vLLM-Omni and evaluate it on representative image and video diffusion workloads. Compared with fixed-pipeline execution with static parallelism, GF-DiT improves throughput by up to 6.01$\times$, reduces mean latency by up to 95%, lowers SLO violation rates by up to 90%, and reduces communication-group setup overhead from 778 ms to approximately 60 $\mu$s. Our code is available at https://github.com/SJTU-Liquid/GF-DiT.

cs.PL 2026-06-11

nomp framework turns user metadata into domain-optimized GPU code

by Thilina Ratnayaka, Kaushik Kulkarni +9 more

nomp: A Framework for Building Domain Specific Compilers

Pragma model plus runtime aims to reuse proven patterns so productivity rises without losing performance or portability

Low-level GPU programming models (CUDA, HIP, OpenCL, etc.) provide detailed control of the data flow and execution plan of a program in order to extract close-to-metal performance. However, they have a steep learning curve due to the intricacies of their syntax and semantics, which reduces programmer productivity. On the other hand, high-level models (OpenMP, OpenACC, etc.) that serve as abstractions over the low-level models aim to improve programmer productivity, but achieving performance on par with the low-level models is a challenge. There are inherent trade-offs between productivity, portability, and performance in both approaches, and there is no one-size-fits-all solution that achieves all three simultaneously. However, we believe there is room to improve programmer productivity without sacrificing performance and portability by reusing optimization patterns specific to a given domain. To this end, we propose nomp: a framework for building domain-specific compilers. nomp consists of a pragma-based programming model and a runtime capable of code transformation and generation based on user-provided metadata.

cs.PF 2026-06-11

Large model removed from runtime for 131 tokens per second on 8GB laptop

by Myeong Jun Jo

The Brain That Goes Quiet: Serving a Large Model's Knowledge at 131 Tokens per Second on an 8 GB Laptop by Removing the Large Model from the Runtime Path

Offline knowledge store plus BM25 router lets 1B model answer in 518 ms while keeping the large model idle

In earlier work I showed that a 35B-class Mixture-of-Experts model can be loaded and executed on a consumer laptop with 8 GB of GPU memory. That result solved a placement problem and immediately exposed a different one: even correctly placed, the large model needed roughly four seconds to answer, because it was still being invoked at every query. This paper documents what happened when I stopped invoking it. During an offline phase, the large model reads source documents and writes verified answer entries into a structured knowledge store; at runtime, only a lightweight router, a deterministic renderer, and a 1B-class model are active. On the same 8 GB laptop, end-to-end response time fell from approximately 4,465 ms to 518 ms, effective end-to-end throughput rose from 15.7 to 131 tokens per second, and the small model's streaming decode rate held at 226-237 tokens per second with a time-to-first-token of 29-62 ms. The bottleneck is structural: three different large models (Qwen, Gemma, and GLM class) all showed the same multi-second runtime cost, and all three produced usable knowledge stores offline. On a 563-entry store built from seventeen real documents, keyword routing collapsed to 1.5% top-1 accuracy while BM25-based routing reached 92.8% (99.4% top-3), and a confidence gate raised effective top-1 to 98.0% by escalating 12.3% of queries. Exact-match fidelity of the small model ranged from 9/9 to 0/9 across envelope formats carrying identical content. A 16-case verification gate blocked all ten corrupted entries while admitting all six supported ones.

cs.DC 2026-06-11

HPX async tasks run up to 26% faster than OpenMP on Cholesky

by Alexander Strack, Alexander Van Craen +1 more

From Fork-Join to Asynchronous Tasks: Parallelizing Tiled Cholesky Decomposition with OpenMP and HPX

Explicit dependencies cut barriers and overhead versus fork-join models on 128-core AMD Zen 2 node.

Fork-join parallelism, popularized by OpenMP, remains the dominant model for shared-memory parallel programming, but its implicit synchronization barriers can penalize algorithms with inhomogeneous workloads. Asynchronous many-task (AMT) runtimes sidestep these barriers by expressing work as a dependency graph of fine-grained tasks. Yet, the actual performance benefit over a carefully written fork-join baseline is rarely quantified. In this work, we introduce Cholesky-Bench and use it to revisit the tiled Cholesky decomposition, a canonical irregular kernel, comparing four parallelization variants of the right-looking algorithm across two runtimes: the OpenMP implementations shipped with GCC and LLVM, and the HPX AMT runtime. The variants span classical fork-join, a collapsed fork-join that exposes additional inner-loop parallelism, synchronous tasking, and asynchronous tasking with explicit data dependencies. We benchmark all eight combinations on a dual-socket 128-core AMD Zen 2 node across multiple tile sizes and problem sizes. Our results show that across all variants, HPX outperforms OpenMP at the optimal tile size by 15%-30%. Specifically, asynchronous HPX tasks are up to 26% faster than their OpenMP counterparts, and exhibit roughly 3.8x smaller task overhead. Furthermore, the collapsed fork-join variants close most of the gap to synchronous tasking. Removing redundant synchronization barriers yields an additional improvement of 7% (OpenMP) to 14% (HPX). A GCC-versus-LLVM comparison further reveals compiler-specific differences in fork-join scheduling and task-creation overheads.

cs.DC 2026-06-11

LLM costs swing up to 36x on identical GPUs with request rate

by Chitral Patil

Beyond Per-Token Pricing: A Concurrency-Aware Methodology for LLM Infrastructure Cost Estimation

Calculators that fix utilization as an input understate self-hosting cost by 1/U, most at low enterprise loads.

Every public LLM cost calculator we surveyed treats GPU utilization as a fixed input -- entered by the user, baked in as a preset, or silently assumed at 100% -- never measured against the operator's actual load. We show that this assumption is the dominant source of error: on identical H100 hardware, effective cost spans \$0.21 to \$15.25 per million output tokens, an underutilization penalty of 2.5-24x across low-to-moderate enterprise loads (1-10 rps) and up to 36.3x near idle -- driven by one operator-controlled variable, offered request rate lambda, which sets in-flight concurrency via Little's Law and which no open-source calculator exposes. Because calculators take utilization as a user-supplied input, any utilization-naive estimate understates true cost by exactly 1/U, systematically mispricing self-hosting -- most severely over-selling it for low-traffic workloads. We propose a measurement methodology that parameterizes the relationship as C_eff = f(H, M, Q, lambda, L), validate it with 42 benchmarks across dense, ultra-sparse MoE, and sparse MoE models, and release vllm-cost-meter, an open-source cost meter that attaches to a live vLLM server and reports real \$/M-tokens against the operator's own traffic. We further show that FP8 quantization benefits the MoE architectures we tested roughly 2.2-2.4x more than the dense model (+69 to +74% vs. +31% peak throughput; n=3, broader validation needed), and our data are consistent with active parameter count, not total model size, being a primary predictor of saturation economics. To rule out single-hardware confounding we repeat the core sweep on A100 80GB PCIe (56 runs): the load-driven spread reproduces at 7.0-11.4x, the active-parameters ordering survives at FP8, and the dense-FP8 advantage inverts on silicon without native FP8 tensor cores -- a hardware-conditional caveat the framework already accommodates.
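The 1/U understatement described above can be illustrated with a few lines of arithmetic. The Python sketch below assumes the simplest possible cost model, namely GPU price per hour divided by delivered output tokens per hour, and uses Little's Law ($L=\lambda W$) for in-flight concurrency. It is not the paper's C_eff function or the vllm-cost-meter tool, whose definitions are not given in the abstract, and all function names and numbers are hypothetical.

```python
def in_flight_concurrency(request_rate: float, mean_latency_s: float) -> float:
    """Little's Law: mean requests in flight L = lambda * W."""
    return request_rate * mean_latency_s


def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """Assumed cost model: hourly GPU price spread over the tokens actually produced."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600.0
    return gpu_hourly_usd / tokens_per_hour * 1_000_000.0


def understatement_factor(peak_tokens_per_second: float,
                          delivered_tokens_per_second: float) -> float:
    """Ratio of true cost to a cost computed at peak throughput, equal to 1 / U
    where U = delivered / peak."""
    return peak_tokens_per_second / delivered_tokens_per_second


if __name__ == "__main__":
    price = 3.60        # hypothetical USD per GPU-hour
    peak = 1000.0       # hypothetical saturated output tokens per second
    delivered = 100.0   # hypothetical output tokens per second at the offered load
    naive = cost_per_million_tokens(price, peak)
    actual = cost_per_million_tokens(price, delivered)
    print(f"naive ${naive:.2f}/M, actual ${actual:.2f}/M, "
          f"factor {understatement_factor(peak, delivered):.1f}x")
    print(f"in flight: {in_flight_concurrency(5.0, 2.0):.1f} requests")
```

Under this assumed model the GPU bill is fixed per hour, so every drop in delivered throughput raises the per-token cost proportionally.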

cs.AI 2026-06-11

Token spend in AI does not equal economic value

by Quanyan Zhu

AI Tokenomics: The Economics of Tokens, Computation, and Pricing in Foundation Models

Value depends on productivity, workflow position, hidden steps, and downstream effects instead of raw counts.

Tokens have become the practical accounting unit for modern foundation model services, linking information processing, computation, memory use, energy expenditure, pricing, and economic value. This paper develops a framework for AI tokenomics: the study of how tokens are generated, consumed, priced, allocated, and optimized across AI systems. We connect token-level technical costs to workflow-level production functions, enterprise resource allocation, measurement and instrumentation methods, and emerging market-design questions. The framework shows that token expenditure and economic value are distinct: value depends on marginal productivity, workflow position, hidden reasoning activity, risk, and downstream propagation effects. The paper concludes by identifying open research directions in hidden-token measurement, empirical calibration, token productivity, dynamic allocation, and token-based markets.

cs.GR 2026-06-11

New renderer framework needs only hundreds of Python lines

by Steve Rhyner, Sankeerth Durvasula +8 more

XPR: An Extensible Cross-Platform Point-Based Differentiable Renderer

XPR breaks rendering into modular parallel operations that XLA compiles to GPUs, TPUs and CPUs for methods like 3DGS.

Point-based differentiable rendering underpins modern 3D reconstruction, novel-view synthesis, and learning-based graphics pipelines, but developing new rendering methods often requires extensive low-level implementation, hardware-specific kernels, and manually written backward passes. This limits rapid prototyping, reproducibility, exploration, and deployment, especially across diverse hardware platforms. This paper presents XPR, an extensible cross-platform framework for point-based differentiable rendering. XPR introduces a high-level programming interface that separates method-specific logic from the shared rendering pipeline, allowing users to implement new methods in a few lines of code. Its pipeline decomposes rendering into modular, statically shaped parallel operations that can be lowered by a cross-platform compiler to GPUs, TPUs, CPUs, and other ML accelerators. We demonstrate implementations of 3DGS, 3DGUT, and LinPrim, each in only a few hundred lines of Python code and each compilable to a range of hardware platforms with the XLA compiler. These results show that XPR enables fast experimentation and portable execution for emerging point-based differentiable rendering systems.

cs.DC 2026-06-10

Fused kernels cut LLM prefilling latency 2x on AMD NPUs

by Wesley Pang, Gregory Hyegang Jun +2 more

TileFuse: A Fused Mixed-Precision Kernel Library for Efficient Quantized LLM Inference on AMD NPUs

TileFuse maps W4A16 and W8A16 directly onto XDNA2 for 64% lower energy use in Ryzen AI end-to-end tests

With the growing demand for on-device LLM inference, edge SoCs increasingly integrate NPUs to improve performance and energy efficiency under tight power and thermal budgets. However, practical LLM deployment on current client NPUs remains difficult: widely used quantization formats such as AWQ do not map cleanly onto many existing NPU software stacks, which are often proprietary and expose limited low-level control. In this work, we present TileFuse, a close-to-metal mixed-precision kernel library for AMD XDNA2 NPUs that targets GEMM/GEMV-based operators in quantized LLM inference. TileFuse brings practical low-bit formats such as AWQ-style W4A16 and W8A16 directly onto XDNA2, rather than forcing the model to be reshaped around an NPU-specific quantization scheme. TileFuse co-designs weight layout, metadata placement, mixed-precision microkernels, and array-level dataflow. Specifically, it fuses unpacking, dequantization, and GEMM/GEMV execution into a single kernel flow, introduces an interleaved pre-tiling layout that supports GEMM dimensions up to 32K, and redesigns GEMV dataflow to utilize the full 4x8 AIE array. Across kernel-level evaluations, TileFuse improves performance by up to 121.6% for GEMM and 281% for GEMV over full-precision baselines, while delivering more than 2x performance and energy-efficiency gains over strong iGPU baselines on GEMM. In end-to-end LLM experiments on Ryzen AI laptops, TileFuse achieves up to 2.0x lower prefilling latency with more than 64.6% lower energy consumption. Together, these results show that XDNA2 is a practical target for AWQ-style edge LLM inference and that native NPU support for off-the-shelf quantization can make NPUs substantially more usable in real client deployments.

cs.AR 2026-06-10

LLM framework produces working FPGA designs for vector math and convolution

by Vinamra Sharma, Xingjian Fu +2 more

Towards Autonomous Accelerator Design: FPGA Accelerator Generation with SECDA

SECDA-DSE automates design space exploration by using retrieval and reasoning to suggest hardware parameters that synthesize and execute suc

Designing FPGA-based accelerators for modern artificial intelligence workloads requires exploring a large and complex hardware design space that involves architectural parameters, data flow strategies, and memory hierarchies, making the process very time consuming. While existing methodologies such as SECDA enable rapid hardware-software co-design through SystemC simulation and FPGA execution, identifying efficient accelerator configurations remains a largely manual process requiring extensive domain knowledge. SECDA-DSE is a framework that integrates Large Language Models (LLMs) into the SECDA ecosystem to guide design space exploration (DSE) of FPGA-based accelerators. It combines a structured DSE Explorer for generating candidate architectures with an LLM Stack that performs reasoning-guided exploration using retrieval-augmented generation and chain-of-thought prompting, coupled with a feedback loop for iterative and reinforced refinement. Building on our previous work introducing SECDA-DSE, this paper extends its evaluation by generating three accelerator designs, including element-wise vector multiplication, 2D convolution, and matrix transpose, and performing end-to-end execution on FPGA hardware. The results show that SECDA-DSE can generate SECDA-compliant accelerator designs that are successfully synthesized and executed on FPGA hardware. Furthermore, the generated designs capture kernel-specific trade-offs between compute parallelism and data movement, highlighting the potential of LLM-guided exploration to adapt architectural configurations across diverse workloads while reducing exploration time and the need for extensive human expertise.

cs.LG 2026-06-10

Flash-GMM kernel runs GMMs 20x faster without full matrix

by Gal Bloch, Ariel Gera +3 more

Flash-GMM: A Memory-Efficient Kernel for Scalable Soft Clustering

Single-pass Triton code fits 100x larger datasets and cuts ANN distance computations by 1.7x when used in IVF.

We present Flash-GMM, a fused Triton kernel for efficient computation of Gaussian Mixture Models (GMMs) over large-scale data in a single GPU pass. By eliminating the need to materialize the full responsibility matrix in GPU memory, Flash-GMM achieves a 20$\times$ speedup over existing implementations and enables training on datasets more than 100$\times$ larger than previously feasible on one device. To demonstrate its impact, we integrate Flash-GMM into the IVF coarse quantizer for approximate nearest-neighbor (ANN) search. We show that soft GMM clustering is now a viable drop-in replacement for $k$-means, and that GMM responsibilities can be leveraged to assign border vectors to multiple clusters. Our approach reaches fixed recall targets with up to $1.7\times$ fewer distance computations, or equivalently, yields $+2$--$12$ recall@10 at matched computational cost. We release the kernel as an open-source project.
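The idea of avoiding the full responsibility matrix can be sketched in plain Python for a one-dimensional mixture. This is not the Flash-GMM Triton kernel; it is a minimal illustration, under the assumption of scalar data and diagonal (here scalar) variances, of how one EM step needs only per-component running sums. The function name `streaming_em_step` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def streaming_em_step(data: Sequence[float],
                      weights: Sequence[float],
                      means: Sequence[float],
                      variances: Sequence[float]
                      ) -> Tuple[List[float], List[float], List[float]]:
    """One EM iteration for a 1-D GMM in a single pass over the data.

    Responsibilities are computed per point and folded straight into
    per-component sums, so memory is O(k) rather than O(n * k).
    Assumes every component receives nonzero total responsibility.
    """
    k = len(weights)
    resp_sum = [0.0] * k   # sum of responsibilities per component
    x_sum = [0.0] * k      # responsibility-weighted sum of x
    x2_sum = [0.0] * k     # responsibility-weighted sum of x^2
    n = 0
    for x in data:
        log_p = [math.log(weights[j])
                 - 0.5 * math.log(2.0 * math.pi * variances[j])
                 - (x - means[j]) ** 2 / (2.0 * variances[j])
                 for j in range(k)]
        m = max(log_p)                         # log-sum-exp for stability
        p = [math.exp(v - m) for v in log_p]
        z = sum(p)
        for j in range(k):
            r = p[j] / z                       # responsibility, used then discarded
            resp_sum[j] += r
            x_sum[j] += r * x
            x2_sum[j] += r * x * x
        n += 1
    new_weights = [s / n for s in resp_sum]
    new_means = [x_sum[j] / resp_sum[j] for j in range(k)]
    new_vars = [max(x2_sum[j] / resp_sum[j] - new_means[j] ** 2, 1e-6)
                for j in range(k)]
    return new_weights, new_means, new_vars


points = [-1.0, -1.2, -0.8, 5.0, 5.2, 4.8]
w, mu, var = streaming_em_step(points, [0.5, 0.5], [0.0, 4.0], [1.0, 1.0])
```

On a GPU the same accumulation would be done per tile with a reduction across tiles; the sketch only shows why the n-by-k matrix is not needed.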

cs.CL 2026-06-10

NPU runs full RAG at 4x lower energy than CPU on Snapdragon X Elite

by Zhiyuan Cheng, Longying Lai

Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite

End-to-end embedding, reranking and generation complete with no quality drop and 4x faster queries versus CPU baseline

Retrieval-Augmented Generation (RAG) pipelines are compute-intensive, combining embedding, retrieval, reranking, and large language model (LLM) generation. Running them entirely on-device benefits privacy, latency, and offline use, but the energy cost of CPU inference is a major barrier. We present what is, to our knowledge, the first end-to-end RAG pipeline that runs all neural stages -- embedding, reranking, and LLM generation -- on the Qualcomm Hexagon NPU of the Snapdragon X Elite. Profiling on a Dell XPS 13 laptop, we compare NPU-accelerated RAG against CPU and OpenCL/Adreno GPU baselines on indexing and query workloads. On indexing, the NPU achieves 9.1x higher embedding throughput and 12.3x less system energy. On a 120-query Wikipedia-passage benchmark, it delivers 18.1x faster LLM prefilling, 4.0x lower end-to-end query latency, and 4.0x less system energy than the CPU baseline; the same workload on the integrated GPU is 1.7x slower than CPU and uses 6.5x more energy than the NPU. A GPT-4.1 LLM-as-judge evaluation finds NPU answer quality on par with CPU and GPU within evaluator noise (mean 9.32 vs. 8.95 vs. 9.03 on a 1-10 rubric), with 86.7% of queries scoring identically across all three backends. On the Snapdragon X Elite / Hexagon class of laptop SoC, the NPU thus enables practical, energy-efficient on-device RAG without quality regression -- a sustainable path toward green edge intelligence that we expect to generalize to comparable mobile NPUs (Apple Neural Engine, Intel NPU, MediaTek APU) as their software stacks mature.

cs.AR 2026-06-09

83-format catalog supplies bit-exact packs for FP8 and BF16

by Dmitrii Vasilev

An 83-Format Numeric Catalog with Bit-Exact Conformance Vectors: A Vendor-Neutral Reference for FP8, BF16, MXFP4, and Microscaling Formats

JSON documents and IEEE P3109 cross-walk give engineers a shared reference to diagnose numeric divergences across accelerators.

Numeric format proliferation in machine learning hardware -- FP8 (E4M3 and E5M2), BF16, MXFP4, microscaling block formats, and dozens of research variants -- has outpaced the availability of vendor-neutral, bit-exact reference material. Engineers porting models across accelerators encounter silent divergences that are difficult to diagnose without a shared ruler. This paper describes a catalog of 83 numeric formats spanning 13 families, a suite of six bit-exact conformance packs covering GF16, MXFP4 element, BF16, FP8 E4M3, FP8 E5M2, and E8M0 block scale, and an IEEE P3109 v3.2.0 cross-walk that maps each pack to its corresponding standards-track configured format. Each pack is a self-contained JSON document with a SHA-256 fingerprint, a shared row schema, and an anchor vector that encodes 3.0 -- the identity phi^2 + 1/phi^2 = 3 -- as a cross-pack sanity check. Packs are cross-validated against ml_dtypes 0.5.4 (Google/JAX); any divergence is documented explicitly and interpreted as a spec-permitted interpretation gap rather than hidden. The work is framed as registry filling: it does not propose new formats, make model-accuracy claims, or assert superiority over any vendor's implementation. All artifacts are publicly available at https://github.com/gHashTag/t27 under an open license.
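To make the anchor value concrete, the sketch below checks the identity phi^2 + 1/phi^2 = 3 numerically and encodes 3.0 in three of the listed formats. It does not read the catalog's JSON packs, whose schema is not described here. The helper names are hypothetical, and the encoder assumes the usual IEEE-style bias (7 for E4M3, 15 for E5M2) and handles only positive, exactly representable normal values.

```python
import math
import struct


def bf16_bits(x: float) -> int:
    """BF16 bit pattern obtained by truncating the float32 encoding.

    Truncation is exact for values such as 3.0 that fit in BF16's 8-bit significand.
    """
    (f32,) = struct.unpack(">I", struct.pack(">f", x))
    return f32 >> 16


def minifloat_bits(x: float, exp_bits: int, man_bits: int) -> int:
    """Encode a positive normal value that is exactly representable.

    No rounding, subnormals, or special values are handled (illustrative only).
    """
    if math.isnan(x) or math.isinf(x) or x <= 0:
        raise ValueError("only positive finite values are supported")
    m, e = math.frexp(x)                       # x = m * 2**e with 0.5 <= m < 1
    frac = (2.0 * m - 1.0) * (1 << man_bits)   # fractional part of the significand
    biased = (e - 1) + ((1 << (exp_bits - 1)) - 1)
    if frac != int(frac) or not 0 < biased < (1 << exp_bits):
        raise ValueError("value is not exactly representable as a normal number")
    return (biased << man_bits) | int(frac)


phi = (1.0 + math.sqrt(5.0)) / 2.0
anchor = phi ** 2 + 1.0 / phi ** 2             # equals 3 up to rounding error

bf16 = bf16_bits(3.0)                          # 0x4040
e4m3 = minifloat_bits(3.0, exp_bits=4, man_bits=3)  # 0x44
e5m2 = minifloat_bits(3.0, exp_bits=5, man_bits=2)  # 0x42
```

These bit patterns are computed from the format definitions, not copied from the published packs, so they are a cross-check a reader could run rather than a statement of what the packs contain.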

cs.LG 2026-06-09

Single kernel runs entire Llama forward pass in one launch

by Jaber Jaber, Osama Jaber

AutoMegaKernel: A Statically-Checked Agent Harness for Self-Retargeting Megakernel Synthesis

AutoMegaKernel harness lets agents synthesize retargetable megakernels with static safety checks that match reference outputs.

AutoMegaKernel (AMK) compiles a HuggingFace Llama-family model into a single persistent cooperative CUDA kernel that runs the whole forward pass in one launch, with no per-model hand-written CUDA. The contribution is the system, not raw speed. A frozen schedule-IR validator statically certifies deadlock-freedom and race-freedom via static graph checks (not a mechanized proof), so an unsafe agent-proposed schedule is rejected before launch: across 7,160 adversarial schedules (6,091 unsafe) it had zero false-accepts and accepted all 360 real lowerings. The same source retargets sm_80/sm_90/sm_120 from one codebase, auto-generates correct megakernels for 10 of 10 supported models, and on a real SmolLM2-135M checkpoint reproduces HuggingFace greedy decode token-for-token (perplexity match 2.5e-7). An unattended, agent-drivable autoresearch loop self-improves the megakernel over its own baseline (1.25-1.72x). A search-found int8 (W8A16) megakernel beats CUDA-graphed cuBLAS bf16 at batch-1 decode across NVIDIA's datacenter inference fleet: L4 up to 1.33x, the current-gen L40S 1.25-1.27x, A10G up to 1.08x at scale, and the consumer RTX 5090 1.19-1.23x. The ordering is not a clean function of bandwidth (the 864 GB/s L40S beats the 600 GB/s A10G); the divide is inference-class vs training-class. AMK trails cuBLAS on the high-bandwidth training-class A100/H100, where the harness localizes the cross-SM-sync bottleneck; we report the gap plainly. This is a precision-asymmetric (W8A16 vs bf16) comparison at decode position 0; the largest real checkpoint is TinyLlama-1.1B. Code and the harness: https://github.com/RightNow-AI/AutoMegaKernel

cs.AI 2026-06-09

Contrastive pass fixes false causal links from embedding similarities

by Suraj Biswas, Saurabh Gupta +1 more

Correlation Is Not Enough: Embedding Human Metadata for Individual Causal Discovery

Unrelated events like cortisol levels and stock volatility score 0.83 similarity; a pass over 72k pairs plus BODHI hard negatives fix domain

Ask a pretrained biomedical language model whether "cortisol 28 ug/dL" and "stock-market volatility" are related, and it returns a cosine similarity of 0.83 on a scale where 1.0 means identical. The two share no mechanism. This is not a corner case: every off-the-shelf biomedical encoder we tested (BioBERT, PubMedBERT, BioM-ELECTRA) scores unrelated cross-domain pairs between 0.76 and 0.92 when the answer should be near zero. Accuracy on cross-domain discrimination is 0%. Retrieval systems survive this, because a language model downstream filters the noise. A Large Behavioural Model (LBM), a foundation model whose subject is a person rather than a sentence, does not: it reasons over a graph of a user's life and treats embedding proximity as evidence that two events are causally linked. False proximity writes a false causal edge, and everything downstream inherits the error. Here, embedding geometry is not a tuning knob; it is correctness. We report the fix. A contrastive pass over 72,034 pairs raises PubMedBERT BIOSSES correlation from 0.633 to 0.828 and within-vs-across-domain separation from 1.05x to 1.63x. A second pass, BODHI, mines hard negatives from edges absent in a biomedical knowledge graph and lifts separation to 2.30x and the discrimination gap to +0.392, at a 4.5% BIOSSES cost. On an Intel Xeon 6737P with AMX, OpenVINO cuts single-query latency from 1367 ms to 10 ms (133x) and reaches 555 sentences/sec. One finding contradicts standard advice: FP16 beats INT8 on this silicon at every serving batch size, and we explain why. The same model on a no-AMX Ice Lake instance runs 13-27x slower. We release the benchmark suite, training corpora, the BODHI generator, and the OpenVINO scripts.

cs.DC 2026-06-09

Aging scheduler cuts LLM prefill latency over 10%

by Haoxin Liu, Jiayi Wang +2 more

Fairness-Aware and Latency-Controllable Scheduling for Chunked-Prefill LLM Serving

Dynamic priorities from wait time and remaining work plus latency targets replace FCFS to lower mean and tail response times.

As large language models (LLMs) are increasingly deployed with highly heterogeneous workloads, chunked-prefill execution has emerged as a mainstream serving architecture. Balancing scheduling fairness and latency stability in such environments is critical; otherwise, severe head-of-line blocking and request starvation will degrade user experience. However, existing systems rely on rigid First-Come, First-Served (FCFS) policies and static token budgets, leading to fairness degradation and unpredictable latency jitter. To address these issues, we propose a fairness-aware and latency-controllable scheduling framework for chunked-prefill LLM engines. Specifically, we design a lightweight aging-based scheduling policy that dynamically calculates priorities using accumulated waiting time and remaining prefill work. Furthermore, we develop Latency-Prediction-Based Request Scheduling (LPRS) and Active Prefill Control (APC) to replace static budgets with target-time constraints and actively regulate prefill concurrency. We evaluated our scheduling framework on NVIDIA GPUs and Ascend accelerators using real-world workloads. Results show the aging policy reduces mean end-to-end latency by over 10% compared to FCFS. Moreover, LPRS and APC significantly reduce P99 tail latency and suppress prefill fragmentation, confirming that the structural prefill control and the temporal latency constraints are fundamentally complementary. All code has been released on GitHub.
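An aging-based priority of the kind described above might look like the following sketch. The abstract states only that priorities depend on accumulated waiting time and remaining prefill work; the linear form, the weights `ALPHA` and `BETA`, and all names below are assumptions for illustration, not the paper's policy.

```python
from dataclasses import dataclass
from typing import Sequence

ALPHA = 1.0    # hypothetical credit per second of waiting
BETA = 0.001   # hypothetical penalty per remaining prefill token


@dataclass
class Request:
    req_id: str
    arrival_time: float
    remaining_prefill_tokens: int


def priority(req: Request, now: float) -> float:
    """Higher is better: waiting raises priority, remaining work lowers it."""
    waited = now - req.arrival_time
    return ALPHA * waited - BETA * req.remaining_prefill_tokens


def pick_next(queue: Sequence[Request], now: float) -> Request:
    """Select the request to receive the next prefill chunk."""
    return max(queue, key=lambda r: priority(r, now))


long_req = Request("long", arrival_time=0.0, remaining_prefill_tokens=8000)
early_short = Request("short-a", arrival_time=2.0, remaining_prefill_tokens=500)
late_short = Request("short-b", arrival_time=10.0, remaining_prefill_tokens=500)

first = pick_next([long_req, early_short], now=2.0)   # short request wins early
later = pick_next([long_req, late_short], now=10.0)   # aged long request wins
```

With these made-up weights a fresh short request is preferred at first, but the long request accumulates enough waiting credit that later arrivals can no longer starve it, which is the behavior FCFS and pure shortest-job-first each fail to give on their own.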

cs.FL 2026-06-08

GLR parsers show only 3x slowdown vs LR(1) on deterministic grammars

by Huan Vo, Danushka Liyanage +3 more

An Empirical Comparison of General Context-Free Parsers

Benchmark of six general algorithms on 22 real grammars finds narrow variance and positions GLR as practical default.

Parsing underpins a vast range of software engineering tasks, from compilers and static analyzers to language servers and fuzz testing tools. Yet most parsers deployed in practice are deterministic (LL or LR), forcing developers not only to contort their grammars to fit the parser, but also to simplify the very languages they design, sacrificing expressiveness for the sake of parseability. General context-free parsers eliminate this constraint. Yet, despite decades of algorithmic development, no rigorous head-to-head comparison exists across the major families of parsing algorithms. We present the first unified, controlled benchmark of six generalized parsing algorithms: CYK, Valiant, Earley, GLL, RNGLR, and BRNGLR, plus deterministic LL(1) and LR(1) baselines, all implemented in Rust with shared data structures and parse-tree extraction, and evaluated across 22 grammars ranging from simple expressions to full C++ and Java. Our results show that the cost of generality is lower than widely assumed. On deterministic grammars, the GLR family incurs only a 3x median slowdown over LR(1), with a narrow and predictable variance. GLR is the clear performance winner among generalized parsers and a practical default choice for software engineering tools.

cs.LG 2026-06-08

Algebraic rewrite cuts attention data movement from quadratic to linear

by Lenore Mullin, Gaetan Hains

Attention at the Theoretical Minimum: A Mathematics of Arrays Framework for Memory-Optimal Transformer Kernels

By eliminating every intermediate array before code is written, the method reaches the theoretical minimum traffic of O(n_dk + n_dv).

The attention mechanism is the dominant computational bottleneck in modern transformer-based AI. Its standard implementation incurs quadratic memory traffic in the sequence length $n$, and DRAM accesses cost 100--1000$\times$ more energy than arithmetic operations on contemporary hardware, so any analysis focused solely on FLOP counts fundamentally mischaracterises the bottleneck. We present a Mathematics of Arrays (MoA) reformulation of scaled dot-product attention and its numerically stable softmax, deriving a Denotational Normal Form (DNF) that eliminates all intermediate arrays -- including the implicit transposed-key buffer and every softmax temporary -- by algebraic construction rather than empirical tuning. The DNF achieves $O(n_{dk} + n_{dv})$ data movement versus $O(n^2 + n_{dk} + n_{dv})$ for the standard implementation, where $n$ is the sequence length, $dk$ is the key dimensionality and $dv$ the value dimensionality, and is verified numerically against PyTorch at full double-precision floating-point on concrete inputs. Unlike hardware-specific accelerators or empirical tiling schemes such as FlashAttention, MoA simultaneously provides array fusion, shape-transformation correctness, and predictive cost models from a single algebraic framework. Memory minimality is a theorem established before any code is written. A predictive performance model projects $2$--$100\times$ speedup and $2$--$50\times$ energy reduction, with the advantage widening at exascale. The derivation establishes a formally verified pipeline from Python specification through Operational Normal Form (ONF) and dimension-lifted hardware mapping, providing performance-portable AI kernels of direct relevance to DARPA edge-deployment and DOE exascale priorities.
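To show what computing attention without a materialized score row can look like, here is a pure-Python sketch for a single query. It uses the well-known online-softmax recurrence (the technique behind FlashAttention-style kernels), not the paper's MoA derivation or its DNF, and the function names are hypothetical. A naive reference that does build the score row is included for comparison.

```python
import math
from typing import List, Sequence


def naive_attention(q: Sequence[float],
                    keys: Sequence[Sequence[float]],
                    values: Sequence[Sequence[float]]) -> List[float]:
    """Reference: builds the full score row and softmax weights for one query."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[c] for w, v in zip(weights, values)) / z
            for c in range(len(values[0]))]


def streaming_attention(q: Sequence[float],
                        keys: Sequence[Sequence[float]],
                        values: Sequence[Sequence[float]]) -> List[float]:
    """Same result in one pass, keeping only a running max, sum, and output."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        score = scale * sum(a * b for a, b in zip(q, k))
        new_max = max(running_max, score)
        correction = math.exp(running_max - new_max)  # 0.0 on the first step
        w = math.exp(score - new_max)
        denom = denom * correction + w                # rescale old sum, add new term
        acc = [a * correction + w * vc for a, vc in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


q = [1.0, 0.0]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out_naive = naive_attention(q, K, V)
out_stream = streaming_attention(q, K, V)
```

The streaming version stores a fixed amount of state per query regardless of sequence length, which is the property that removes the quadratic intermediate.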

cs.PF 2026-06-08

Mixed-precision ANNS accelerator reaches 163x CPU speedup

by Mingkai Chen, Cheng Liu +4 more

ANNS-AMP: Accelerating Approximate Nearest Neighbor Search via Adaptive Mixed-Precision Computing

Cluster-level precision selection maintains top-k accuracy while cutting energy 1100x versus CPU baselines

Approximate nearest neighbor search (ANNS) is a critical kernel in modern applications such as LLMs and recommendation systems. However, its efficiency is fundamentally limited by the need to compute distances between a query and a massive number of high-dimensional vectors, most of which are non-neighbors. Existing approaches reduce redundancy via index optimization or early termination, but remain constrained by fixed-precision computation, leading to unnecessary arithmetic and memory bandwidth overhead. This paper presents ANNS-AMP, an adaptive mixed-precision framework and accelerator that adapts the precision of distance computation to the characteristics of queries and data distribution. The key insight is that different regions of the vector space require different levels of precision to preserve top-k accuracy. ANNS-AMP leverages the clustered structure of PQ-based indices and introduces a lightweight predictor to determine cluster-level precision at runtime based on features such as scale, radius, and query distance. To efficiently realize variable-precision execution, we design a bit-serial accelerator with a bit-interleaved data layout, enabling throughput to scale with reduced precision while mitigating memory bandwidth bottlenecks and load imbalance through a greedy scheduling strategy. Moreover, the runtime predictor can also reuse the bit-serial computing array for efficient runtime prediction and can be fitted to the ANNS pipeline without performance penalty. According to our experiments on representative datasets, ANNS-AMP achieves 163.76x, 10.57x, and 2.06x performance speedups on average, and reduces average energy consumption by 1100.00x, 39.41x, and 6.66x compared to CPU, GPU, and customized ANNS accelerator baselines, respectively, while maintaining accuracy loss below 2.7%. These results demonstrate that adaptive mixed-precision computing is a promising direction for efficient large-scale ANNS.

cs.PF 2026-06-08

Dataflow optimization parallelizes genome aligners across regions

by Shiv Sundram

Dependencies and Dataflow in Seed-Filter-Extend Pipelines

Synthesizing four prior aligners removes serial constraints so candidate regions run in parallel and local alignments move to GPUs without a

Comparing genomes is critical for discovering mutations, tracking evolutionary lineages, and advancing cross-species genomics. Fundamentally, this reduces to an O(n^2) string-matching dynamic programming (DP) problem, a challenge that has driven decades of performance research. However, executing a strict O(n^2) DP algorithm is computationally intractable for genomes spanning millions to billions of base pairs. Consequently, modern aligners rely on global heuristics to identify thousands of candidate similarity regions between species. Unfortunately, these methods are burdened by complex serial dependencies. Once candidate regions are identified, the pipeline executes localized DP alignments, which introduce their own non-trivial heuristics and irregular data dependencies. While parallelizing dense, two-dimensional DP is a well-studied problem, accelerating this end-to-end pipeline is significantly more challenging. Parallelizing across candidate regions and offloading irregular, heuristic-laden local alignments to modern hardware (such as GPUs) remains a major hurdle. In this work, we address the challenge of overcoming these serial bottlenecks by optimizing the global pipeline across regions. We take inspiration from four papers: LASTZ, SegAlign, Darwin-WGA, and SNAP, synthesizing findings across each to inform optimizations, which we either prototype or implement directly in LASTZ.

cs.AI 2026-06-05

Activation probe recovers 10% of robot task failures

by Josef Chen

AEGIS: A Backup Reflex for Physical AI

AEGIS hands control to a stronger policy only on steps flagged early by monitoring weak-policy activations, outperforming blind or random sw

Long-horizon robot manipulation tends to fail gradually: one bad step degrades the state, and the policy spirals into a basin from which it cannot recover. The failure is often visible before it happens. We introduce AEGIS (Activation-probe Early-warning, Gated Inference Switching), a selective escalation method that uses a lightweight probe on a weak policy's frozen activations to detect high-risk steps while there is still time to act. When the probe flags a step, control switches to a stronger separate policy, but only for the steps that need it. On LIBERO-Spatial, AEGIS recovers 10.1% of the trajectories the weak policy alone loses, versus 4.6% for budget-matched blind escalation and 5.1% for a random-trigger placebo. These gains are significant under one-sided exact paired McNemar tests with Holm-Bonferroni adjustment over three pre-registered contrasts: +5.4pp over blind escalation, p=8.5e-6; +5.0pp over random triggering, p=1.0e-4; paired-trajectory bootstrap CIs exclude zero. AEGIS activates the stronger policy on only 38% of steps, so the lever is timing rather than compute. The probe clears its precondition with an early-window AUROC of 0.764, 95% CI [0.70, 0.84], read from the weak-policy path over the first 30% of trajectory steps before any handoff. We pre-register the full analysis plan, including a conditional recovered-task-rate estimand and explicit kill criteria, and confirm the result on 700 common-random-number episodes per arm, with nA-fail=646.

cs.DS 2026-06-05

PivCo-Huffman decodes faster than prior Huffman codecs

by Marcin Zukowski

PivCo-Huffman

Wavelet tree pivots enable SIMD operations and let ANS coding apply to skewed nodes for better ratios at high speed.

Huffman encoding has been an enduring technique for 70+ years, ubiquitous in compression algorithms since its invention. In this paper we propose a new approach to Huffman coding, based on a data structure from wavelet trees. The resulting pivot-coded Huffman (PivCo-Huffman) enables high-performance SIMD-friendly encoding and decoding operations. In our tests PivCo-Huffman consistently outperforms state-of-the-art Huffman codecs in decoding throughput. Additionally, we show how ANS-coding can be selectively applied to skewed nodes in this structure, yielding compression ratios approaching those of ANS-based codecs while preserving very high decompression speeds.

cs.PL 2026-06-04

Low-bit tagging remains fastest for symbolic workloads

by Stephen M. Watt

Look Before You Leap: Checking In on Type Tag Checking

Microbenchmarks on AArch64 and x86-64 show local bit operations beat heap reads for tags while NaN-boxing saves allocation for floats.

Tagging of generic dynamic values is important in symbolic-computation and dynamic-language systems, but the trade-offs change as machine architectures and workloads evolve. In particular, old folklore about boxed values, immediate values, and type tags must be recalibrated from time to time. We revisit the performance of badged object headers, low-bit tagging, and two NaN-boxing layouts on a range of platforms in use today, including AArch64 and x86-64 architectures from different manufacturers. The experiments isolate two distinct effects: the cost avoided by not heap-allocating common scalar values, and the cost avoided by obtaining tag information from the value word rather than by performing a heap read. The results show that several local bit operations are often cheaper than opening a heap object to obtain a tag or small value. Low-bit tagging remains the simplest and usually fastest choice for mostly symbolic workloads, while NaN-boxing is close in access cost and avoids the time and space of heap allocation for ordinary floating-point values.
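Low-bit tagging can be sketched with Python integers standing in for 64-bit machine words. The 3-bit tag width and the specific tag values below are assumptions chosen for illustration and are not the layouts measured in the paper. The sketch shows why a type check needs only a mask and compare on the value word, with no heap read.

```python
TAG_BITS = 3                       # assumed: 8-byte-aligned pointers leave 3 low bits free
TAG_MASK = (1 << TAG_BITS) - 1
TAG_POINTER = 0b000                # hypothetical tag: aligned heap address
TAG_FIXNUM = 0b001                 # hypothetical tag: small immediate integer
WORD_MASK = (1 << 64) - 1


def box_fixnum(n: int) -> int:
    """Store a 61-bit signed integer directly in the value word."""
    if not -(1 << 60) <= n < (1 << 60):
        raise OverflowError("integer does not fit in 61 bits")
    return ((n << TAG_BITS) | TAG_FIXNUM) & WORD_MASK


def is_fixnum(word: int) -> bool:
    """Type check using only local bit operations on the word."""
    return (word & TAG_MASK) == TAG_FIXNUM


def unbox_fixnum(word: int) -> int:
    """Recover the integer with a sign-preserving right shift."""
    if word >= 1 << 63:            # reinterpret the word as signed 64-bit
        word -= 1 << 64
    return word >> TAG_BITS


boxed = box_fixnum(-5)
aligned_address = 0x7F00_0000_1000  # hypothetical heap address, low bits zero
```

NaN-boxing follows the same principle but hides the tag and payload inside the unused bit patterns of IEEE 754 NaNs, so ordinary doubles need no boxing at all.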

cs.AR 2026-06-04

Cortex-M measurements set lower-bound timings for satellite AI inference

by Carlos Rafael Tordoya Taquichiri, Hans Dermot Doran +1 more

Quantized AI Inference on Constrained Embedded Platforms for Small-Satellite Settings

Characterization treats orchestration as an explicit choice and supplies estimates for multi-core quantized workloads under tight power and

In resource-constrained small-satellite settings, AI inference must operate under tight size, power, and payload budgets, which tend to limit onboard compute capability and data handling. These conditions motivate establishing a clear baseline for quantized AI inference under bounded compute and memory resources. To instantiate this baseline, a representative embedded-vision neural-network workload serves as the reference case. With this motivation, this paper presents a measurement-based characterization of quantized execution for this AI workload on highly constrained embedded platforms (for instance, Cortex-M), grounded as a lower-bound operating point. In this regime, scaling tends to rely on explicit orchestration rather than OS-managed, transparent multicore scheduling, and timing behavior is shaped by instruction efficiency and memory movement. As a result, the characterization provides a structured reference for estimating execution time across orchestrated configurations (e.g., multiple cores and/or devices), treating orchestration and architectural variation as explicit design choices. We report latency metrics alongside data-movement observations, and interpret these measurements in light of ALU/SIMD utilization under quantized arithmetic for the Cortex-M. Finally, we outline how this baseline provides a reference point for positioning the results against more space-typical embedded processor classes (e.g., LEON/NOEL-V).

cs.AR 2026-06-03

Reverse iteration and S=256 stop P-collapse in FP8 attention

by Reed Lau

P-Cast Precision in FP8 Attention: Sink-Induced Collapse and the Optimality of S=2⁸

Forward KV order underflows a normal-tail fraction of non-sink probabilities; the reverse order plus scale 256 guarantees none do.

FP8 (E4M3) acceleration for attention computation offers significant throughput gains, but the 3-bit mantissa introduces precision challenges when the softmax probability matrix $P$ is cast to FP8 before the $P \cdot V$ matrix multiplication. We analyze two implementation choices that affect output precision under the Attention Sink phenomenon: (1) the KV block iteration order, and (2) the static scaling factor applied to $P$ before casting. We show that forward KV iteration causes P-collapse -- to leading order a fraction $\Phi(\Delta + \delta_k - 6.93 - \ln S)$ of non-sink $P$ values underflow to zero, where the small shift $\delta_k \approx 1$ (for $k_{\text{sink}}{=}4$) is the expected within-sink-block score maximum -- and that reverse iteration removes it, with a zero-underflow guarantee when reverse is combined with $S{=}256$. We further give a constructive characterization of $S = 256 = 2^8$ as the static scale that simultaneously satisfies (i) bit-exact IEEE 754 scaling, (ii) the lower envelope of a sawtooth function $dp(S)$ over the E4M3 number line ($dp = 2^{-4}$, the minimum worst-case quantization step), and (iii) the maximum normal-range coverage among bit-exact ($2^k$) scales (a non-bit-exact scale such as $448$ attains slightly higher coverage; Sec. 5). Both optimizations are already deployed in FlashAttention-3/4 on engineering grounds; our contribution is a quantitative account of why these choices are good and a closed-form threshold $\Delta_c = 6.93 + \ln S - \delta_k$ for predicting kernel-level precision loss. Kernel-faithful experiments ($Q, K, V$ in FP32 to isolate the P-cast effect) show $3$-$10\times$ MSE improvement at moderate sink strengths, and paired tests confirm both fixes saturate to the same precision floor when combined -- which motivated updating the hpc-ops kernel from $S{=}1$ to $S{=}256$.
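The two closed-form expressions quoted above can be evaluated directly. The sketch below plugs numbers into the stated leading-order formulas for forward iteration; it does not simulate FP8 casting. The abstract does not define $\Delta$ precisely, so it is treated here as an opaque input, and the function names are hypothetical.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF, written Phi in the abstract."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def collapse_threshold(scale: float, delta_k: float = 1.0) -> float:
    """Delta_c = 6.93 + ln S - delta_k, as stated in the abstract."""
    return 6.93 + math.log(scale) - delta_k


def underflow_fraction(delta: float, scale: float, delta_k: float = 1.0) -> float:
    """Leading-order fraction Phi(Delta + delta_k - 6.93 - ln S) of non-sink
    P values that underflow under forward KV iteration."""
    return normal_cdf(delta + delta_k - 6.93 - math.log(scale))


threshold_s1 = collapse_threshold(1.0)       # 5.93 when delta_k = 1
threshold_s256 = collapse_threshold(256.0)   # about 11.48, since ln 256 is about 5.545
frac_s1 = underflow_fraction(8.0, 1.0)
frac_s256 = underflow_fraction(8.0, 256.0)
```

By construction the predicted fraction is exactly one half at $\Delta = \Delta_c$, and raising $S$ from 1 to 256 shifts the threshold up by $\ln 256$, so the same $\Delta$ yields a much smaller predicted underflow fraction.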

cs.LG 2026-06-03

FlashbackCL cuts temporal forgetting by 68% in federated learning

by Mubarak A. Ojewale, Adriana E. Chis +3 more

FlashbackCL: Mitigating Temporal Forgetting in Federated Learning

Decayed label counts and class-balanced replay raise accuracy 7-10% over Flashback on CIFAR-10 with 50 clients under temporal shifts.

Federated Learning (FL) of foundation and edge models increasingly targets deployments where client data distributions drift over time, yet existing forgetting-mitigation methods assume each client's distribution is stationary. Flashback, the strongest recent FL method against cross-client (spatial) forgetting, uses monotonically accumulating per-class label counts as a knowledge proxy; this proxy becomes miscalibrated under temporal distribution shift and anchors the global model to an outdated class balance. We formalise temporal forgetting in FL with a per-phase metric isolated from protocol-level fluctuations and propose Flashback Continual Learning (FlashbackCL), a drop-in extension of Flashback with (i) temporally-decayed label counts; (ii) a device-aware replay buffer with Class-Balanced Reservoir Sampling (CBRS); and (iii) server-side active coreset curation on the public distillation set. The results show that FlashbackCL achieves a 6.9% to 10.0% relative improvement over Flashback on CIFAR-10 with 50 clients and three controlled temporal shift modes, while simultaneously reducing temporal forgetting by up to 68%. A 5-variant ablation identifies CBRS replay as the critical component. FlashbackCL also improves on Flashback by 3.5 points on stationary CIFAR-100, suggesting that class-balanced replay regularises spatial heterogeneity as well as temporal shift.

cs.PF 2026-06-03

Network term makes cache-only LLM schedulers arbitrarily suboptimal

by Mubarak Adetunji Ojewale

NetKV: Network-Aware Decode Instance Selection for Disaggregated LLM Inference

NetKV uses a network cost oracle to pick decode instances and cuts TTFT up to 21 percent while lifting SLOs by 20 points.

Disaggregated LLM inference forces the KV cache to traverse the datacenter network before decoding begins, so transfer time enters directly into the Time to First Token (TTFT) budget. Current schedulers route on compute load and prefix-cache locality alone, ignoring the topological distance and dynamic congestion between prefill and decode instances. We close this gap with a thin operator-to-scheduler interface, the network cost oracle, and we prove that ignoring the network term renders cache-aware-only scheduling arbitrarily suboptimal as context length grows. NetKV, the O(|D|) per-request greedy that consumes this oracle, has tier rankings that are provably robust to stale telemetry. On a 64-GPU four-tier fat-tree simulator driven by Mooncake traces, NetKV reduces mean TTFT by up to 21.2% over round-robin and 17.6% over a tuned cache+load-aware scheduler, lifts SLO attainment by up to 20.1 percentage points, and keeps the Time Between Tokens overhead below 0.5 ms in every condition tested, with no changes to the transport, inference engine, or hardware.
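A per-request greedy over decode instances that consults a network cost oracle could be sketched as follows. The cost model (queueing delay plus estimated transfer time for the uncached part of the KV cache), the oracle signature, and every name below are assumptions for illustration; the abstract does not specify NetKV's scoring function or interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical oracle type: (source node, destination node, bytes) -> seconds
NetworkCostOracle = Callable[[str, str, float], float]


@dataclass
class DecodeInstance:
    name: str
    queue_delay_s: float           # assumed load estimate
    cached_prefix_fraction: float  # share of the KV cache already resident, 0..1


def select_decode_instance(prefill_node: str,
                           kv_bytes: float,
                           candidates: Sequence[DecodeInstance],
                           network_cost: NetworkCostOracle) -> DecodeInstance:
    """One linear scan over the candidates, i.e. O(|D|) work per request."""
    def estimated_delay(inst: DecodeInstance) -> float:
        bytes_to_move = kv_bytes * (1.0 - inst.cached_prefix_fraction)
        return inst.queue_delay_s + network_cost(prefill_node, inst.name, bytes_to_move)
    return min(candidates, key=estimated_delay)


# Toy oracle backed by a static bandwidth table in bytes per second (made-up numbers).
BANDWIDTH: Dict[Tuple[str, str], float] = {
    ("prefill-0", "near"): 50e9,   # same rack
    ("prefill-0", "far"): 5e9,     # across the fabric
}


def toy_oracle(src: str, dst: str, nbytes: float) -> float:
    return nbytes / BANDWIDTH[(src, dst)]


pool = [
    DecodeInstance("near", queue_delay_s=0.05, cached_prefix_fraction=0.0),
    DecodeInstance("far", queue_delay_s=0.01, cached_prefix_fraction=0.5),
]
short_ctx_choice = select_decode_instance("prefill-0", 4e7, pool, toy_oracle)
long_ctx_choice = select_decode_instance("prefill-0", 4e9, pool, toy_oracle)
```

In this toy setup the lightly loaded, cache-warm instance wins for a small KV cache, while the nearby instance wins once the cache is large enough for transfer time to dominate, which mirrors the abstract's point that the network term matters more as context length grows.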

cs.DC 2026-06-03

Object storage slashes stream shuffle costs over 40x

by Sören Henning, Otmar Ertl +1 more

BlobShuffle: Cost-Effective Repartitioning in Stream Processing Systems via Object Storage Exemplified with Kafka Streams

Batching records and forwarding compact notifications keeps 95th-percentile shuffle latency below 2 seconds at scale.

Shuffling or repartitioning data streams is an essential operation of state-of-the-art stream processing frameworks to support stateful workloads in a large-scale, distributed setting. In today's cloud deployments, however, shuffling can become a major cost driver due to substantial network traffic across multiple availability zones (AZs) as well as an operational burden when operating a high-throughput, strongly consistent messaging backbone at scale. We present BlobShuffle, a novel approach to cost-effective shuffling for stream processing systems that leverages cloud object storage as an intermediate exchange layer. Instead of sending all shuffled records directly, BlobShuffle groups records into batches, stores these batches in cloud object storage, and forwards only compact notifications. Downstream operators use these notifications to retrieve the relevant batches and extract the corresponding records. BlobShuffle balances cost efficiency and latency through configurable batching and a distributed caching mechanism. BlobShuffle is implemented as an add-on for Kafka Streams that requires only minimal code changes to existing applications, leaves Kafka and the underlying infrastructure unmodified, and preserves Kafka Streams' consistency and correctness guarantees. In a large-scale experimental evaluation on a Kubernetes-based AWS deployment, we show that BlobShuffle can reduce shuffling costs by more than 40x compared to native Kafka Streams shuffling while keeping the 95th percentile shuffle latency below 2 seconds. Moreover, it scales to processing more than 2 GiB/s without encountering a scalability limit in our experiments, indicating that BlobShuffle can economically support shuffle-intensive workloads at large scale.
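The batch-then-notify pattern described above can be sketched in a few lines. Everything here is a hypothetical stand-in: a Python dict plays the role of the object store, a list plays the role of the notification channel, and the class name, key layout, and count-based flush rule are assumptions, not BlobShuffle's actual API or configuration.

```python
import json


class ShuffleBatcher:
    """Buffer records per target partition; write full batches to the store."""

    def __init__(self, store, notify, max_records=2):
        self.store = store        # stand-in for cloud object storage
        self.notify = notify      # stand-in for the notification channel
        self.max_records = max_records
        self.buffers = {}
        self.seq = 0

    def add(self, partition, record):
        buf = self.buffers.setdefault(partition, [])
        buf.append(record)
        if len(buf) >= self.max_records:
            self.flush(partition)

    def flush(self, partition):
        buf = self.buffers.pop(partition, [])
        if not buf:
            return
        key = f"shuffle/p{partition}/batch-{self.seq}"  # hypothetical key layout
        self.seq += 1
        self.store[key] = json.dumps(buf)
        # Only this small message travels through the messaging layer.
        self.notify({"partition": partition, "key": key, "count": len(buf)})


def read_batch(store, notification):
    """Downstream side: fetch the batch a notification points to."""
    return json.loads(store[notification["key"]])


store, notifications = {}, []
batcher = ShuffleBatcher(store, notifications.append)
batcher.add(0, "a")
batcher.add(1, "b")
batcher.add(0, "c")  # partition 0 reaches max_records and is flushed
first_batch = read_batch(store, notifications[0])
```

A larger `max_records` means fewer object-store writes per record but a longer wait before a record becomes visible downstream, which is the cost/latency trade-off the abstract attributes to configurable batching.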

cs.PF 2026-06-03

Feedback loop cuts token estimate error 39% in multi-tenant LLM serving

by Kathiravan Palaniappan

DriftSched: Adaptive QoS-Aware Scheduling under Runtime Token Drift for Multi-Tenant GPU Inference

Shortest-job-first then delivers 42% lower median latency than FIFO when GPU resources are contested.

The rapid growth of large language model (LLM) inference services has increased the demand for efficient multi-tenant GPU scheduling. While modern inference runtimes such as vLLM improve throughput through continuous batching and optimized memory management, accurately estimating the runtime cost of heterogeneous inference requests remains challenging. In practice, admission-time workload estimates may deviate from observed execution behavior, leading to workload misclassification, queue imbalance, increased tail latency, and degraded Quality-of-Service (QoS). This paper presents DriftSched, a QoS-aware scheduling framework for multi-tenant LLM inference serving on NVIDIA L4 GPUs. DriftSched combines workload classification, token-budget estimation, tenant-aware queue management, and an online feedback mechanism to refine workload estimates using runtime observations. The framework evaluates FIFO, Priority, Weighted, Shortest-Job-First (SJF), and Aging Priority scheduling policies under heterogeneous multi-tenant workloads. Experimental results show that adaptive calibration reduces workload estimation error by an average of 38.8% (MAE) and 40.5% (RMSE), improving workload classification stability. Among all evaluated schedulers, SJF achieves the best overall performance, reducing median end-to-end latency by approximately 42% and P99 latency by approximately 16% relative to FIFO under sustained GPU contention. The results further indicate that scheduler selection has a greater impact on latency behavior than runtime calibration alone, while accurate workload characterization largely eliminates systematic estimation drift. This work contributes a reproducible framework for studying workload-estimation fidelity and QoS-aware scheduling in multi-tenant GPU inference systems.
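One way to picture how an online feedback loop interacts with Shortest-Job-First is a per-class correction factor applied to admission-time token estimates. The sketch below is an assumption-laden illustration: the exponential-moving-average update, the `TokenCalibrator` class, and the sample numbers are invented for this example and are not DriftSched's actual calibration rule.

```python
class TokenCalibrator:
    """Per-class multiplicative correction learned from observed token counts."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha  # weight given to the newest observation
        self.scale = {}     # workload class -> correction factor

    def estimate(self, workload_class, raw_estimate):
        return raw_estimate * self.scale.get(workload_class, 1.0)

    def observe(self, workload_class, raw_estimate, actual_tokens):
        ratio = actual_tokens / raw_estimate
        old = self.scale.get(workload_class, 1.0)
        self.scale[workload_class] = (1 - self.alpha) * old + self.alpha * ratio


def sjf_pick(queue, calibrator):
    """Shortest-Job-First on calibrated estimates; queue items are (id, class, raw)."""
    return min(queue, key=lambda job: calibrator.estimate(job[1], job[2]))


calibrator = TokenCalibrator(alpha=0.5)
queue = [("a", "chat", 100), ("b", "batch", 150)]

before = sjf_pick(queue, calibrator)[0]
# A finished "chat" request used 300 tokens against an estimate of 100.
calibrator.observe("chat", raw_estimate=100, actual_tokens=300)
after = sjf_pick(queue, calibrator)[0]
```

Before any feedback, SJF runs `a` first because its raw estimate is smaller. After one observation the "chat" factor becomes 2.0, the calibrated estimate for `a` is 200, and `b` moves ahead.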

cs.LG 2026-06-02

Key-value sharing halves transformer KV cache

by Ali Kayyam, Anusha Madan Gopal +1 more

Do Transformers Need Three Projections? Systematic Study of QKV Variants

Q-K=V keeps quality on 1.2B models with 3.1 percent perplexity rise and stacks with GQA for 87.5 percent total savings.

Transformers have become the standard solution for various AI tasks, with the query, key, and value (QKV) attention formulation playing a central role. However, the individual contribution of these three projections and the impact of omitting some remain poorly understood. We systematically evaluate three projection sharing constraints: a) Q-K=V (shared key-value), b) Q=K-V (shared query-key), and c) Q=K=V (single projection). The last two variants produce symmetric attention maps; to address this, we also explore asymmetric attention via 2D positional encodings. Through experiments spanning synthetic tasks, vision (MNIST, CIFAR, TinyImageNet, anomaly), and language modeling (300M and 1.2B parameter models on 10B tokens), we find that the projection-sharing variants perform on par with, or occasionally better than, the standard QKV transformer. In language modeling, Q-K=V projection sharing achieves 50% KV cache reduction with only 3.1% perplexity degradation. Crucially, projection sharing is complementary to head sharing (GQA/MQA): combining Q-K=V with GQA-4 yields 87.5% cache reduction, while Q-K=V + MQA achieves 96.9%, enabling practical on-device inference. We show that Q-K=V preserves quality because keys and values can occupy similar representational spaces and attention operates in a low-rank regime, whereas Q=K-V breaks attention directionality. Our results systematically characterize projection sharing as an underexplored instance of weight tying in attention, with direct, quantifiable inference memory benefits, particularly valuable for edge deployment. The code is publicly available at https://github.com/Brainchip-Inc/Do-Transformers-Need-3-Projections
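The quoted cache reductions follow from simple counting, sketched below. Two points are inferred rather than stated in the abstract: the figures are consistent with 16 attention heads, and "GQA-4" is read here as 4 key-value heads (a 4x head-sharing factor). The function name is hypothetical.

```python
def kv_cache_reduction(num_heads, num_kv_heads, shared_kv):
    """Fraction of KV cache saved relative to full multi-head attention.

    The baseline stores 2 tensors (K and V) for each of `num_heads` heads.
    With K=V only one tensor per KV head is cached.
    """
    tensors_per_head = 1 if shared_kv else 2
    return 1.0 - (tensors_per_head * num_kv_heads) / (2 * num_heads)


HEADS = 16  # assumption: inferred from the 96.9% figure, not given in the abstract

kv_shared = kv_cache_reduction(HEADS, num_kv_heads=HEADS, shared_kv=True)  # Q-K=V alone
with_gqa4 = kv_cache_reduction(HEADS, num_kv_heads=4, shared_kv=True)      # Q-K=V + GQA-4
with_mqa = kv_cache_reduction(HEADS, num_kv_heads=1, shared_kv=True)       # Q-K=V + MQA
```

Under these assumptions the three values come out as 0.5, 0.875, and 0.96875, matching the reported 50%, 87.5%, and 96.9%. The two savings multiply because one halves the tensors per head and the other cuts the number of cached heads.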

cs.AI 2026-06-02

Action gate cuts robot memory writes by 7x at constant size

by Josef Chen

AURA: Action-Gated Memory for Robot Policies at Constant VRAM

Matches base policy success on LIBERO-Long using 7 times fewer writes and fixed 4kB memory.

The KV-cache is the right memory for datacenters but the wrong memory for robots. Datacenter inference batches many short requests and resets them, amortizing an attention cache across a crowd. Embodied agents instead run one long, non-resetting episode on bandwidth-limited edge hardware, where high-bandwidth memory and flash are scarce, flash has finite write endurance, and memory writes rather than compute can become the binding constraint. AURA-Mem (Action-Utility Recurrent Adaptive Memory) targets this regime. It wraps a frozen vision-language-action backbone with a constant-size recurrent memory and a learned gate that writes only when the current observation would change the next action: memory that knows when to stay silent. Unlike reconstruction-based memory, the gate is trained directly against a closed-loop action-error signal. Its inference state is fixed at 4,224 bytes regardless of horizon, while a KV-cache grows to 6,061 times larger at 100,000 steps. On a controlled synthetic benchmark, AURA-Mem matches the best O(1) baseline in accuracy while using 5.19-6.13 times fewer writes, and up to 9.19 times fewer writes on easier configurations. Budget-matched random and periodic schedules do not recover this gain, isolating the benefit to the action-surprise signal. On a trained closed-loop OpenVLA-OFT 7B panel on LIBERO-Long (n=60 episodes per arm), the gate does not hurt success: AURA-Mem matches the ungated base policy (0.233) and slightly exceeds an always-write KV arm (0.217), while using 7.0 times fewer writes and constant memory. We also instantiate an approximate-information-state value-loss bound as a methodology demonstration; at this scale, the bound is vacuous rather than a guarantee.
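The idea of writing only when an observation would change the next action can be illustrated with a toy loop. This is a deliberately simplified sketch: the gate in the paper is learned against a closed-loop action-error signal, whereas the code below swaps in a hand-set threshold on the change in a predicted action, and `predict_action`, the scalar memory, and the blending rule are all hypothetical.

```python
def run_gated_memory(observations, predict_action, threshold=0.1, blend=0.5):
    """Constant-size memory that is written only when the action would move.

    Assumption: a fixed threshold stands in for the learned gate.
    """
    memory = 0.0  # fixed-size state (a single float here)
    writes = 0
    for obs in observations:
        action_without = predict_action(memory)
        candidate = blend * memory + (1 - blend) * obs
        action_with = predict_action(candidate)
        if abs(action_with - action_without) > threshold:
            memory = candidate  # the observation matters: write
            writes += 1
        # otherwise stay silent: no write, memory unchanged
    return memory, writes


# Toy policy: the action is just the memory value.
memory, writes = run_gated_memory([1.0] * 5, predict_action=lambda m: m)
```

Here five identical observations trigger only three writes: once the memory is close enough that another update would shift the action by less than the threshold, the gate stays closed, and the state never grows with the horizon.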

cs.PF 2026-06-02

γ-CounterBoost minimizes response time tails with job types alone

by Nils Charlet, Benny Van Houdt

γ-CounterBoost: Optimizing response time tails using job type information only

Policy uses only type counts to achieve optimal tail among its class, extending Nudge-M to multiple job types.

In a recent paper, the $\gamma$-Boost scheduling policy was shown to minimize the tail of the response time distribution in a light-tailed M/G/1-queue. This policy schedules jobs using a boosted arrival time, defined as the arrival time of a job minus its boost, where the boost of a job depends on its exact job size. The $\gamma$-Boost policy can also be used when only partial job size information is available, such as the type of an incoming job. In that case, the boost $b_i$ of a job depends solely on its type $i$, and $\gamma$-Boost was shown to optimize the tail among all boost policies, where a boost policy is fully determined by the $b_i$ values. In the partial information setting, $\gamma$-Boost relies on two types of information: job types and arrival times. This paper focuses on the problem of minimizing the tail in a light-tailed M/G/1-queue in the partial job size information setting when the scheduler only makes use of the job types and does not exploit arrival times. Prior work showed that with $2$ job types the so-called Nudge-$M$ policy minimizes the tail in a large class of scheduling policies. In this paper, we introduce the $\gamma$-CounterBoost policy in the partial information setting with $d \geq 2$ job types and prove that it minimizes the tail in an even broader class of scheduling policies called Contextual CounterBoost policies. The $\gamma$-CounterBoost policy reduces to the Nudge-$M$ policy when $d=2$.
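As a small illustration of the boosted-arrival-time rule that serves as background here (the type-based boost policy, not $\gamma$-CounterBoost itself, which avoids arrival times), the sketch below orders a fixed set of waiting jobs. The boost values are made up for the example and are not the $\gamma$-derived boosts from the paper; the function name is hypothetical.

```python
def boost_order(jobs, boosts):
    """Order waiting jobs by boosted arrival time = arrival time - boost of type.

    jobs: list of (job_id, arrival_time, job_type)
    boosts: mapping from job type i to its boost b_i (illustrative values)
    """
    ranked = sorted(jobs, key=lambda job: job[1] - boosts[job[2]])
    return [job_id for job_id, _, _ in ranked]


jobs = [("A", 0.0, "long"), ("B", 1.0, "short"), ("C", 2.0, "long")]
boosts = {"long": 0.0, "short": 1.5}  # hypothetical b_i values

order = boost_order(jobs, boosts)
```

Job B arrives after A, but its boost of 1.5 gives it a boosted arrival time of -0.5, so it is served first; with all boosts set to zero the rule reduces to first-come-first-served. The sketch ranks a static snapshot and ignores jobs arriving during service.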

cs.CR 2026-06-02

HQC decoder uses 18x less energy on Snapdragon via vector kernels

by Vu Minh Chau, Nguyen Ngoc Kiet +4 more

Implementation and Optimization of HQC Decoding on NPU-Integrated Devices

Reformulating Reed-Muller and Reed-Solomon steps around HVX offloads the CPU on NPU devices.

Hamming Quasi-Cyclic (HQC) has been selected by NIST for standardization as an additional code-based key-encapsulation mechanism, providing algorithmic diversity alongside lattice-based post-quantum cryptography. Efficient deployment of HQC on mobile and embedded platforms, however, requires careful optimization of its decoding procedure, whose Reed-Muller and Reed-Solomon components dominate the computational cost. This paper studies HQC decoding on Qualcomm Hexagon processors in NPU-integrated devices, focusing on the Hexagon Vector eXtensions (HVX) backend rather than a tensor-inference engine. We observe that HQC decoding naturally exposes vector-structured computation, including Reed-Muller reliability vectors, Hadamard-transform coefficients, Reed-Solomon syndrome vectors, finite-field products, and packed support-point evaluations. Based on this observation, we redesign the dominant decoding kernels around HVX-friendly data layouts and execution patterns, including a vectorized Reed-Muller Hadamard transform, scalar-equivalent peak selection, HVX-oriented finite-field arithmetic, vectorized syndrome computation, and shortened-support locator-root evaluation. We implement and evaluate the optimized decoder using both Hexagon simulator measurements and real-device experiments on a Snapdragon 8 Gen 2 hardware development kit. The results show that Hexagon/HVX-assisted decoding substantially reduces latency and energy consumption, improving energy efficiency by up to $18.13\times$ while significantly offloading host CPU work. These results indicate that NPU-integrated mobile platforms can serve as effective backends for structured post-quantum cryptographic decoding when the underlying kernels are reformulated around vector execution.
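For readers unfamiliar with the Hadamard-transform step in first-order Reed-Muller decoding, here is a textbook scalar sketch on a toy length-8 code. It is not the paper's HVX kernel and does not use HQC's real parameters; the function names and the single-error example are assumptions made for illustration.

```python
def fwht(values):
    """Fast Walsh-Hadamard transform; len(values) must be a power of two."""
    out = list(values)
    h = 1
    while h < len(out):
        for start in range(0, len(out), 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly: sum and difference
        h *= 2
    return out


def peak(coeffs):
    """Return (index of the largest |coefficient|, whether that coefficient is negative)."""
    idx = max(range(len(coeffs)), key=lambda i: abs(coeffs[i]))
    return idx, coeffs[idx] < 0


# Toy codeword: bit at position x is the parity of (x AND message_index).
message_index = 5
received = [1 - 2 * (bin(x & message_index).count("1") % 2) for x in range(8)]  # bits as +1/-1
received[3] = -received[3]  # inject one bit error

index, inverted = peak(fwht(received))
```

Despite the flipped bit, the largest transform coefficient still sits at index 5, which recovers the message. Each butterfly stage applies the same add/subtract pattern to independent pairs, which is the kind of structure that maps onto vector lanes.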
