Busy fractions identify MAP parameters for QBD queues
An EM algorithm derives the necessary expected statistics from utilization intervals alone, allowing parameter estimation without event-leve
Performance
Covers performance measurement and evaluation, queueing, and simulation. Roughly includes material in ACM Subject Classes D.4.8 and K.6.2.
Stochastic Connectivity as the Foundation of a Runtime Model for Microservice Availability Analysis
Monte Carlo on reconstructed graphs and probability measures replaces repeated fault-injection tests
LUMA: Benchmarking Segmentation via a Lightweight Universal Mask Adapter
Fixed LUMA head shows pretraining objectives matter more than architecture across 20 backbones on ADE20K and Cityscapes.
BaseRT: Best-in-Class LLM Inference on Apple Silicon via Native Metal
Native Metal kernels and direct memory access outperform general frameworks across model sizes and quantizations.
A Multi-Dimensional, Per-Pass Empirical Study of the LLVM Optimization Pipeline
Per-prefix measurements on 30 kernels show most gains arrive late and the final config loses on size-speedup for 29 kernels.
The Fourth-Root Complexity of Data Movement
Abstract memory hierarchy shows per-access costs scale as N to the 1/4 for common apps, distinguishing power-law from exponential miss ratio
TraceLab: Characterizing Coding Agent Workloads for LLM Serving
Trace of 4300 sessions finds patterns that suggest specific serving optimizations for agentic LLMs
FBench: A Flexible Benchmark for CFG-Based What-If Exploration of HPC I/O Patterns
Reproduces real workload behavior and shows collective I/O can cut bandwidth 30x on Lustre.
Are There Manufacturer Differences in Hard-Drive Reliability?
Backblaze analysis controls for age, capacity, temperature and form factor to reveal manufacturer differences in HDD failure rates.
Five Ways to Build a Concurrent Linked List: From Coarse-Grain Locking to Lock-Free Algorithms
Benchmarks of five designs show coarse-grain and lazy win on small read-heavy lists while lock-free competes on large ranges and high thread
KernelSight-LM: A Kernel-Level LLM Inference Simulator
Cross-generation tier needs no target data and improves 1.8x over roofline for serving predictions.
High-Performance Resilient Multi-GPU Hybrid Particle-in-Cell Monte Carlo Simulations at Scale
Hybrid MPI+OpenMP framework adds load balancing and ADIOS2 checkpointing for uniform and non-uniform loads on Frontier, MN5, and LUMI-G.
DiStash: A Disaggregated Multi-Stash Transactional Key-Value Store
DiStash coordinates reads and writes on KV copies across DRAM, SSD, and HDD in one atomic step, avoiding separate operations that create inc
Mixed-Precision For Energy Efficient Computations
Reactor and hydrodynamics benchmarks keep accuracy while cutting both metrics.
Cluster, Route, Escalate: Cascaded Framework for Cost-Aware LLM Serving
Clustering assigns queries to cheap models and quality checks escalate hard cases, cutting time per output token.
On-Demand Service Zone Design for Energy-Constrained Spatial Queueing Systems
Energy-constrained hypercube analysis shows zone design must come before battery upgrades, with larger batteries sometimes lowering readines
Compiler-Driven Approximation Tuning for Hyperdimensional Computing
ApproxHDC searches the space of possible approximations to deliver performance gains on CPUs, GPUs, and memory accelerators with little accu
TileMaxSim: IO-Aware GPU MaxSim Scoring with Dimension Tiling and Fused Product Quantization
Tiling and fused quantization let MaxSim read each embedding once, cutting ColBERT scoring latency by 98% on H100 GPUs.
SOLAR: AI-Powered Speed-of-Light Performance Analysis
LLM translates source to intermediate form for analytical computation of theoretical minimum execution times at varying detail levels.
Axon: A Synthesizing Superoptimizer for Tensor Programs
Axon synthesizes kernels for AI accelerators by propagating operators and checking equivalence over unbounded tensor domains.
EmuGEMM: Fused Tensor Core Kernels for Precision Emulation in Matrix Multiplication
By keeping Ozaki intermediates on-chip, EmuGEMM beats cuBLAS TF32 by up to 1.7x on Hopper and Blackwell at matching accuracy.
Above the Inner Loop: Exceeding Accelerate at LLM Prefill GEMM on the M1 AMX
Bit-exact fp32 path using panel threading and weight pre-packing wins all twelve shapes at S=128 and lifts llama.cpp throughput 1.44x.
Power-Flexible AI Data Centers: A New Paradigm for Grid-Responsive Compute
130 kW real-world tests show rapid reductions and load shifting while keeping priority jobs on track
Scheduling jobs with unknown size distribution in a M/G/1 queue: the shifted empirical Gittins
For M/G/1 queues with unknown bounded job sizes, n samples yield indices whose policy matches optimal response time in the large-n limit.
AI-PAVE-Br and the new Golden Set dataset raise accuracy for Portuguese e-commerce catalogs.
Two separate GPU pools let cold models share KV capacity on aggregate demand instead of reserving per-model peaks.
Generation-axis pipelining and trainer-assisted rollout cut bubbles in disaggregated setups for visual generative models.
Serialized secure data movement, not GPU compute, explains the loss on Blackwell platforms under TDX.
LMS-AR: LMS Prediction-based Adaptive Regulator for Memory Bandwidth in Multicore Systems
Prediction from outside the regulated cores lowers contention effects on SPEC benchmarks by enforcing per-core bandwidth allocations.
Memory Layouts for GPU-Data Transfer Buffering in SPH
Access-pattern decomposition of particle data lowers total offloading overhead by 12-25% as transfers dominate runtime.
Single-node tests on SuperMUC-NG Phase 2 find 4-12x throughput gains for molecular dynamics and astrophysics workloads, but gains shrink wit
Learning Filters with Certainty
Counting Bloom Filters keep numbers instead of bits; those numbers improve accuracy when passed to combined machine learning systems.
Enabling Cloud-Level Accuracy in Edge AI through IoT Data Preprocessing
Preprocessing raw readings into text descriptions lets local models handle air quality and comfort queries at 0.22s latency
When Is a Columnar Scan Bandwidth-Bound? A Decode-Throughput Law and Its Cross-Hardware Validation
A one-parameter law predicts the bandwidth fraction columnar scans achieve on x86 and Apple silicon
Apple Neural Engine: Architecture, Programming, and Performance
Datapath, compiler format, weight compression, and command protocol detailed across A11 to A18 and M1 to M5 chips.
Load Testing for Machine Learning Model Serving Systems at Scale
Adaptive load testing framework in 14 case studies reduces under-provisioning incidents and improves GPU use.
KineticSim: A Lightweight, High-Performance Execution Engine for Real-Time Market Simulators
Persistent state in thread blocks and cooperative clearing cut critical path from linear to log-plus-ceil and remove per-step global writes.
Does Mixture-of-Experts Actually Help Inference on Consumer and Edge Hardware? An Empirical Study
Benchmarks find total parameter count, not active parameters, sets inference cost when memory bandwidth is the limit.
UltraQuant: 4-bit KV Caching for Context-Heavy Agents
Yields 2.3x overall first-token speedup and 1.63x higher output rate versus FP8 on multi-turn long-context tasks
Randomized Sketching is Robust to Low-Precision Rounding on GPUs
Across incoherent, coherent and adversarial inputs the sketch distribution, not the quantization rule, determines embedding accuracy.
Group Commit Self-Clocks: Why Tuning Is Unnecessary Above a Device-Set Load Threshold
Closed-loop client behavior makes the optimal wait time fall below flush cost so tuning adds no value
Optimal Calibration of Quantum Network Links
Analytical method for linear repeater chains meets any end-to-end fidelity target by balancing each link's uptime against calibration downti
The Right Call for Software Benchmarking: Consistent Decisions in Stateful Environments
Program-specific biases cancel in simple experiment designs, enabling correct identification of the fastest program without modeling dynamic
Gravity, hydro, cluster and galaxy runs agree within small-scale noise while delivering 2-3x chip-to-chip speedup.
Hypertree split into sub-accumulators enables parallel updates and removes 4.85 PB of yearly network traffic across 6,000 nodes.
Edge-Inference Governors Need Memory-Clock State
Blind GPU-only models miss 25-28% of cycles at tight deadlines; EMC tables select the energy-minimal feasible clock under 2% QoS budget.
The Price of Anarchy in Disaggregated Inference
Saturation raises selfish costs between prefill and decode pools; controller mitigates at 13% throughput cost
Beyond Virtual Delay: Improving Packet Delay Bound in Network Calculus
Maximum packet delay never exceeds maximum virtual delay, so a new bound derived from the curves is strictly better for leaky-bucket and rat
GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving
A runtime reassigns GPUs to running diffusion requests and forms new communication groups in microseconds, cutting latency 95 percent and SL
nomp: A Framework for Building Domain Specific Compilers
Pragma model plus runtime aims to reuse proven patterns so productivity rises without losing performance or portability
Offline knowledge store plus BM25 router lets 1B model answer in 518 ms while keeping the large model idle
From Fork-Join to Asynchronous Tasks: Parallelizing Tiled Cholesky Decomposition with OpenMP and HPX
Explicit dependencies cut barriers and overhead versus fork-join models on 128-core AMD Zen 2 node.
Beyond Per-Token Pricing: A Concurrency-Aware Methodology for LLM Infrastructure Cost Estimation
Calculators that fix utilization as an input understate self-hosting cost by 1/U, most at low enterprise loads.
AI Tokenomics: The Economics of Tokens, Computation, and Pricing in Foundation Models
Value depends on productivity, workflow position, hidden steps, and downstream effects instead of raw counts.
XPR: An Extensible Cross-Platform Point-Based Differentiable Renderer
XPR breaks rendering into modular parallel operations that XLA compiles to GPUs, TPUs and CPUs for methods like 3DGS.
TileFuse: A Fused Mixed-Precision Kernel Library for Efficient Quantized LLM Inference on AMD NPUs
TileFuse maps W4A16 and W8A16 directly onto XDNA2 for 64% lower energy use in Ryzen AI end-to-end tests
Towards Autonomous Accelerator Design: FPGA Accelerator Generation with SECDA
SECDA-DSE automates design space exploration by using retrieval and reasoning to suggest hardware parameters that synthesize and execute suc
Flash-GMM: A Memory-Efficient Kernel for Scalable Soft Clustering
Single-pass Triton code fits 100x larger datasets and cuts ANN distance computations by 1.7x when used in IVF.
Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite
End-to-end embedding, reranking and generation complete with no quality drop and 4x faster queries versus CPU baseline
JSON documents and IEEE P3109 cross-walk give engineers a shared reference to diagnose numeric divergences across accelerators.
AutoMegaKernel: A Statically-Checked Agent Harness for Self-Retargeting Megakernel Synthesis
AutoMegaKernel harness lets agents synthesize retargetable megakernels with static safety checks that match reference outputs.
Correlation Is Not Enough: Embedding Human Metadata for Individual Causal Discovery
Unrelated events like cortisol levels and stock volatility score 0.83 similarity; a pass over 72k pairs plus BODHI hard negatives fix domain
Fairness-Aware and Latency-Controllable Scheduling for Chunked-Prefill LLM Serving
Dynamic priorities from wait time and remaining work plus latency targets replace FCFS to lower mean and tail response times.
An Empirical Comparison of General Context-Free Parsers
Benchmark of six general algorithms on 22 real grammars finds narrow variance and positions GLR as practical default.
By eliminating every intermediate array before code is written, the method reaches the theoretical minimum traffic of O(n_dk + n_dv).
ANNS-AMP: Accelerating Approximate Nearest Neighbor Search via Adaptive Mixed-Precision Computing
Cluster-level precision selection maintains top-k accuracy while cutting energy 1100x versus CPU baselines
Dependencies and Dataflow in Seed-Filter-Extend Pipelines
Synthesizing four prior aligners removes serial constraints so candidate regions run in parallel and local alignments move to GPUs without a
AEGIS: A Backup Reflex for Physical AI
AEGIS hands control to a stronger policy only on steps flagged early by monitoring weak-policy activations, outperforming blind or random sw
Wavelet tree pivots enable SIMD operations and let ANS coding apply to skewed nodes for better ratios at high speed.
Look Before You Leap: Checking In on Type Tag Checking
Microbenchmarks on AArch64 and x86-64 show local bit operations beat heap reads for tags while NaN-boxing saves allocation for floats.
Quantized AI Inference on Constrained Embedded Platforms for Small-Satellite Settings
Characterization treats orchestration as an explicit choice and supplies estimates for multi-core quantized workloads under tight power and
P-Cast Precision in FP8 Attention: Sink-Induced Collapse and the Optimality of S=2⁸
Forward KV order underflows a normal-tail fraction of non-sink probabilities; the reverse order plus scale 256 guarantees none do.
FlashbackCL: Mitigating Temporal Forgetting in Federated Learning
Decayed label counts and class-balanced replay raise accuracy 7-10% over Flashback on CIFAR-10 with 50 clients under temporal shifts.
NetKV: Network-Aware Decode Instance Selection for Disaggregated LLM Inference
NetKV uses a network cost oracle to pick decode instances and cuts TTFT up to 21 percent while lifting SLOs by 20 points.
Batching records and notifications keep 95th percentile latency below 2 seconds at scale
DriftSched: Adaptive QoS-Aware Scheduling under Runtime Token Drift for Multi-Tenant GPU Inference
Shortest-job-first then delivers 42% lower median latency than FIFO when GPU resources are contested.
Do Transformers Need Three Projections? Systematic Study of QKV Variants
Q-K=V keeps quality on 1.2B models with 3.1 percent perplexity rise and stacks with GQA for 87.5 percent total savings.
AURA: Action-Gated Memory for Robot Policies at Constant VRAM
Matches base policy success on LIBERO-Long using 7 times fewer writes and fixed 4kB memory.
γ-CounterBoost: Optimizing response time tails using job type information only
Policy uses only type counts to achieve optimal tail among its class, extending Nudge-M to multiple job types.
Implementation and Optimization of HQC Decoding on NPU-Integrated Devices
Reformulating Reed-Muller and Reed-Solomon steps around HVX offloads the CPU on NPU devices.