Computer Science

cs.SE 2026-07-01

Models reach 13.8% on executable state changes in Scratch tests

by Yufeng Lin, Jialu Zhang

ScratchWorld: Evaluating If World Models Compute Executable Consequences

Benchmark uses verified VM transitions to separate rule-following from copied persistent state.

World-model evaluations often score a predicted future by overlap with a target state or observation. In sparse-change worlds, this can turn copied persistent state into apparent accuracy. We introduce ScratchWorld, an offline diagnostic benchmark that treats Scratch projects as executable worlds and uses a pinned Scratch VM to produce replay-verified transitions, hidden variables, causal traces, and counterfactual outcomes. ScratchWorld evaluates next-state prediction, long-horizon tracking, causal event attribution, and counterfactual prediction; each replay-verified target can be presented under raw-program, structured-state, natural-language, or rendered input modalities, and our experiments use the structured-state condition. Its primary state metric is value-aware changed-field $F_1$, which gives credit only for the changed field and its executed value. In a 659-example release, seven prompted language/reasoning models reach at most 13.8% value-aware changed-field $F_1$ in a state-only partial-observation stress test. A same-instance copy diagnostic makes the overlap confound concrete: copying the input state reaches 98.0% implied full-state field accuracy and 0.0% changed-field $F_1$, with the largest inflation on real projects. Auxiliary diagnostics separate hidden-state rollout drift, intervention sensitivity, causal attribution, and perturbation robustness. Across these settings, models often react to actions or interventions without following the executable rule that determines the changed value.
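The gap between full-state overlap and changed-field credit can be illustrated with a toy example. The following is a minimal sketch, not the benchmark's implementation: the function names, the flat dictionary state, and the convention for the "nothing changed" case are all assumptions made for illustration.

```python
def changed_fields(before, after):
    """Return {field: new_value} for every field whose value differs from `before`."""
    return {k: v for k, v in after.items() if before.get(k) != v}


def value_aware_changed_f1(before, target, predicted):
    """F1 over (field, value) pairs that changed relative to `before`.

    A prediction earns credit only if it names a changed field AND its new value.
    """
    gold = set(changed_fields(before, target).items())
    pred = set(changed_fields(before, predicted).items())
    if not gold and not pred:
        return 1.0  # assumption: correctly predicting "no change" counts as perfect
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def full_state_accuracy(target, predicted):
    """Fraction of all fields whose predicted value equals the target value."""
    return sum(predicted.get(k) == v for k, v in target.items()) / len(target)


# Toy sparse-change world: ten fields, one of which changes after the action.
before = {f"var{i}": 0 for i in range(10)}
target = dict(before, var3=7)

copy_baseline = dict(before)        # copies the input state unchanged
wrong_value = dict(before, var3=5)  # right field, wrong executed value
correct = dict(before, var3=7)

print(full_state_accuracy(target, copy_baseline))             # high overlap
print(value_aware_changed_f1(before, target, copy_baseline))  # no changed-field credit
print(value_aware_changed_f1(before, target, wrong_value))
print(value_aware_changed_f1(before, target, correct))
```

In this toy setup the copy baseline matches nine of ten fields yet earns no changed-field credit, which mirrors the overlap confound the abstract describes.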
cs.RO 2026-06-29

Physics models cut table tennis ball prediction error by 59%

by Christian Conti, Bilan Yang +10 more

Physics Models for Sim-to-Real Transfer in Professional-Level Robot Table Tennis

Aerodynamic, buckling, and residual-contact models enable RL policies that compete against professional players after sim-to-real transfer.

At competitive speeds and spins, a table tennis ball follows complex, counterintuitive trajectories that a robot must track and precisely counter within fractions of a second. Training a reinforcement learning policy capable of these skills is prohibitively expensive and dangerous in the real world, making high-fidelity simulation essential. Transferability of such policies, however, critically depends on how faithfully the simulation captures real-world dynamics, a requirement made even more stringent by the adversarial nature of the game, where any modeling inaccuracy becomes an exploitable weakness for the opponent. Prior state-of-the-art in robot table tennis generally focuses on a limited range of velocities and spins and fails to capture the richness of ball behaviors encountered in professional-level play. In this work, we present physics models for aerodynamic ball flight, ball-table contact, and ball-racket contact that accurately capture the ball behavior over a vast range of speeds and spins relevant to the game. Specifically, we model drag and Magnus force coefficients as functions of Reynolds number and spin ratio in the aerodynamics equations. For the table contact model, we model effects of ball buckling on the coefficient of restitution and incorporate residuals into the instantaneous point-contact models. For the racket contact model, we introduce a residual neural network component to complement coefficients related to normal and tangential coefficients of restitution as well as torsional spin damping. Evaluated on an unprecedentedly large dataset of competitive matches (277 games), the proposed models significantly reduce prediction errors (e.g., 59% median landing-position error reduction). The resulting models were used to train the RL policies for the first real-world robot table tennis AI agent capable of competing against professional players.
cs.IT 2026-06-26

Multi-distribution functionals reduce to integrals of coincidence divergences

by Akshay Balsubramani

All you need is log

Monotonicity under data processing and additivity on independent products force every such functional to an integral over four strata

Comparing two probability distributions is a basic building block of statistics and machine learning, and the right family is well understood: the Rényi divergences of order $\alpha\in[0,\infty]$ are the unique family monotone under data processing and additive on independent products. Many problems instead compare more than two distributions at once -- multi-population fairness, multi-prior PAC-Bayes bounds, multi-hypothesis testing -- and the right multi-distribution generalization of the Rényi family has been an open question. We characterize it. Every functional of $W$-tuples of distributions that is monotone under data processing and additive on independent products is a positive integral of multi-way coincidence divergences $C_{\alpha}(\pi_1,\dots,\pi_W) := -\log\int \pi_1^{\alpha_1}\cdots\pi_W^{\alpha_W}$ (with $\sum_k \alpha_k = 1$) over a parameter space with four strata: the simplex interior; mixed-sign exponent cones (the analogue of Rényi orders $>1$); a tropical boundary at infinity carrying max-divergences; and pairwise Kullback-Leibler edges at the simplex vertices. Each stratum is necessary -- the destination of an explicit data-processing-monotone, product-additive divergence the others cannot reproduce -- and each is a clean limit of simplex-interior atoms. The same family arises from several independent routes -- the structural axioms, Kolmogorov-Nagumo means with Rényi's entropy axiomatics, classical entropy characterizations, multi-hypothesis testing error exponents, and a multi-lottery betting interpretation -- structural evidence that this is the canonical multi-distribution Rényi calculus rather than an artefact of any one axiomatic input. The two-prior case recovers the standard Rényi result; a worked $W=3$ instance, numerical verification, and a conditional extension round out the treatment.
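As a quick check of how the two-distribution case connects to the familiar family, the coincidence divergence with $W=2$ can be compared against the usual Rényi divergence. The short derivation below is a sketch under the standard definition $D_\alpha(\pi_1\|\pi_2) = \frac{1}{\alpha-1}\log\int \pi_1^{\alpha}\pi_2^{1-\alpha}$ and is restricted to $\alpha\in(0,1)$; the normalization convention is an assumption here, since the abstract does not state the one the paper uses.

```latex
% W = 2 with exponents (\alpha_1, \alpha_2) = (\alpha, 1-\alpha), so \alpha_1 + \alpha_2 = 1.
\begin{align*}
C_{(\alpha,\,1-\alpha)}(\pi_1,\pi_2)
  &= -\log\int \pi_1^{\alpha}\,\pi_2^{1-\alpha} \\
  &= (1-\alpha)\cdot\frac{1}{\alpha-1}\log\int \pi_1^{\alpha}\,\pi_2^{1-\alpha} \\
  &= (1-\alpha)\,D_{\alpha}(\pi_1\,\|\,\pi_2), \qquad \alpha\in(0,1).
\end{align*}
```

Under that convention, each simplex-interior atom for $W=2$ is a positive rescaling of a Rényi divergence of order $\alpha$.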
cs.CV 2026-07-03

LLM-planned 3D paths keep object identities stable in long videos

by Hanlin Wang, Hao Ouyang +11 more

WorldDirector: Building Controllable World Simulators with Persistent Dynamic Memory

By separating motion planning from rendering, the system maintains exact visual consistency even after objects leave the frame for extended

We present WorldDirector, a highly controllable video world model framework designed for persistent dynamic object memory and unrestricted viewpoint exploration. Unlike existing world models that entangle physical dynamics with pixel rendering and rely on continuous visual observation to sustain motion, our framework explicitly decouples semantic motion orchestration from visual generation. By leveraging an LLM to coordinate 3D trajectories with camera movements and subsequently employing these orchestrated trajectories as control signals for video generation, our approach ensures strict physical logic and appearance stability, successfully preserving the exact visual identities of dynamic entities even when they re-enter the scene after prolonged periods out of view. Experimental results demonstrate that our method supports the synthesis of complex and extended events with unprecedented controllability and persistent dynamic object memory. Project Page: https://worlddirector.github.io/
cs.CV 2026-07-03

Three alignments convert any input into consistent 4D

by Qiaowei Miao, Kehan Li +2 more

Alignment Is All You Need For X-to-4D Generation

Align4D matches video and multiview priors through distance search, joint constraints, and decoupled training.

Generative diffusion models excel at synthesizing high-quality images, videos, and 3D content under multimodal control. However, arbitrary user-defined modality-to-4D (X-to-4D) generation remains challenging due to the high cost of constructing diverse datasets and the limited scalability of existing methods. This paper presents Align4D, a flexible framework that translates any-modal input into coherent video-3D pairs, using video to guide 4D motion and 3D data to shape 4D geometry. Align4D introduces three key techniques: (1) Object Distance Alignment, which searches Video-Aligned and Multiview-Aligned Object Distances (VAOD/MAOD), respectively, to reconcile 4D renderings with video and the priors of multiview diffusion models; (2) Motion-Geometry Joint Alignment, which constrains known and unknown views through synchronized video and 3D inputs, ensuring consistent 4D generation; and (3) Asynchronous Optimization, which decouples Gaussian attribute and deformation network training to enhance motion and geometry fidelity. We further propose the X4D dataset, which integrates prompt, image, video, and 3D data for benchmarking. Experiments on X4D and Consistent4D demonstrate that Align4D achieves state-of-the-art quality and consistency in X-to-4D generation. Project page: https://miaoqiaowei.github.io/Align4D/.
cs.AI 2026-07-03

Stateful monitor halves gradual-attack evasion in persistent AI codebases

by Josh Hills, Ida Caspary +1 more

Distributed Attacks in Persistent-State AI Control

Link-tracker plus trajectory monitors drops evasion from 93% to 47% where no single monitor stops both gradual and concentrated attacks.

As AI coding agents become more autonomous, they increasingly ship code iteratively, with the codebase persisting across sessions. This persistence creates a new attack surface: a misaligned or prompt-injected agent can distribute attacks across pull requests (PRs) and time its payload for the PR with the best natural cover. To study the resulting dynamics, we introduce Iterative VibeCoding, a setting for AI control, the study of safely deploying capable but potentially untrusted AI. In Iterative VibeCoding, a coding agent builds software over a sequence of PRs in a persistent codebase while pursuing a covert side task. Our benchmark includes two task families, CLI tools and Flask web services, across 20 total task variations. We use Claude Sonnet 4.5 as the attack agent and GPT-4o as the monitor. We compare gradual attacks, which distribute the side task across PRs, against non-gradual attacks concentrated in a single PR. No single monitor is robust to both: which strategy evades best (success while evading the monitor) depends on the monitor type, so a defender cannot close off both gradual and non-gradual attacks with any one monitor. High evasion (>= 65%) generalizes across attack-agent model backends (Sonnet 4.5, Gemini 3.1 Pro, Kimi K2.5), confirming this is a property of the persistent-state attack surface rather than a single model's capability. Evasion also remains high across state-of-the-art monitor models, and the gap between gradual and non-gradual evasion widens for more capable models. We introduce a stateful link-tracker monitor that tracks suspicious buildup across PRs. On both task families, it detects gradual attacks substantially better than diff monitors that merely see more accumulated history. Combining this stronger monitor with trajectory monitors in a four-monitor ensemble reduces gradual-attack evasion from 93% under the weakest standard diff monitor to 47%.
cs.CV 2026-07-03

Plain pixel diffusion on point maps beats latent models for 3D geometry

by Haofei Xu, Rundi Wu +8 more

PointDiT: Pixel-Space Diffusion for Monocular Geometry Estimation

A ViT trained from scratch directly on raw patches yields sharper results without tokenizers or hybrid losses.

State-of-the-art single-image 3D reconstruction methods often rely on complex hybrid architectures and loss functions, or compress geometry into latent spaces in order to leverage pre-trained latent diffusion models. In this work, we show that such architectural overhead and intricate loss formulations are unnecessary. We introduce a minimalist pixel-space Diffusion Transformer, built on a plain ViT, that operates directly on raw 3D point map patches and is conditioned on image tokens from a pre-trained DINOv3. Unlike existing latent diffusion approaches, we train our diffusion backbone entirely from scratch, eliminating the need for point map tokenizers. Despite its simplicity, our approach surpasses complex latent-based diffusion models while remaining significantly simpler than hybrid alternatives. Notably, it produces sharper geometric structure and is more robust in highly ambiguous regions, such as transparent objects.
cs.CL 2026-07-03

Testbed shows unlearning often misses the parameters holding data

by Matteo Boglioni, Thibault Rousset +3 more

LACUNA: A Testbed for Evaluating Localization Precision for LLM Unlearning

LACUNA places PII in known weights so researchers can measure whether methods erase knowledge at the source or only change outputs.

LLMs memorize sensitive training data, including personally identifiable information (PII), creating a pressing need for reliable post hoc removal methods. Unlearning has emerged as a promising solution, with state-of-the-art (SOTA) methods often following a localize-first, unlearn-second paradigm that targets specific model parameters. However, existing benchmarks evaluate unlearning solely at the output level, leaving open the question of whether unlearning truly erases knowledge from a model's parameters or merely obfuscates it, a concern reinforced by the success of resurfacing attacks. To bridge this gap, we introduce LACUNA: the first unlearning testbed with ground-truth parameter-level localization. LACUNA injects PII of synthetic individuals into predefined parameters of 1B and 7B OLMo-based models via masked continual pretraining, enabling direct evaluation of whether unlearning targets the weights responsible for knowledge storage. We use LACUNA to benchmark current SOTA unlearning methods and find that, despite strong output-level performance, existing methods are highly imprecise and susceptible to resurfacing attacks. We further show that when localization is successful, even a simple gradient-based unlearning method achieves strong erasure and robustness to resurfacing attacks, highlighting the importance of precise unlearning. We release LACUNA to complement behavioral evaluations and drive further advances in robust, localization-based unlearning.
cs.LG 2026-07-03

Compile fuzzy functions into 23 MB weights

by Wentao Zhang, Liliana Hotsko +4 more

Program-as-Weights: A Programming Paradigm for Fuzzy Functions

A 0.6B interpreter running compiler-generated LoRA adapters matches a 32B model on fuzzy text tasks at 1/50th the memory, entirely offline.

Many everyday programming tasks resist clean rule-based implementation, such as alerting on important log lines, repairing malformed JSON, or ranking search results by intent, and are increasingly outsourced to large language model APIs at the cost of locality, reproducibility, and price. We propose fuzzy-function programming: compiling such a function from a natural-language specification into a compact, locally-executable neural artifact. We instantiate this paradigm with Program-as-Weights (PAW), in which a 4B compiler trained on FuzzyBench, a 10M-example dataset we release, emits parameter-efficient adapters for a frozen, lightweight interpreter. A 0.6B Qwen3 interpreter executing PAW programs matches the performance of direct prompting of Qwen3-32B, while using roughly one fiftieth of the inference memory and running at 30 tokens/s on a MacBook M3. PAW reframes the foundation model from a per-input problem solver into a tool builder: invoked once per function definition, it produces a small reusable artifact whose subsequent calls per function application are cheap and offline.
cs.AI 2026-07-03

Simple threshold monitor matches advanced LLM safety checks

by Mona Schirmer, Metod Jazbec +4 more

Online Safety Monitoring for LLMs

Risk-calibrated thresholding on external verifier signals performs competitively on reasoning and red teaming tasks.

Despite alignment training, LLMs remain prone to generating unsafe outputs at deployment time. Monitoring outputs online and raising an alarm when safety can no longer be assumed is therefore critical. We study a simple real-time monitor that turns a verifier signal from an external model into an alarm decision by thresholding, with the threshold calibrated via risk control. In experiments on mathematical reasoning and red teaming datasets, we show that this simple design is competitive with more advanced monitors based on sequential hypothesis testing.
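The thresholding idea can be sketched in a few lines. This is an illustrative sketch only: the abstract does not specify the calibration procedure, so the conformal-style finite-sample correction `(misses + 1) / (n + 1)`, the score convention (higher means safer), and the function names `calibrate_threshold` and `alarm` are all assumptions.

```python
def calibrate_threshold(scores, unsafe, alpha):
    """Pick the lowest alarm threshold whose corrected miss rate is <= alpha.

    scores: verifier safety scores on a calibration set (assumed: higher = safer)
    unsafe: booleans, True where the output was actually unsafe
    alpha:  tolerated rate of unsafe outputs that raise no alarm
    An alarm is raised when score < threshold.
    """
    unsafe_scores = [s for s, u in zip(scores, unsafe) if u]
    n = len(unsafe_scores)
    # Candidate thresholds in ascending order; misses can only shrink as t grows.
    for t in sorted(set(scores)):
        misses = sum(s >= t for s in unsafe_scores)
        # Assumed conformal-style correction for the finite calibration set.
        if (misses + 1) / (n + 1) <= alpha:
            return t
    return float("inf")  # no finite threshold meets the target: always alarm


def alarm(score, threshold):
    """Online decision: raise an alarm when the verifier score is below threshold."""
    return score < threshold


# Hypothetical calibration data: nine unsafe outputs and three safe ones.
cal_scores = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
cal_unsafe = [True] * 9 + [False] * 3

threshold = calibrate_threshold(cal_scores, cal_unsafe, alpha=0.2)
print(threshold, alarm(0.3, threshold), alarm(0.9, threshold))
```

Returning the lowest admissible threshold keeps the alarm rate on safe outputs as small as the risk target allows, which is one reasonable design choice among several.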
cs.AI 2026-07-03

Evidence replay lifts long-context reasoning in LLMs without training

by Yanjun Zhao, Ruizhong Qiu +7 more

ReContext: Recursive Evidence Replay as LLM Harness for Long-Context Reasoning

RECONTEXT replays model-selected evidence recursively before answering to better use 128K inputs across models.

Understanding and reasoning over long contexts has become a key requirement for deploying large language models (LLMs) in realistic applications. Although recent LLMs support increasingly long context windows, they often fail to use relevant evidence that is already present in the input, revealing a gap between context access and effective context utilization. In this work, we propose Recursive Evidence Replay as LLM Harness for Long-Context Reasoning (RECONTEXT), a training-free inference method for improving long-context reasoning. RECONTEXT uses model-internal relevance signals to construct a query-conditioned evidence pool and replays it before final generation while preserving the full original context. This recursive selection process separates evidence organization from answer generation without training, external memory, or context pruning. We also provide a theoretical analysis based on associative memory, which characterizes the context as a memory store, the question as a retrieval cue, attention as cue-trace association, and replay as trace reactivation. Experiments on eight long-context datasets with 128K context length show that RECONTEXT consistently improves evidence utilization across Qwen3-4B, Qwen3-8B, and Llama3-8B, achieving the best average rank on all three backbones. Code is available at https://github.com/Yanjun-Zhao/ReContext.
cs.CV 2026-07-03

Blocking cross-noise attention leaves or raises diffusion model performance

by Dengyang Jiang, Mengmeng Wang +2 more

From SRA to Self-Flow: Data Augmentation or Self-Supervision?

Dual-timestep inputs without interactions between noise levels match or beat full Self-Flow, pointing to augmentation as the driver.

Representation alignment has become an effective way to accelerate diffusion transformer training and improve generation quality. Recent self-alignment methods, such as SRA and Self-Flow, further remove the dependency on external pretrained encoders by constructing alignment within the diffusion model itself. However, the mechanism behind the improvement from SRA to Self-Flow, dual-time scheduling, remains under-examined: Self-Flow attributes its gain to interactions between tokens at different noise levels, where cleaner tokens help infer noisier ones. In this work, we revisit this explanation and ask whether the gain instead comes from data augmentation along the noise dimension. To disentangle these factors, we introduce Attention Separation, which preserves the same dual-timestep input as Self-Flow while blocking attention between tokens assigned to different noise levels. Surprisingly, removing such interaction does not degrade performance and can even improve it, suggesting that the improvement from SRA to Self-Flow mainly comes from data augmentation. Furthermore, we show that Attention Separation itself provides an augmentation effect by splitting a single image into multiple effective training parts to expand the training data. Based on these observations, we combine self-representation alignment with dual-timestep and attention-separation augmentation, and demonstrate the effectiveness of this design on ImageNet.
cs.AI 2026-07-03

Social context produces 40% public-private split in LLM agents

by Arman Ghaffarizadeh, Danyal Mohaddes +2 more

What LLM Agents Say When No One Is Watching: Social Structure and Latent Objective Emergence in Multi-Agent Debates

Dual-channel tests show relational pressures create decision divergence absent from isolated prompts

LLM agents will increasingly act in socially structured settings where role, audience, and relational context can shape what is advantageous or costly to say. We study whether such social structure, without any explicit objective in the prompt, changes what an agent expresses publicly relative to an off-the-record (OTR) channel elicited under the same condition. We introduce a dual-channel debate framework in which agents produce public utterances that enter the shared history alongside OTR responses that are recorded but never shown to the other participant. Across 10 models, 3 scenarios, and 5 variations within each scenario, alignment-inducing settings produce systematic public-OTR divergence in the targeted agent, with its decision divergence rising from a $\sim$3% baseline to roughly 40%. The effect is consistent across four aggregate analyses: stance, semantic similarity, natural language inference, and survey responses. In some cases, the OTR response explicitly attributes public accommodation to relational pressures, such as career risk or sponsorship obligation. The findings suggest that agent evaluation should extend beyond explicit goals and detect emergent objectives. We present a dual-channel evaluation framework and complementary behavioral measures that operationalize this assessment.
cs.CL 2026-07-03

Reasoning model boosts speaker ID accuracy in TV dramas

by Yuxuan Li, Lingxi Xie +7 more

Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas

Multimodal tool-use lets it handle short lines where audio alone fails, on a new 532K-line benchmark.

Long-form TV dramas present a formidable challenge for comprehensive video understanding, where deciphering complex storylines often relies on speaker recognition, the task of accurately attributing each spoken utterance to its respective character. In this paper, we advance this field through two primary contributions. (1) We introduce DramaSR-532K, a large-scale benchmark comprising 532K annotated dialogue lines across more than 900 unique characters, necessitating the integration of auditory, linguistic, and visual cues for speaker recognition. (2) We propose DramaSR-LRM, a robust approach built upon a large reasoning model (LRM). DramaSR-LRM is designed to autonomously aggregate contextual evidence via multimodal tool-use, synthesizing diverse inputs to achieve high-fidelity attribution. Experimental results demonstrate that DramaSR-LRM significantly outperforms existing baselines, particularly on short utterances where acoustic biometrics are inherently unreliable. All the data and code will be made publicly available at the project page: https://www.github.com/198808xc/DramaSR-LRM.
cs.RO 2026-07-03

Tactile deformation prediction lifts contact task success to 71.67%

by Shuai Tian, Yupeng Zheng +8 more

VT-WAM: Visual-Tactile World Action Model for Contact-Rich Manipulation

A flow-matching model with asymmetric attention and contact gating outperforms prior visual-tactile methods across six real manipulation tas

Contact-rich manipulation requires policies to react to local deformation, pressure, slip, and friction, yet these cues are temporally sparse and often invisible in visual observations. Existing visual-tactile policies usually feed tactile observations directly into action prediction, but rarely model tactile deformation dynamics during action generation. In this paper, we introduce VT-WAM, a Visual-Tactile World Action Model that jointly learns future visual prediction, tactile deformation prediction, and action prediction within a unified flow matching framework. In particular, VT-WAM introduces (1) Asymmetric Mixture-of-Transformers (MoT) attention to bridge a first-frame visual anchor with temporal tactile dynamics, and (2) contact-gated Action-Visual-Tactile Attention Guidance (AVTAG) to encourage action queries to rely on tactile evidence during contact phases. Across six real-world contact-rich manipulation tasks, VT-WAM achieves a 71.67% average success rate, outperforming Fast-WAM by 26.67% and OmniVTLA by 35.84%. Ablations demonstrate that modeling tactile deformation dynamics and guiding contact-phase tactile attention are both important for contact-rich tasks. Project website: https://vt-wam.github.io/.
cs.LG 2026-07-03

DemoPSD reduces leakage in LLM self-distillation via adaptive barycenters

by Yunhe Li, Hao Shi +6 more

DemoPSD: Disagreement-Modulated Policy Self-Distillation

Blending teacher and student distributions by per-token disagreement preserves exploration and improves cross-domain performance on scientif

On-policy self-distillation (OPSD) has emerged as a practical method for training large language models (LLMs) to reason, where a single model acts as both the teacher and the student with different levels of information access. However, recent studies have found that the teacher's dense token-level supervision, conditioned on privileged information, can lead to overfitting to in-domain patterns, suppress exploration, and hurt cross-domain generalization, while also introducing a more fundamental issue: privileged information leakage, where the student encodes answer-dependent shortcuts that are unavailable at test time. We introduce DemoPSD, a novel framework that resolves such problems through the idea of selective adoption of teacher guidance. Instead of fitting the full teacher distribution, DemoPSD steers the student toward a reverse-KL barycenter target, a weighted geometric combination of the teacher and student distributions, that naturally balances learning from the teacher with preserving the student's own reasoning capacity. We measure the difference between their distributions and use such a discrepancy to adaptively control the blending at each token position. We provably show that DemoPSD achieves (1) leakage attenuation, i.e., effective mitigation of privileged information leakage; and (2) exploration preservation, i.e., preservation of exploration capacity under dense token-level distillation. Extensive experiments on SciKnowEval across four scientific fields show that DemoPSD outperforms both GRPO and SDPO while maintaining higher training entropy and robustly generalizing to out-of-distribution GPQA benchmarks.
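The weighted geometric combination mentioned above has a simple closed form: the distribution minimizing $w\,\mathrm{KL}(q\|p_T) + (1-w)\,\mathrm{KL}(q\|p_S)$ is proportional to $p_T^{w}\,p_S^{1-w}$. The sketch below shows that blend for a single token position. The helper names are hypothetical, and the disagreement rule (total variation distance, with more disagreement meaning less teacher weight) is an assumption for illustration; the abstract does not say which discrepancy measure or modulation direction the paper uses.

```python
def geometric_blend(teacher, student, w):
    """Reverse-KL barycenter of two distributions: q ∝ teacher^w * student^(1-w)."""
    unnorm = [t ** w * s ** (1 - w) for t, s in zip(teacher, student)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def disagreement_weight(teacher, student):
    """Hypothetical modulation rule: teacher weight shrinks as total variation grows."""
    tv = 0.5 * sum(abs(t - s) for t, s in zip(teacher, student))
    return 1.0 - tv


# Toy next-token distributions over a three-token vocabulary.
teacher = [0.7, 0.2, 0.1]  # conditioned on privileged information
student = [0.2, 0.5, 0.3]  # conditioned on the question only

w = disagreement_weight(teacher, student)
target = geometric_blend(teacher, student, w)
print(w, target)
```

With `w = 1` the target is the teacher distribution and with `w = 0` it is the student's own distribution, so the per-token weight interpolates between pure distillation and no update pressure.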
cs.RO 2026-07-03

C++ runtime runs embodied models on any robot hardware

by Ling Xu, Chuyu Han +7 more

Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots

Five-layer abstraction replaces per-model Python stacks while preserving 91-100 percent task success and cutting memory use by more than two

Embodied AI models now span vision-language-action (VLA) models and world-action models (WAMs), but practical deployment remains fragmented across model-specific Python stacks, backend assumptions, and robot-side glue code, especially on heterogeneous edge devices. Existing inference runtimes are designed mainly for request-response serving and therefore do not satisfy the runtime contract of embodied deployment: multi-rate execution inside closed-loop control, latency-first batch-1 inference on heterogeneous hardware, and extensible embodied interfaces beyond fixed token I/O. We present Embodied.cpp, a portable C++ inference runtime for embodied models. Based on an architectural analysis of representative VLA models and WAMs, Embodied.cpp captures a shared execution path and organizes it into five layers: input adapters, sequence builders, backbone execution, head plugins, and deployment adapters. The runtime provides modular multi-rate execution, latency-first fused inference, and extensible operator and I/O support, enabling deployment across heterogeneous devices, robots, and simulators through one backend abstraction. We evaluate Embodied.cpp on two VLA models, HY-VLA and pi0.5, and on a preliminary WAM benchmark using a LingBot-VA Transformer block. The VLA deployments achieve successful closed-loop execution with 100.0% and 91.0% task success rates, respectively. The WAM benchmark reduces block memory from 312.2 MiB to 88.1 MiB. These results show that Embodied.cpp improves deployment efficiency while preserving high accuracy across diverse embodied model architectures.
cs.LG 2026-07-03

SOAP and SOAP-Muon beat Adam on ML interatomic potential training

by Gil Harari, Yoel Zimmermann +5 more

Beyond Adam: SOAP and Muon for Faster, Label-Efficient Training of Machine Learning Interatomic Potentials

Matrix optimizers reach higher accuracy in fewer steps, with largest gains under partial force labels.

Machine learning interatomic potentials (MLIPs) have become a hallmark of AI for scientific simulation. While efforts on new architectures and datasets have led to increasingly accurate and general models, the choice of optimizer for training has largely remained unexplored, defaulting to Adam and its variants in the community. Here, we implement and systematically compare a class of recently proposed matrix-structured optimizers, including Muon, SOAP, and the hybrid SOAP-Muon, for training NequIP and Allegro MLIP models. We find that these optimizers can substantially outperform Adam in both convergence speed and final accuracy. SOAP and SOAP-Muon emerge as robust and consistently strong methods, while Muon only provides partial gains relative to Adam. The improvements are particularly pronounced under partial force supervision. Our results indicate that optimizer choice is an overlooked yet impactful design axis for MLIPs.
cs.CV 2026-07-03

Spatial memory guides efficient 360° object search and segmentation

by Song Tang, Shuming Hu +3 more

Seek to Segment: Active Perception for Panoramic Referring Segmentation

PanoSeeker folds narrow views into one panorama so an agent can avoid repeat looks and align for an accurate mask.

Existing referring segmentation models passively process static images captured from fixed perspectives, limiting their applicability in Embodied AI, where agents must perform active perception in continuous 360$^\circ$ environments. To bridge this gap, we introduce a novel task: Active Panoramic Referring Segmentation (APRS). In this setting, an agent is required to adjust its viewing direction ($\Delta\theta, \Delta\phi$) to explore the 360$^\circ$ environment, seeking the object specified by a user instruction for segmentation. To tackle this challenging task, we propose PanoSeeker, a memory-augmented agent for efficient APRS. Rather than relying on heuristic scanning, PanoSeeker integrates a Vision-Language Model (VLM) with EgoSphere, an explicit spatial visual memory. By progressively integrating sequential local observations into a unified 360$^\circ$ representation, EgoSphere enables the agent to plan efficient and non-redundant search trajectories. Once the target is found, the agent performs active viewpoint alignment and outputs the segmentation mask. Furthermore, we curate an expert-annotated search trajectory dataset with memory timelines for Supervised Fine-Tuning, followed by Reinforcement Learning post-training to explicitly optimize PanoSeeker's exploration efficiency. Extensive experiments on our newly established APRS benchmark demonstrate that PanoSeeker achieves superior search efficiency and segmentation accuracy, significantly outperforming adapted state-of-the-art baselines.
cs.RO 2026-07-03

Behavior latents add independent speed and safety control to traffic sims

by Juanwu Lu, Junyu Zhu +1 more

Controllable Sim Agents with Behavior Latents

CNeVA matches top imitation models on Waymo data while enabling monotone per-channel steering without reward hacks.

Realistic traffic simulation requires agents that imitate logged behavior and can also be steered along interpretable axes. Such controllability enables engineers to isolate variables, reproduce specific edge cases, and test autonomous systems without real-world risk. We introduce Controllable Neural Variational Agents (CNeVA), a controllable simulated-agent framework that learns to infer a per-agent Gaussian behavior latent from per-channel discounted returns via a closed-form conjugate variational update, conditioning a rectified-flow trajectory generator trained on a mixed channel-mask curriculum for classifier-free guidance. To tackle scarcity in reward signals, we propose soft eligibility gates that replace hard binary thresholds with smooth exponential decay, preserving the gradient signal for near-threshold agents. On the Waymo Open Motion Dataset, CNeVA attains competitive realism on the benchmark while exposing per-channel controllability that the higher-ranked imitation models lack. Speed- and acceleration-based steering produces monotone responses without stall-induced reward hacking. Safety controllability is monotone and substantial with the introduction of soft eligibility. We manage to achieve steerable map compliance under a context-residual return measure. Furthermore, our experiment demonstrates that steering metrics must be read alongside physical-plausibility guardrails to avoid reward-hacking confounds.
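
The contrast between a hard eligibility threshold and a soft gate with exponential decay can be sketched as below. The abstract does not give the functional form, so the one-sided decay, the `tau` scale, and both function names are assumptions made for illustration, not the CNeVA implementation.

```python
import math

def hard_gate(distance: float, threshold: float) -> float:
    """Binary eligibility: full weight inside the threshold, none outside."""
    return 1.0 if distance <= threshold else 0.0

def soft_gate(distance: float, threshold: float, tau: float = 1.0) -> float:
    """Assumed soft variant: weight 1 inside the threshold, then
    exp(-(distance - threshold) / tau) beyond it, so near-threshold
    agents still contribute a nonzero signal."""
    excess = max(0.0, distance - threshold)
    return math.exp(-excess / tau)

# An agent slightly past the threshold is dropped by the hard gate
# but keeps a reduced weight under the soft gate.
near = 5.5
print(hard_gate(near, threshold=5.0))
print(soft_gate(near, threshold=5.0, tau=1.0))
```

A smaller `tau` makes the soft gate approach the hard threshold; a larger one keeps more distant agents in play.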
cs.CV 2026-07-03

Tuning a few attention heads blocks text attacks on vision models

by Bohan Liu, Wenqian Ye +4 more

Towards Robustness against Typographic Attack with Training-free Concept Localization

Sampling attribution finds lexical-encoding circuits in ViT; direct weight adjustments raise accuracy on attacked images without retraining.

Models trained via Contrastive Language-Image Pretraining (CLIP) serve as the foundational vision encoders for most modern Large Vision Language Models (LVLMs). Despite their widespread adoption, CLIP models exhibit a critical yet underexplored failure mode: irrelevant text appearing within images confounds visual representations, biasing them toward lexical meaning rather than true visual semantics. This robustness issue, commonly described as a Typographic Attack (TA), exposes a vulnerability that poses a significant risk to safety-critical applications such as autonomous driving. To achieve interpretable and effective robustness against TA, we propose a novel, training-free mechanistic interpretability method. Our method provides sampling-based interpretations of hidden state representations and quantitatively attributes semantic versus lexical focus to individual attention heads. Through probabilistic analysis and circuit mining, we isolate specific Vision Transformer (ViT) components that disproportionately encode lexical information, thereby identifying the mechanistic source of TA. We further show that simple interventions applied directly to the identified circuits, without any additional training, can substantially improve robustness against Typographic Attacks in object classification. These interventions, such as selective adjustment of attention weights, also outperform both supervised and training-free defense methods. Our experiments demonstrate that applying the proposed intervention to the vision encoders of several state-of-the-art LVLMs yields substantial gains in Visual Question Answering accuracy under Typographic Attack interference on RIO-Bench. These results confirm both the efficacy and the generalizability of our mechanistic approach. Code is released at https://github.com/Liu-524/SamplingTAR.
cs.AI 2026-07-03

Neural guidance zeros conflicts in Sudoku solvers when hints can be overwritten

by Timo Bertram, Sidhant Bhavnani +4 more

G-RRM: Guiding Symbolic Solvers with Recurrent Reasoning Models

SE-RRMs cut median conflicts to zero and speed backtracking 33 times and Glucose 1.7 times on 9x9 grids if solvers can override bad proposal

In this work, we focus on SE-RRMs, a symbol-equivariant instantiation of RRMs that exhibits improved extrapolation to larger problem sizes. We propose a neuro-symbolic approach, "Guiding with Recurrent Reasoning Models" (G-RRM), which integrates SE-RRMs with symbolic solvers for constraint satisfaction problems. SE-RRMs act as neural solvers that generate full solution proposals and guide classical symbolic solvers, such as backtracking or SAT-based methods like Glucose 4.1 and CaDiCaL 3.0.0, that produce globally correct solutions. Centrally, we investigate when neural guidance with G-RRM improves the search efficiency of symbolic solvers. Our experiments show that the efficacy of G-RRM depends on two conditions: first, the problem instances must have an expansive combinatorial search space to expose potential gains, and second, the solver architecture must be capable of dynamically overwriting its branching choices to recover when neural hints are imperfect. When these conditions hold, guidance drives median conflict counts to zero and yields significant wall-clock speedups: on $9\times9$ Sudoku, where the SE-RRM correctly solves $91.1\%$ of instances, backtracking accelerates by $33.3\times$ and Glucose 4.1 by $1.70\times$ (median, $p<0.001$), with Glucose 4.1 retaining a $1.17\times$ speedup on perfect-hint $25\times25$ grids. In contrast, CaDiCaL 3.0.0, whose runtime is overhead-dominated and which always respects the injected branching hints rather than overwriting them, shows no significant speedup (median $1.02\times$, n.s.) and even a small significant mean slowdown ($0.90\times$) on $9\times9$. These results delineate the regimes in which neural guidance translates into practical speedups.
cs.CL 2026-07-03

Masked prefixes and replay boost VLM accuracy on new images

by Liyan Tang, Fangcong Yin +1 more

Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning

RL training forces models to correct errors using visual evidence instead of text patterns alone.

Large vision-language models can reason over multimodal inputs by generating textual chains of thought (CoT). A key capability exhibited in CoT reasoning is self-reflection: revisiting earlier decisions and correcting previous errors. However, existing LVLMs often fail to properly attend to visual inputs during reflection, limiting their ability to translate feedback into grounded corrections, especially for out-of-distribution images. To address this issue, we propose a novel reinforcement learning training framework VRRL, with two components explicitly designed to elicit visually grounded self-reflection. First, we randomly mask trajectory prefixes during training to emphasize recovery from incorrect intermediate predictions rather than making early mistakes. Second, we introduce buffered roll-ins from an experience replay buffer to expose the model to diverse failure states that it must learn to correct. We evaluate our approach on visual grounding tasks involving tables and charts, as well as spatial navigation benchmarks. While off-the-shelf and conventionally fine-tuned models degrade substantially under distribution shift, our method substantially improves average out-of-distribution accuracy over standard RL and reflection-oriented fine-tuning baselines by using self-reflection effectively.
cs.DL 2026-07-03

Ipseome dataset releases largest free open human identity data

by Jason Jeffrey Jones

Building the Ipseome: Large, Free, Open, Human Identity Data

Assembled as reusable infrastructure with public repositories and versioned files to enable cumulative research.

Shared data accelerates scientific progress. Here, I describe the ipseome -- the largest free and open dataset on the topic of human identity. The dataset is designed as reusable research infrastructure, with publicly accessible data repositories, documented measurement procedures, and versioned files for cumulative research on identity. First, I present the motivation for and the ipseological principles driving construction of the ipseome. Then, each component is introduced and discussed. Finally, I summarize the current state of progress toward the ultimate goal.
cs.CV 2026-07-03

Descriptor-free localization cuts rotation error by 89%

by Yejun Zhang, Xinjue Wang +3 more

GeoMix: Descriptor-Free Visual Localization via Global Context and Multi-Detector Training

GeoMix adds global context nodes and trains across detectors to narrow the gap with appearance-based methods while preserving privacy and st

Descriptor-free visual localization eliminates high-dimensional descriptor storage, preserves scene privacy, and simplifies map maintenance, yet its accuracy still lags far behind descriptor-based pipelines. We attribute this gap to insufficient geometric discriminability in geometry-only matching. Without visual appearance, current methods underutilize local geometry cues, lack the global context among keypoints, and overfit to a single keypoint detector. We further observe that descriptor-free matching naturally enables multi-detector training, as heterogeneous keypoints can be optimized in a shared geometry-only space without aligning descriptor spaces. Building on these insights, we propose GeoMix, a descriptor-free 2D-3D matching framework that strengthens geometric discriminability at three levels. Locally, directional and distance-aware embeddings enrich neighborhood aggregation with fine-grained spatial structure. Globally, learnable context nodes aggregate and redistribute scene-wide information via cross-attention to resolve ambiguities beyond local receptive fields. At the training level, Mix-Training exploits this detector-agnostic geometry space to learn representations across multiple keypoint detectors. Extensive experiments on MegaDepth, Cambridge Landmarks, 7Scenes, and Aachen Day-Night show that GeoMix sets a new state of the art among descriptor-free methods, reducing 75th-percentile rotation error by 89% and translation error by up to 90% over the previous best, while generalizing zero-shot to unseen detectors and narrowing the gap to descriptor-based pipelines. Code is available at https://github.com/YejunZhang/Geomix.
cs.CV 2026-07-03

Entropy filtering removes noise so VLMs keep detail with fewer tokens

by Xuehui Wang, Xuankun Yang +1 more

Combating Textual Noise and Redundancy: Entropy-Aware Dense Visual Token Pruning

Vision-language models preserve fine-grained cues under strict token budgets by cleaning scores first then applying spatially-aware submodul

Visual token pruning is a crucial strategy for accelerating VLMs by compressing redundant image patches, yet existing methods often fail to preserve critical cues under dense instructions and fine-grained queries. In this paper, we investigate this failure and identify two underlying bottlenecks: the widespread dispersion of textual noise that corrupts dense cross-modal scoring, and the feature fragmentation inherent to standard token selection. To address these issues, we propose Entropy-Aware Dense Pruning (EADP), a framework that reformulates pruning as a structured compression problem. EADP first leverages statistical entropy to quantify and filter out textual noise, yielding a robust, fine-grained instruction relevance score. Subsequently, instead of naive Top-K selection, EADP casts token selection as a submodular maximization problem with a spatial prior, explicitly ensuring a holistic and non-redundant visual representation. Extensive experiments demonstrate that EADP improves the accuracy-efficiency trade-off of VLMs, robustly preserving fine-grained visual cues under strict token budgets while achieving SoTA performance on challenging multimodal benchmarks.
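
Why submodular selection differs from naive Top-K can be illustrated with a small greedy sketch. The objective here is a relevance-weighted facility-location function with a Gaussian spatial kernel; that objective, the `sigma` parameter, and the function name are assumptions chosen for illustration, since the abstract does not specify EADP's actual objective or its entropy-based scoring step.

```python
import math

def greedy_select(relevance, positions, k, sigma=1.0):
    """Greedily maximize f(S) = sum_j relevance[j] * max_{i in S} sim(i, j),
    a monotone submodular coverage objective with a spatial kernel."""
    n = len(relevance)

    def sim(i, j):
        (xi, yi), (xj, yj) = positions[i], positions[j]
        return math.exp(-((xi - xj) ** 2 + (yi - yj) ** 2) / (2 * sigma ** 2))

    covered = [0.0] * n   # best similarity to any selected token so far
    selected = []
    for _ in range(min(k, n)):
        best, best_gain = None, -1.0
        for i in range(n):
            if i in selected:
                continue
            # Marginal gain: how much token i improves coverage of every token j.
            gain = sum(relevance[j] * max(0.0, sim(i, j) - covered[j]) for j in range(n))
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
        covered = [max(covered[j], sim(best, j)) for j in range(n)]
    return selected

# Two spatial clusters of patches; the two highest-scoring patches sit in the same cluster.
positions = [(0, 0), (0, 1), (5, 5), (5, 6)]
relevance = [1.0, 0.9, 0.8, 0.1]
print(greedy_select(relevance, positions, k=2))  # [0, 2]
```

Top-K by relevance alone would keep patches 0 and 1, which are neighbors; the greedy coverage objective keeps one patch from each cluster instead.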
cs.CV 2026-07-03

Global priors raise 360° search accuracy nearly eightfold

by Jingtao Xu, Zizhuo Lin +3 more

EAGLE-360: Embodied Active Global-to-Local Exploration in 360°

Starting from a full panoramic view and narrowing locally replaces myopic scans and improves error recovery in wrapped environments.

While Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in standard visual understanding, adapting them for active visual search in 360$^\circ$ panoramic environments exposes fundamental limitations. Specifically, standard MLLMs struggle to effectively model inherent panoramic properties, such as severe polar distortion and continuous cylindrical topologies, which significantly degrades target detection accuracy. Consequently, existing panoramic search methods attempt to compensate by relying heavily on fragmented local viewpoints. Burdened by rigid initialization and a lack of global panoramic priors, these approaches suffer from myopic, inefficient exploration and struggle with robust error recovery when targets are out of view. To overcome these challenges, we propose EAGLE-360, a novel Embodied Active Global-to-Local Exploration framework. Rather than performing exhaustive local searches, EAGLE-360 leverages global priors to establish an initial holistic perspective, iteratively reasoning and progressively narrowing the search space. Architecturally, we adapt RoPE Rolling, a coordinate-shifting positional encoding mechanism, to seamlessly model the continuous topologies of panoramas. To facilitate this paradigm, we construct the large-scale EAGLE-360 dataset, comprising 14,000+ 4K panoramas and 70,000+ rounds of high-quality VQA dialogues. By employing a training pipeline that integrates Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), we effectively elicit complex spatial reasoning and tool-calling capabilities. Extensive experiments demonstrate that EAGLE-360 establishes a new state-of-the-art for 360$^\circ$ visual search, achieving nearly an 8-fold increase in accuracy over the base model while significantly enhancing exploration efficiency.
eess.SY 2026-07-03

Sliding mode law docks 3D vehicles using range and sight angles

by Ram Milan Kumar Verma, Shashi Ranjan Kumar +1 more

Docking of Autonomous Vehicles with a Stationary Docking Station in 3D Space

Finite-time controller aligns orientation and brings speed to near zero for safe station approach.

In this letter, we present a strategy for autonomous docking of autonomous vehicles in three-dimensional space. Docking is a safety-critical task and requires expert piloting skills. Vehicles with autonomous docking capabilities are highly desirable in various applications, such as marine vehicle docking, aerial vehicle docking, spacecraft docking, and landing. To dock autonomously with the docking station, the vehicle must align itself to a specific desired orientation relative to the docking station and also reduce speed as it approaches. The vehicle must reach near-zero speed to dock successfully and safely without colliding with the docking station. Inspired by ideas from the guidance literature, we present a finite-time sliding-mode-based strategy to achieve both objectives. The range and line-of-sight kinematic relations describing the motion of the vehicle with respect to the stationary docking station are used to steer the vehicle to the desired orientation for docking. This docking strategy is validated in MATLAB® simulations for various initial locations and orientations of both the vehicle and the docking station.
math.PR 2026-07-03

Subcritical percolation gives spin mixing time log N / lambda

by Alexandre Stauffer, Oskar Vavtar

Mixing times of spin systems on dynamical percolation

When edge flips are slow the combined chain equilibrates in time proportional to log of system size divided by the flip rate.

We study the mixing times of stochastic spin systems corresponding to nearest-neighbour Glauber dynamics on dynamical percolation, defined on the $d$-dimensional torus of side-length $N$. In this model, the status of each edge (open or closed) updates independently at rate $\lambda>0$, according to $\mathrm{Ber}(p)$ samples. Simultaneously, the spin of each site updates at rate $1$ according to Glauber dynamics on the environment restricted to open edges. We show that for a relatively general class of nearest-neighbour systems, as long as $p<p_c(d)$, for any temperature, if $\lambda$ is sufficiently small, the mixing time is of order $\frac{\log N}{\lambda}$. This Markov chain is non-reversible, and the proof is obtained by developing a particular coupling that couples together local configurations whenever the environment behaves well.
cs.RO 2026-07-03

QuadRocket achieves almost global trajectory tracking with adaptive control

by Pedro Santos, Joel Reis +2 more

QuadRocket: An Aerial Robotic Testbed for Adaptive Thrust-Vector Control of Rocket-Like Vehicles

The quadrotor-based rocket prototype models the vehicle as an axisymmetric body to enable disturbance rejection in thrust-vector control.

This paper presents QuadRocket, a quadrotor-based rocket prototype that provides a low-cost, low-risk platform for validating advanced thrust-vector control strategies for launch vehicle-type systems. The prototype consists of a cylindrical main body mounted on top of a quadrotor through a universal joint, forming a flying inverted pendulum with non-negligible inertia. For control design, the coupled system is modeled as a single axisymmetric rigid body actuated by a vectored force applied along its longitudinal axis. A reduced-attitude representation on the two-sphere is adopted to explicitly exploit the vehicle's axial symmetry and to decouple yaw from the thrust-vector direction. On this model, we derive an adaptive backstepping controller that achieves almost global trajectory tracking in the presence of unknown constant disturbances, while a control-point transformation mitigates non-minimum-phase behavior. The quadrotor is then treated as a thrust-vector actuator, and a dynamic-surface-based attitude controller is designed to track the desired thrust vector, accounting for actuation dynamics and avoiding explicit differentiation of virtual control signals. The complete architecture is evaluated in simulation and validated experimentally in an indoor motion-capture arena. Results demonstrate accurate trajectory tracking and effective disturbance compensation, and confirm the suitability of the QuadRocket as a versatile testbed for thrust-vector-controlled robotic vehicles.
cs.CL 2026-07-03

Narration acoustics predict audiobook appeal beyond title

by Shahar Elisha, Mariano Beguerisse-Díaz +1 more

Audio-Based Understanding of Audiobook Narration Appeal

Vocal features extracted from recordings remain tied to view-rate and engagement after title controls are applied.

Narration is central to the audiobook listening experience, shaping how listeners engage with and understand the content. This work explores how narration qualities shape an audiobook's appeal, noting that their effects can vary by genre, title, and audience. We extract vocal and acoustic features (e.g., tone, pace, loudness) from LibriVox using pre-trained audio models and analyse their relationship with consumption data (specifically, view-rate) and their interplay with genre and title. Despite limited consumption data, we find that acoustic information alone has a robust association with appeal, even after accounting for title effects. We further validate these findings using more nuanced proprietary engagement metrics. To our knowledge, this is the first systematic computational study linking narration qualities, genre, title, and audiobook consumption, highlighting the potential of data-driven insights to improve audiobook personalisation and narrator casting.
cs.RO 2026-07-03

Quadrotor learns to intercept using only camera direction vectors

by Michael Anoruo, Xiaoyu Tian +6 more

Learning Agile Intruder Interception using Differentiable Quadrotor Dynamics

Differentiable dynamics replace point-mass models and raise success rate by 30 percent at speeds up to 10 m/s without needing distance or po

This paper presents a methodology for learning a control policy to intercept an intruder using the 3D direction unit vector to the intruder and the interceptor state. Prior deep reinforcement learning approaches assume either relative position or distance to the intruder is available, but this information is not readily accessible in real-world applications that employ passive, monocular camera sensors. Instead, we propose a solution that leverages an analytical policy gradient method using differentiable quadrotor dynamics to learn agile interception at speeds up to 10 m/s. The proposed approach outperforms baseline methods that utilize simplified point mass dynamics by an average of 30%.
cs.CV 2026-07-03

Anchored flow removes clouds while keeping semantics intact

by Ziyao Wang, Maonan Wang +6 more

Interpretation-Oriented Cloud Removal via Observation-Anchored Residual Flow with Geo-Contextual Alignment

By tying reconstruction to the original observation and a vision-model manifold, the method raises both image quality and accuracy on segmen

Cloud removal (CR) is essential for optical remote sensing, serving as a prerequisite for reliable downstream interpretation, such as semantic segmentation and change detection. However, existing CR approaches often prioritize visual realism while overlooking their impact on subsequent analytical tasks, leading to semantic drift and degraded downstream performance. To address this issue, we propose Geo-Anchored Cloud Removal (GACR), a unified framework that jointly ensures faithful reconstruction and robust interpretability. At its core, GACR incorporates Observation-Anchored Residual Flow (OAR-Flow), which reformulates CR as a physically grounded residual inversion process. By anchoring the generative trajectory to the cloudy observation rather than pure noise, OAR-Flow enables fast, stable, and faithful reconstruction. To further preserve semantic structures critical for downstream interpretation, GACR integrates Geo-Contextual Prior Alignment (GCPA) to constrain the reconstruction within a semantic manifold induced by a Vision Foundation Model (VFM). Consequently, GACR strictly maintains the spatial-semantic integrity of complex landscapes. Extensive experiments across six CR datasets and twelve downstream tasks demonstrate that GACR produces superior reconstruction quality while consistently improving downstream task accuracy. The code is available at https://github.com/wzy6055/GACR.
cs.SE 2026-07-03

Benchmark tracks test changes after code commits

by Jiale Amber Wang, Kaiyuan Wang +1 more

TestEvo-Bench: An Executable and Live Benchmark for Test and Code Co-Evolution

TestEvo-Bench uses real commit data and execution checks to measure agent success at 77 percent on generation and 74 percent on updates.

Software tests and code evolve together: a code change should be followed by new or updated tests that record the new software behavior. Yet existing test generation and update benchmarks often isolate the test from the code change, and rely on static metadata that does not verify whether a test is executable or semantically tied to the code change. This makes it difficult to evaluate whether a test automation agent understands how a code change should propagate into the test suite. We introduce TestEvo-Bench, a benchmark of test and code co-evolution tasks mined from software repositories, with two tracks: in test generation, the agent shall write new tests to capture the new software behavior; in test update, the agent shall adapt failing existing tests to the changed software behavior. Each task is anchored to a real commit history and packaged with environment configuration to support execution-grounded metrics such as pass rate, coverage, and mutation score. TestEvo-Bench is also a live benchmark: each task records the timestamp of the test and code changes, and new tasks are periodically mined by our automated pipeline, so evaluation can be restricted to tasks postdating a model's training cutoff to reduce data leakage risk. The current snapshot contains 746 test generation and 509 test update tasks, curated from 59,950 candidate co-evolution records across 152 open-source Java projects. We experiment with four state-of-the-art agents that combine strong harnesses (Claude Code, Gemini CLI, and SWE-Agent) with strong foundation models (Claude Opus 4.7 and Gemini 3.1 Pro). Results show that they achieve up to 77.5% success rate on test generation and 74.6% on test update. However, success rate is materially lower on the most recent benchmark tasks and drops significantly under limited per-task cost.
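
Restricting evaluation to tasks that postdate a model's training cutoff amounts to a timestamp filter, sketched below. The record fields (`id`, `track`, `changed_on`), the sample dates, and the `postdating` helper are hypothetical; the abstract does not describe the benchmark's actual schema.

```python
from datetime import date

# Hypothetical task records, not the real TestEvo-Bench format.
tasks = [
    {"id": "t1", "track": "generation", "changed_on": date(2025, 3, 1)},
    {"id": "t2", "track": "update", "changed_on": date(2025, 11, 20)},
    {"id": "t3", "track": "generation", "changed_on": date(2026, 2, 5)},
]

def postdating(tasks, cutoff):
    """Keep only tasks whose recorded change is strictly after the training cutoff."""
    return [t for t in tasks if t["changed_on"] > cutoff]

fresh = postdating(tasks, cutoff=date(2025, 6, 30))
print([t["id"] for t in fresh])  # ['t2', 't3']
```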
cs.CY 2026-07-03

Collaborative traits decide who beats AI models in forecasts

by Vivienne Ming

Human Capital, Not Model Benchmarks, Predicts Hybrid Intelligence in Forecasting

Prediction market data shows most people match or lag the model, while those high in humility and curiosity reach or exceed market accuracy.

Whether pairing people with AI helps or hurts is usually reported as a single average effect. Using a real-money prediction market (Polymarket) as an objective, externally resolved benchmark, this pilot shows that the value of human-AI collaboration depends on a specific, measurable form of human capital. Analyzed at the level of the individual forecaster, hybrid performance is trimodal: most people either deferred to the model (matching it) or used it to rubber-stamp a prior guess (performing worse than the model alone), while a minority engaged in genuine complementary reasoning and reached accuracy matching or even exceeding (i.e., lower error than) the market itself. Collaborative traits (perspective-taking, intellectual humility, and curiosity) rather than raw cognitive ability or model benchmarks, distinguished who reached that mode. The results are preliminary but statistically robust, and motivate a pre-registered replication now in preparation.
cs.RO 2026-07-03

Unlabeled robot play pretraining matches million-expert VLA models

by Junhao Shi, Siyin Wang +4 more

Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs

By learning movement skills first from cheap interactions then aligning to language with minimal labels, the method cuts labeled data needs

Vision-Language-Action (VLA) models are fundamentally bottlenecked by the scarcity of expert demonstrations -- triplets of observations, instructions, and actions that are costly to collect at scale. We argue that this bottleneck stems from conflating two distinct learning objectives: acquiring physical competence (how to move) and acquiring semantic alignment (what to do). Crucially, only the latter requires language supervision. Building on this Decomposition Hypothesis, we propose Task-Agnostic Pretraining (TAP), a two-stage framework that first learns transferable motor priors from cheap, unlabeled interaction data -- including discarded off-task trajectories and autonomous robot play -- via a self-supervised Inverse Dynamics objective. A lightweight second stage then grounds these priors in language using minimal expert data. On the SIMPLER benchmark, TAP matches models trained on over 1M expert trajectories while using orders of magnitude less labeled data, yielding a 10% absolute gain over standard behavior cloning. On a real-world WidowX platform, TAP retains 25% success under camera perturbations where internet-scale baselines collapse to 0%, demonstrating that task-agnostic pretraining produces robust, transferable physical representations and offers a scalable path forward for Embodied AI.
cs.AR 2026-07-03

p-MEM samples Gaussians at full memory bandwidth

by Likai Pei, Jiahao Zheng +10 more

Probabilistic Memory for Trustworthy Edge Intelligence

Storing distribution parameters cuts sampling latency by hundreds of times and energy by up to 295x in Bayesian network workloads on CPU and

Probabilistic computation plays an important role in trustworthy edge intelligence to quantify uncertainty, enhance robustness, reconstruct data, and protect privacy, but its adoption is limited by the orders-of-magnitude data throughput gap between Gaussian random number generation (GRNG) and computation, as well as instruction overhead. This paper introduces probabilistic memory (p-MEM), a unified memory primitive that stores distribution parameters, such as mean and standard deviation, and samples directly at the native memory bandwidth, where deterministic data becomes the zero-variance special case. Using a layout-validated p-MEM simulator, we comprehensively explore device choices, memory specifications, and technology nodes, showing that p-MEM can achieve more than 1000 GSa/s/mm^2 GRNG throughput, including memory-array access. Integrated into CPU/GPU systems, p-MEM reduces instruction count by up to 2.19x/4.37x, sampling latency by 562x/3.45x, and energy by 295.5x/3.53x for Bayesian neural network workloads, providing a scalable hardware substrate for trustworthy probabilistic AI.
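
The read semantics described here, where a cell stores distribution parameters and a deterministic value is the zero-variance case, can be modeled in software as below. This is only a behavioral sketch: the `ProbabilisticCell` class and its `read` method are hypothetical names, and nothing here reflects the hardware design or its performance.

```python
import random

class ProbabilisticCell:
    """Software stand-in for a memory cell that stores (mean, std)
    and returns a fresh Gaussian sample on every read."""

    def __init__(self, mean, std=0.0):
        self.mean = mean
        self.std = std

    def read(self, rng):
        # With std == 0 the noise term vanishes and the read is deterministic.
        return self.mean + self.std * rng.gauss(0.0, 1.0)

rng = random.Random(0)
weight = ProbabilisticCell(mean=0.5, std=0.1)   # e.g. a Bayesian network weight
bias = ProbabilisticCell(mean=2.0)              # ordinary deterministic data

samples = [weight.read(rng) for _ in range(3)]  # three different draws
print(samples)
print(bias.read(rng))                           # always 2.0
```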
cs.CL 2026-07-03

Scaling boosts most LLM social simulations but stalls on biases

by Caleb Ziems, William Held +4 more

Will Scaling Improve Social Simulation with LLMs?

Tests on 120 models show rapid gains for common opinions yet slower or absent progress on forecasts and risk aversion.

Large Language Model (LLM) social simulations are a promising research method, but they are not yet faithful enough to be adopted widely. In this work, we investigate whether the current scaling paradigm in language modeling is likely to close these gaps, or whether simulation fidelity is orthogonal to general capabilities and therefore deserving of more research attention. We use scaling laws to study the relationship between LLMs' compute scale, general capability benchmarks, and the fidelity of social simulation in three representative sub-domains: opinion modeling, behavioral simulation, and longitudinal forecasting. Surprisingly, we discover strong compute scaling in all three settings, using a suite of 85 transformer LLMs with the Qwen3 architecture pre-trained on the DCLM web text corpus under fixed-compute budgets from $10^{18}$ to $10^{20}$ FLOPs. Then we evaluate 35 larger and more capable open-weight models up to 70B parameters, allowing us to predict downstream accuracy from loss. This reveals that the majority of behavioral and opinion simulation tasks will rapidly improve with scale, particularly when they involve populations that are well-represented in English web corpora. Longitudinal forecasting and underrepresented opinions scale more slowly, especially when they are less correlated with general knowledge and reasoning benchmarks like MMLU. In behavior simulation, scaling fails to improve model calibration with human cognitive biases like risk aversion, as well as human heuristics like learning correlated rewards from related tasks. On these tasks, even fine-tuned models fail to noticeably scale up performance from 0.5B to 8B parameters. Taken together, we conclude that scale will improve social simulations in most settings, but outliers exist, and improvements will be less reliable in low-resource domains.
cs.CV 2026-07-03

Rotation fixes one codebook for all DiT steps

by Donghyun Lee, Jitesh Chavan +6 more

OrbitQuant: Data-Agnostic Quantization for Image and Video Diffusion Transformers

A randomized permuted block-Hadamard transform creates a data-independent basis that serves every timestep and modality.

Diffusion transformers (DiTs) achieve state-of-the-art image and video generation, but their multi-step sampling and growing parameter count make inference expensive. Post-training quantization (PTQ) is the natural remedy, yet DiT activations shift across timesteps, prompts, and guidance branches, forcing prior methods to re-fit calibration data for every new checkpoint or modality. We present OrbitQuant, a data-agnostic weight-activation quantizer that bypasses range estimation by quantizing in a normalized, rotated basis. In this basis, a randomized permuted block-Hadamard (RPBH) rotation concentrates each coordinate around one fixed, known marginal regardless of the input, so a single Lloyd-Max codebook serves all timesteps, prompts, and layers of a given input dimension. We extend the same quantizer to weight rows offline, absorbing the rotation into the weights so that it cancels inside each linear layer and only a forward rotation on the activations remains at runtime. The same recipe transfers from image to video with no per-modality tuning. Across FLUX.1, Z-Image-Turbo, Wan 2.1, and CogVideoX, it sets the state of the art for PTQ at several low-bit settings. It also pushes PTQ of image diffusion transformers to W2A4 with usable generation quality.
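The rotation step described above can be illustrated with a minimal sketch. It assumes the RPBH transform is a random permutation followed by random sign flips and an orthonormal Hadamard transform applied per block; the sign flips, the function names, and the block size are illustrative assumptions, not details taken from the paper, and the normalization and Lloyd-Max codebook steps are omitted.

```python
import math
import random

def fwht(block):
    """Orthonormal fast Walsh-Hadamard transform of a power-of-two-length list."""
    out = list(block)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]

def make_rotation(dim, block_size, seed=0):
    """Draw the random parts once: a permutation and per-coordinate signs (assumed)."""
    assert dim % block_size == 0 and block_size & (block_size - 1) == 0
    rng = random.Random(seed)
    perm = list(range(dim))
    rng.shuffle(perm)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
    return perm, signs, block_size

def rotate(x, rotation):
    """Permute, flip signs, then apply the Hadamard transform block by block."""
    perm, signs, bs = rotation
    mixed = [signs[i] * x[perm[i]] for i in range(len(x))]
    out = []
    for s in range(0, len(mixed), bs):
        out.extend(fwht(mixed[s:s + bs]))
    return out

def unrotate(y, rotation):
    """Invert rotate(); the orthonormal Hadamard transform is its own inverse."""
    perm, signs, bs = rotation
    mixed = []
    for s in range(0, len(y), bs):
        mixed.extend(fwht(y[s:s + bs]))
    x = [0.0] * len(y)
    for i, p in enumerate(perm):
        x[p] = signs[i] * mixed[i]
    return x
```

Because every step is orthogonal, the rotation preserves the vector norm and can be undone exactly, which is the property that lets a rotation be absorbed into the weights offline.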
cs.LG 2026-07-03

Neuron activations select data for label-free LLM self-distillation

by Zhuowei Chen, Xiang Lorraine Li

Neuron-Aware Data Selection for Annotation-Free LLM Self-Distillation

The approach raises specialized-task accuracy while avoiding the out-of-domain drop and calibration problems of earlier output-only methods.

Post-training large language models (LLMs) without real-world interaction feedback or human-labeled supervision remains challenging, particularly in specialized domains where expert annotations are costly to obtain. Recent annotation-free self-evolution methods address this by using the model's own outputs as supervision signals, constructing a teacher via additional context and aggregating predictions across multiple rollouts through majority voting to produce pseudo-labels. However, these approaches are not without drawbacks: SFT- and GRPO-based variants suffer out-of-domain performance degradation, while reward-based on-policy RL inflates calibration error. In this paper, we propose Neuron On-Policy Self-Distillation (Neuron-OPSD), a data-centric framework for annotation-free self-distillation that leverages internal neuron activations to guide both training-data selection and teacher context construction. The model is then trained via on-policy distillation from the teacher distribution, requiring no ground-truth labels at any stage. Across specialized-domain benchmarks, Neuron-OPSD improves in-domain task performance while preserving cross-domain generalization and mitigating calibration collapse over prior annotation-free baselines. This framework is particularly relevant to settings where online interaction or external supervision is costly or infeasible, and is conceptually distinct from offline RL approaches that rely on logged, reward-labeled trajectories.
cs.CL 2026-07-03

Language models shape the culture they measure

by Kent K. Chang

Language Models as Measurement Apparatus for Culture

The apparatus of model, data, annotation, and evaluation draws boundaries that define what counts as cultural reality.

Language models are increasingly used to quantify cultural phenomena, but what makes such measurement distinctively cultural? This paper argues that NLP work on culture is a material-discursive practice: the apparatus -- model, data, annotation, evaluation -- participates in constituting the cultural reality it measures, rather than passively recording it. Drawing on Karen Barad's concept of the agential cut -- the contingent boundary between phenomenon and instrument -- I show that the apparatus's substantive design choices draw such boundaries, and that the boundary is entangled from the start because language models have already internalized much of the cultural material they measure. I illustrate this through three case studies on television and film dialogue (measuring structure, interaction, and deviation) and three examinations of the apparatus itself (erasure of cultural markers, attunement to historical material, and agency in an agentic workflow). This big picture analysis proposes a research program that is theory-driven, empirically rigorous, and culturally contingent, treating each agential cut as a conscious commitment, at once methodological and ethical.
eess.SY 2026-07-03

Torque tuning steers nonholonomic vehicle to source orbit

by Bo Wang

Nonholonomic Source Seeking by Torque Tuning: Local and Semi-Global Feedbacks

Two feedback laws achieve local and semi-global stability from scalar sensor data alone, without position or gradient information.

This paper studies source seeking for a torque-controlled nonholonomic vehicle with a laterally displaced scalar sensor. The vehicle has constant forward speed, while its yaw motion is controlled by torque input with unknown inertia and damping. The objective is to steer the vehicle to a source-centered circular motion so that the lateral sensor approaches the unknown source, without using position, heading, source-location, gradient, or source-value information. The proposed torque law combines a fast oscillatory component, which generates averaged steering through symmetric-product approximation, with a slowly tuned bias component, which selects the desired orbit. Two bias-tuning designs are developed. The first is an output-feedback design using only the scalar measurement; it applies a Lie-bracket extremum-seeking update and yields local practical stability. The second is a velocity-assisted design using forward-speed and yaw-rate measurements; it tunes the bias through the yaw-rate tracking error and yields a globally asymptotically stable averaged system, implying semi-global practical stability of the original system. Simulations illustrate the proposed designs.
cs.HC 2026-07-03

Only some LLMs link personality to visualization colors

by Shahreen Salim, Klaus Mueller

When Do LLM Personas Support Visualization Design? A Cross-Model Study of Color Assignment and Chart Choice

Chart top choices match no-persona baselines in eight of nine tests, showing task context drives selection more than personality.

Large language model personas are increasingly used to approximate diverse users during early-stage visualization design, but it remains unclear whether persona-conditioned outputs reflect stable personality effects or artifacts of model choice and task framing. We examine this question across two visualization-relevant tasks: color assignment for abstract and concrete concepts, and chart-idiom preference ratings across task contexts. Using 43 Big Five profiles across GPT-4o-mini, GPT-4.1-mini, and GPT-5-mini, we find that personality-color coupling is highly model-configuration dependent: absent in GPT-4o-mini for all six concepts, consistent in GPT-4.1-mini across all six, and partial in GPT-5-mini for two of six. Concept type further shapes the signal: for abstract concepts, personality explains more hue variance than model identity, while concrete concepts show smaller and comparable effects. In chart choice, trait-aligned cluster aggregation produces stable top-idiom rankings across all nine cluster-context combinations, but a no-persona baseline recovers the same top choice in 8 of 9 model-context cells, indicating that task context drives rank-1 selection more than personality. These findings position LLM personas as exploratory probes for visualization design, not substitutes for human participants, and motivate multi-model testing, concept-type disaggregation, and no-persona baselines in future studies.
cs.MA 2026-07-03

Stars mislead on AI agent framework health

by Xi Zhang (Cisco Systems), Papi Menon (Cisco Systems) +2 more

Adoption and Ecosystem Health: A Longitudinal Analysis of Open-Source Multi-Agent Frameworks

Analysis of 15 frameworks shows contributor density and retention better track adoption than hype-driven star counts.

Since ChatGPT's launch in November 2022, open-source agentic frameworks have proliferated, making framework selection important for engineering teams while obscured by popularity signals such as GitHub stars. This paper analyzes 15 major open-source AI agent framework repositories from late 2022 to early 2026, using 808,042 stars, 73,997 pull requests, 86,241 commits, and 987,330 user profiles to assess ecosystem health across awareness, adoption, and retention. Three findings emerge. First, headline popularity is unreliable. Star counts reflect hype cycles and inorganic activity. AutoGPT gained 111,967 stars in one month but converted fewer than 9 contributors per 1,000 stars, defined as contributor density in this research, compared with LangChain's 41. Lower-profile frameworks such as Pydantic-AI show higher contributor density, indicating deeper adoption. Second, mapping awareness against adoption shows that visibility and engagement diverge. MetaGPT and LangFlow have contributor density ratios below 5 even with their high visibility. Openai-agents-python's limited contributor base suggests institutional backing alone does not ensure community depth. By analyzing cross-framework contribution, we discover that LangChain functions as a shared infrastructure, attracting 82.5% of cross-ecosystem contributors. Third, retention drops most steeply in the first 30 days of initial contribution and stabilizes near 90 days. Overall, ecosystem health is better measured by contributor density, cross-ecosystem engagement, and retention than by stars alone. These metrics offer teams a more robust basis for framework evaluation.
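The contributor density metric defined above (contributors per 1,000 stars) is simple enough to sketch directly. The function name and the counts in the usage note are hypothetical and chosen only for illustration; they are not figures from the paper's dataset.

```python
def contributor_density(contributors, stars):
    """Return contributors per 1,000 stars, as the abstract defines the metric."""
    if stars <= 0:
        raise ValueError("stars must be positive")
    return 1000.0 * contributors / stars
```

For example, a hypothetical repository with 410 contributors and 10,000 stars would score `contributor_density(410, 10_000) == 41.0`, while one with 50 contributors and the same star count would score 5.0.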
cs.CR 2026-07-03

Taxonomy organizes factors in cybersecurity incident response

by Thomas Biege, Marius Brockhoff +4 more

SoK: A Taxonomy for Cybersecurity Incident Response Influence Factors

Review of 457 publications yields a more complete classification than seven prior frameworks and NIST elements.

Cybersecurity incident response has emerged as a critical area of interest for both researchers and practitioners. The corpus of literature on cybersecurity incident response is expanding, yet a unified framework for systematically organizing the accumulated knowledge remains absent. The aspects of incident response span multiple domains, including technology, human-computer interaction, organizational theory, and human factors. A comprehensive, integrative perspective on these factors can enable researchers to identify underexplored areas and more effectively target their empirical and theoretical investigations. Our study systematizes the factors that influence organizational preparedness for and response to cybersecurity incidents. Through a systematic review of academic literature (n = 417) and non-scientific publications (n = 40), we derived the "Cybersecurity Incident Response Influencing Factor Taxonomy" (CIR-IF Taxonomy). Existing empirical findings were classified within this taxonomy, providing a comprehensive and up-to-date overview of knowledge from the period 1999 to mid-2024. The taxonomy categories were systematically compared with seven established scientific frameworks and with the NIST Cyber Security Framework elements referenced in the NIST Special Publication 800-61r3 incident response profile. The results of this comparison show that the CIR-IF Taxonomy delivers a richer, more rigorous, and more systematically organized view of the factors that drive and shape incident response.
cs.MA 2026-07-03

Multi-agent LLMs fix overhangs in FDM CAD models

by Emmanuel George, Christopher Keefe +2 more

AgentsCAD: Automated Design for Manufacturing of FDM Parts via Multi-Agent LLM Reasoning and Geometric Feature Recognition

System parses STEP files, builds topology graphs, and outputs modified geometry plus reports after LLM reasoning and visual checks.

Parts manufactured with Fused Deposition Modeling (FDM) often require Design for Additive Manufacturing (DFAM) modifications to ensure printability, structural integrity, and reduced post-processing. Current slicers identify defects such as steep overhangs but are unable to modify the underlying geometry. This work presents AgentsCAD, a multi-agent system that bridges raw boundary-representation (B-Rep) geometry and Large Language Model (LLM) reasoning to automate targeted DFM. The workflow begins by parsing a STEP file. The agentic system detects overhangs above a 45° threshold, constructs a face-adjacency topology graph, and optionally injects semantic feature labels from a GraphSAGE model trained on MFCAD++ (59,665 parts), before dispatching a Claude Sonnet design-reasoning agent that recommends reorientations, fillets, chamfers, and similar modifications. A GPT-4o vision-language verifier inspects rendered views to confirm geometric integrity. Outputs include a modified STEP file and a human-readable report. A test case on a birdhouse model demonstrates that the system correctly diagnoses overhangs, selects appropriate defect mitigation strategies, and proposes physically valid corrections, partially solving the geometry-to-language translation problem central to LLM-driven CAD modification.
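The overhang check mentioned above can be sketched for a single planar face. This is a hypothetical illustration, not the AgentsCAD implementation: it assumes the overhang angle is measured from the vertical, that the build direction is +Z, and that each face is summarized by one outward normal. The function names are made up for this example.

```python
import math

def overhang_angle_deg(normal, build_dir=(0.0, 0.0, 1.0)):
    """Overhang angle from vertical: 90 for a downward-facing ceiling, 0 for a vertical wall."""
    dot = sum(n * b for n, b in zip(normal, build_dir))
    n_len = math.sqrt(sum(n * n for n in normal))
    b_len = math.sqrt(sum(b * b for b in build_dir))
    # Cosine of the angle between the outward normal and the downward direction.
    cos_down = max(-1.0, min(1.0, -dot / (n_len * b_len)))
    theta = math.degrees(math.acos(cos_down))
    return 90.0 - theta

def is_overhang(normal, threshold_deg=45.0):
    """Flag a face whose overhang angle exceeds the threshold (45 degrees assumed)."""
    return overhang_angle_deg(normal) > threshold_deg
```

Upward-facing faces get a negative angle under this convention and are never flagged. Other tools measure the angle from the horizontal instead, so the comparison direction would need to be adapted.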
cs.LG 2026-07-03

MIM pre-training resists non-IID data better than contrastive learning

by Xuanyu Chen, Nan Yang +2 more

Understanding the Robustness of Distributed Self-Supervised Learning Frameworks Against Non-IID Data

Theoretical analysis shows decentralized SSL robustness grows with network connectivity, placing federated learning on equal footing.

Recent research has introduced distributed self-supervised learning (D-SSL) approaches to leverage vast amounts of unlabeled decentralized data. However, D-SSL faces the critical challenge of data heterogeneity, and there is limited theoretical understanding of how different D-SSL frameworks respond to this challenge. To fill this gap, we present a rigorous theoretical analysis of the robustness of D-SSL frameworks under non-IID (non-independent and identically distributed) settings. Our results show that pre-training with Masked Image Modeling (MIM) is inherently more robust to heterogeneous data than Contrastive Learning (CL), and that the robustness of decentralized SSL increases with average network connectivity, implying that federated learning (FL) is no less robust than decentralized learning (DecL). These findings provide a solid theoretical foundation for guiding the design of future D-SSL algorithms. To further illustrate the practical implications of our theory, we introduce MAR loss, a refinement of the MIM objective with local-to-global alignment regularization. Extensive experiments across model architectures and distributed settings validate our theoretical insights, and additionally confirm the effectiveness of MAR loss as an application of our analysis.
quant-ph 2026-07-03

k-qubit memory forces Θ(n-k) samples for stabilizer testing

by Srinivasan Arunachalam, Louis Schatzki

Optimal Stabilizer Testing and Learning with Limited Quantum Memory

The usual constant-copy tester vanishes; learning costs Θ(n²/k) non-adaptively, so testing and learning match when memory is fractional

We study stabilizer state testing and learning with limited coherent quantum memory. Here an algorithm sequentially receives copies of an unknown $n$-qubit state, but may keep only $k$ qubits of coherent quantum memory between measurements. With unrestricted memory, seminal work of Gross, Nezami and Walter showed how to test $n$-qubit stabilizer states using $6$ copies, which is dimension independent, unlike the learning complexity of $\Theta(n)$. We show that this testing-vs-learning separation is lost under memory constraints. More concretely we show that (1) The sample complexity of testing stabilizer states in the $k$-qubit memory framework is $\Theta(n-k)$. Our upper bound goes via a novel connection to the hidden shift problem and the lower bound is proven using a novel approach to average case bounds on likelihood ratios via combinatorics of the stochastic orthogonal group. (2) The sample complexity of learning stabilizer states with $k$ qubits of memory, in the non-adaptive framework, is $\Theta(n^2/k)$. As a further application of our techniques, we prove an exponential lower bound for purity testing even when the memory may be left coherent throughout the protocol. Our main results identify coherent quantum memory as the resource enabling the usual separation between stabilizer testing and learning. In particular, even with $k=0.99n$ qubits of memory, there is no constant-copy stabilizer tester; furthermore for $k=cn$ qubits of memory (for $0< c < 1$), stabilizer testing is as hard as learning, with both requiring $\Theta(n)$ copies.
cs.DS 2026-07-03

Heavy-edge technique yields 1.622k approx for n-pairs paths

by Avi Kadria, Liam Roditty +1 more

Improved Approximation Algorithms for n-Pairs Shortest Paths

Converts W_uv-dependent approximations to multiplicative ones in Õ(mn^{1/k} + n^{1+2/k}) time

Let $G = (V, E)$ be a graph with $n = |V|$ nodes and $m = |E|$ edges. The $t$-Pairs Shortest Paths problem, introduced by Cohen [FOCS'93; SICOMP'99], asks to approximate the distances between $t$ prespecified pairs of vertices. Recently, this problem has received renewed attention, particularly in the case where $t = \Theta(n)$: the $n$-Pairs Shortest Paths problem. In this setting, new algorithms and conditional lower bounds have been developed by Dalirrooyfard, Jin, Vassilevska Williams, and Wein [FOCS'22], and Chechik, Hoch, and Lifshitz [SODA'25]. In this paper, we present the first algorithm for the $n$-Pairs Shortest Paths problem in weighted undirected graphs that achieves a $(2 - \alpha)k$-approximation, for constant $\alpha > 0$, that runs in $\tilde{O}(mn^{1/k} + n^{1 + 2/k})$ time. Specifically, we present a $1.622k$-approximation, improving upon the $(2k - 3)$-approximation of Chechik, Hoch, and Lifshitz [SODA'25] for graphs that are not super sparse, which answers in the affirmative the open question posed by them. We also develop improved approximation algorithms with better tradeoffs for unweighted graphs and dense weighted graphs that improve upon the results of Dalirrooyfard et al. and Chechik, Hoch, and Lifshitz. Our main technical contribution is the new heavy-edge technique. Using this technique, we transform an algorithm with an approximation guarantee that depends on $W_{uv}$, the weight of the heaviest edge on the shortest path between $u$ and $v$, into an algorithm with purely multiplicative approximation that does not depend on $W_{uv}$.
cs.SE 2026-07-03

Traffic model spots REST API attacks at 82% recall without docs

by Ran Dubin, Amit Dvir

HTTP REST API Structure Learning

HRAL builds endpoint baselines from network data alone, outperforming alternatives when documentation is incomplete and hitting 100% with si

Application Programming Interfaces (APIs) are essential in software development, enabling web services, mobile apps, and microservices. However, their widespread use introduces significant security risks, highlighting the importance of API security. This paper presents HTTP REST API Learning (HRAL), a novel unsupervised anomaly detection approach that models the structure and behavior of API endpoints directly from network traffic, without relying on predefined rules or documentation. HRAL enables robust detection of malicious activity by understanding how APIs behave and flagging deviations as potential threats. We evaluate HRAL across varying levels of OpenAPI documentation detail and compare it with existing techniques. HRAL achieves strong performance, with an average recall of 82.07% and an F1-score of 87.24%, significantly outperforming alternatives when API documentation is limited. Moreover, our results approach the effectiveness of full API document definitions. When combined with signature-based rules such as the OWASP ModSecurity CRS, our system achieves 100% detection. These results highlight HRAL's effectiveness in real-world, partially documented API environments and its potential as a foundational layer for modern API security solutions.
cs.AI 2026-07-03

GPT-5.5 tops EvoPolicyGym on autonomous policy evolution

by Zhilin Wang, Han Song +14 more

EvoPolicyGym: Evaluating Autonomous Policy Evolution in Interactive Environments

Benchmark tests iterative policy editing in 16 RL environments and finds top models succeed by discovering task mechanisms under budget cons

Autonomous agents are increasingly expected to improve executable policies through feedback, yet existing evaluations often collapse this process into a final score or confound it with open-ended software-engineering progress. We introduce Autonomous Policy Evolution, a controlled evaluation setting in which a harness-model agent repeatedly edits an executable policy system under a fixed interaction budget. We instantiate this setting in EvoPolicyGym, a benchmark built from compact interactive RL environments that evaluates how agents iteratively improve explored policies. On the EvoPolicyGym suite, GPT-5.5 achieves the strongest aggregate rank score and top-two performance on all 16 environments. Beyond leaderboard results, EvoPolicyGym also provides trajectory-level diagnostics that distinguish how agents allocate budget and convert feedback into parametric tuning. These analyses show that strong autonomous policy evolution depends not only on isolated task wins, but on discovering task-appropriate mechanisms and refining policies under bounded feedback.
cs.LG 2026-07-03

Extreme attention improves streamflow forecasts

by Sanjeev Shrestha, Hui Liu +1 more

Extreme Adaptive Transformer for Time Series Forecasting

An added attention component for rare peaks yields better 3-day predictions than standard transformers on four hydrologic datasets.

Time series forecasting remains challenging when the underlying data contain rare but critical extreme events. This issue is particularly important in hydrologic forecasting, where streamflow distributions are often highly skewed and extreme peaks can have substantial impacts on flood monitoring, water resource management, and early warning systems. Although Transformer-based forecasting models have achieved strong performance by modeling long-range temporal dependencies, they typically treat all time points uniformly and may therefore underrepresent rare extreme patterns. In this paper, we propose the Extreme-Adaptive Transformer (Exformer), a forecasting framework designed to explicitly model temporal dependencies involving both normal and extreme events. Exformer introduces an extreme-adaptive attention mechanism composed of three sparse components: Local, Stride, and Extreme. The Local and Stride components capture short-term and periodic temporal dependencies, respectively, while the Extreme component selectively models event-aware dependencies between normal and extreme streamflow patterns. Experiments on four real-world hydrologic streamflow datasets show that Exformer achieves superior 3-day forecasting performance compared with state-of-the-art baselines. Our findings demonstrate that explicitly incorporating extreme-aware attention improves the forecasting capacity of Transformer models on imbalanced time series with rare but consequential events.
cs.SE 2026-07-03

Reasoning effort raises perfect agent code runs from 28% to 89%

by Achint Mehta

Reasoning effort, not tool access, buys first-try reliability in agentic code generation: an observational study

90-run study finds extra reasoning cuts corrections fivefold while testing tools add cost without reliability gains.

Agentic coding assistants are increasingly given extra capabilities, such as browser based testing tools and design oriented system prompts, on the assumption that more capability yields better software. This study tested that assumption directly. Ninety independent agent runs built the same application, a real time retrospective board, from one detailed specification, each scored on a fixed 14 criterion functional rubric (42 point maximum) and a visual quality review. The runs spanned several model generations, two agent harnesses, two reasoning effort levels, a testing tool, and two design oriented prompts. Capability tier dominated: frontier models clustered near the ceiling while a low cost local model fell to 24 to 37 points. A criterion level analysis revealed what run totals conceal. Container deployment was the dominant defect, failing first try in 44 percent of runs, with its failure rate shifting sharply across model generations while mean totals moved less than a point. The testing tool raised cost by 42 to 68 percent without improving functional score or reliability, even on interface visible criteria. Raising reasoning effort from High to xHigh lifted first try perfect runs from 28 percent to 89 percent and cut corrective prompts about five fold, for 9 to 29 percent more cost. A design oriented prompt raised visual quality, 4.5 versus 3.0 on a 5 point scale, without lifting function, and a one paragraph paraphrase of its directive reproduced the entire lift. The practical lesson is to match the fix to the failure: most first run failures came from weak reasoning, which a stronger model or more effort prevents, not from visible flaws a checking tool would catch.
cs.CV 2026-07-03

Multi-expert model cuts OOD false positives on medical scans

by A.S. Anudeep, Vaanathi Sundaresan

MARVEL: Margin-Aware Robust von Mises-Fischer Expert Learning for Long-Tailed Out-of-Distribution Detection

Margin-aware nonlinear von Mises-Fisher experts plus outlier specialist achieve up to 37 percent lower error rates on three datasets.

For clinical deployment, it is essential that automated diagnostic systems remain reliable when confronted with previously unseen cases, yet deep models routinely misclassify out-of-distribution (OOD) inputs with high confidence, underscoring the need for more robust OOD detection methods. Although substantial effort has been devoted to improving model robustness, most of the existing literature assumes balanced datasets, evaluates OOD detection on coarse or non-clinical OOD sources, or lacks comprehensive assessment across diverse OOD scenarios. To address the gaps, we propose a novel methodology trained on diverse and imbalanced medical datasets and evaluated across a clinically reflective OOD spectrum. Our framework comprises three key components: (1) a Nonlinear von Mises-Fisher (NvMF) classifier capable of learning non-linear decision boundaries, with theoretical proof of its asymptotic connection to cosine classifiers; (2) a multi-expert framework in which margin-aware NvMF classifiers specialise in different regions of label distribution to better handle imbalance; and (3) an outlier expert trained explicitly to distinguish inlier from outlier data, thereby strengthening OOD detection. Evaluation on RFMiD, ISIC2019, and NCTCRC datasets demonstrates consistent improvements over state-of-the-art methods, achieving mean FPR95 reductions of 8.45%, 13.02%, and 36.90% respectively. These gains are further supported by comprehensive ablations that validated the contributions of each component. This enables reliable identification of unfamiliar cases for deferral to clinicians, supporting safer AI-assisted diagnosis in real-world workflows. Our code is available at https://github.com/redboxup/MARVEL.
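The FPR95 metric reported above is commonly read as the false positive rate on OOD inputs at the threshold where 95% of in-distribution inputs are accepted. A minimal sketch under that reading follows; the function name and the convention that a higher score means "more in-distribution" are assumptions, not details from the MARVEL code.

```python
import math

def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted at the threshold that keeps `tpr` of ID samples."""
    if not id_scores or not ood_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(id_scores, reverse=True)
    # Number of in-distribution samples that must score at or above the threshold.
    keep = math.ceil(tpr * len(ranked))
    threshold = ranked[keep - 1]
    accepted_ood = sum(1 for s in ood_scores if s >= threshold)
    return accepted_ood / len(ood_scores)
```

With in-distribution scores 1 through 20 and OOD scores `[0, 1, 5, 9]`, the threshold lands at 2 and two of the four OOD samples pass it, giving an FPR95 of 0.5. Lower is better.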
cs.AI 2026-07-03

Gemini matches experts grading bash commands with rubrics

by Manuel Alonso-Carracedo, Ruben Fernandez-Boullon +3 more

Automated grading of Linux/bash examinations using large language models: a four-level cognitive taxonomy approach

Agreement hits 0.89 on basic questions but drops at higher complexity levels, guiding when to mix AI and human review.

Scalable and reliable grading of command-line examinations remains a challenge in computing education, where rising enrolments make manual marking difficult and rule-based autograders cannot handle partial credit, equivalent solutions, or syntactic variation. This paper evaluates whether four frontier Large Language Models (GPT, Claude Opus, Gemini, and GLM) can approximate expert judgment when grading short Linux/bash command responses. The study adopts a four-level cognitive taxonomy that combines cognitive complexity and operational impact, ranging from information retrieval (L1) and basic file manipulation (L2) to structural operations (L3) and advanced system management (L4). The models were tested with two prompt variants, a minimal baseline and a rubric-enhanced version, on 1200 real responses from second-year Computer Engineering students independently graded by three expert instructors. Gemini 3.0 Pro with rubric-guided prompting achieved the highest human-AI agreement (ICC(3,1) = 0.888, MAE = 0.10, Bland-Altman bias = -0.014). Agreement declined consistently as taxonomy level increased, with the largest discrepancies at higher levels. Across all models, rubric quality had a larger effect than provider choice, with structured prompts consistently improving agreement. These results show that question complexity is a reliable predictor of the difficulty LLMs face in grading accurately, and they establish a principled, taxonomy-based framework for determining which questions are suitable for AI-assisted grading and which require human review, while also providing a transferable evaluation protocol and prompt templates.
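Two of the agreement statistics quoted above, MAE and Bland-Altman bias, can be sketched in a few lines. The function names are hypothetical, the sign convention (model grade minus human grade) is an assumption, and the grades in the usage note are made up; ICC(3,1) is left out because it needs a two-way ANOVA decomposition.

```python
def mean_absolute_error(model_grades, human_grades):
    """Average absolute gap between paired model and human grades."""
    if len(model_grades) != len(human_grades) or not human_grades:
        raise ValueError("grade lists must be non-empty and the same length")
    gaps = [abs(m - h) for m, h in zip(model_grades, human_grades)]
    return sum(gaps) / len(gaps)

def bland_altman_bias(model_grades, human_grades):
    """Mean signed difference (model minus human); negative means the model grades lower."""
    if len(model_grades) != len(human_grades) or not human_grades:
        raise ValueError("grade lists must be non-empty and the same length")
    diffs = [m - h for m, h in zip(model_grades, human_grades)]
    return sum(diffs) / len(diffs)
```

For made-up grades `[1.0, 0.5, 0.0, 1.0]` from a model and `[1.0, 1.0, 0.0, 0.5]` from a human, the MAE is 0.25 while the bias is 0.0: the two disagreements cancel in the signed mean, which is why both numbers are usually reported together.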
cs.RO 2026-07-03

Robot RL success rises 28% with closed world-model loop

by Yuquan Xue, Le Xu +6 more

WorldSample: Closed-loop Real-robot RL with World Modelling

Post-trained world model supplies synthetic transitions that Policy-Paced Learning uses to cut training steps by 59% on contact-rich tasks.

Reinforcement learning (RL) can overcome the demonstration-coverage limitation of imitation learning (IL) by allowing robots to improve through trial-and-error interaction beyond the states observed in demonstrations. However, deploying RL on real robots remains constrained by high interaction costs, since each physical rollout is costly and reflects only one realized action-outcome path. To address this challenge, we propose WorldSample, a physically grounded data augmentation framework for real-robot RL that closes a real-synthetic loop between physical rollouts, world-model generation, and policy improvement. Grounded on real rollouts, WorldSample generates high-fidelity synthetic transitions through a post-trained world model, which greatly reduces visual hallucination. Specifically, rather than simply using these transitions as real-world experience, WorldSample introduces Policy-Paced Learning (PPL) to regulate the training process through sample selection and scheduling, balancing useful augmentation against value overestimation and mitigating hallucination-induced noise. Experiments on contact-rich and precision-demanding robot manipulation tasks show that WorldSample improves policy success rate by 28% while reducing training steps by 59% compared with baselines. Furthermore, WorldSample improves world model visual fidelity by 19.4 dB in PSNR and 0.47 in SSIM over demonstration-only post-training, validating the effectiveness of the real-synthetic loop for both policy and world model performance.
cs.HC 2026-07-03

Physical surfaces boost VR touch precision and bimanual use

by Wen Ying, Seongkook Heo

Physical surfaces make touch interactions in virtual reality precise, efficient, and bimanual

Portable tangible surfaces outperform visual and vibrotactile feedback alone on selection accuracy, tracing speed, and sketch quality while

Virtual reality (VR) systems can enable convenient hand-based interactions across diverse work scenarios. However, mid-air gestures lack tactile feedback and a physical reference surface to support the hand. This absence of haptic grounding can cause significant challenges in achieving precise and efficient touch interactions. This paper investigates the effect of different types of hand-grounded haptic feedback on the touch performance of VR tasks that demand high precision, such as selecting, tracing, and sketching. We compared three levels of haptic feedback: 1) No Haptic Feedback, where only visual feedback was provided; 2) Tactile Feedback, where users received vibrotactile and pressure feedback upon touching a virtual surface; 3) Physical Surface, where users interacted with a portable and tangible surface. Our study found that portable physical surfaces enabled the best selection precision, tracing efficiency, and sketch quality. Furthermore, participants showed increased bimanual hand utilization when engaging with a physical surface during tasks. These observed behaviors corresponded to participants' preference for interacting with physical surfaces, attributed to a better sense of confidence and control.
physics.ins-det 2026-07-03

Framework unifies multi-FPGA hardware and software for smart TDAQ

by Roberto Ammendola, Andrea Biagioni +13 more

APEIRON: composing smart TDAQ systems for high energy physics experiments

APEIRON covers device drivers to HLS dataflow models for real-time particle physics triggers like NA62.

We present APEIRON, a distributed heterogeneous processing framework comprising both hardware architecture and software stack for multi-FPGA systems. Targeting smart trigger and data acquisition (TDAQ) systems in high energy physics, APEIRON spans the full software hierarchy: from low-level device drivers to a high-level dataflow programming model based on High-Level Synthesis. We describe the framework design, its core communication infrastructure, and a particle identification application for the NA62 experiment as a representative physics use case.
eess.IV 2026-07-03

Self-auditing drift model leads SSIM in accelerated knee MRI

by Qing Lyu, Jianxu Wang +3 more

Self-Auditing Residual Drifting for Pathology-Preserving Accelerated Knee MRI

It adds per-slice risk scores that flag unreliable outputs while preserving lesion detail at high acceleration.

Accelerated magnetic resonance imaging reduces acquisition time, but reconstruction from undersampled k-space can blur diagnostically relevant structures or introduce failures that are not captured by global image metrics. We propose SA-RDM-DC, a Self-Auditing Residual generative Drifting Model with Data Consistency for accelerated knee MRI. The method adapts the newly proposed generative drifting paradigm to accelerated MRI by training a physics-conditioned drift field from the zero-filled reconstruction toward the fully sampled residual correction. It predicts image- and missing-k-space residual corrections, enforces data consistency with acquired k-space, uses frequency-aware and residual drifting supervision to recover fine detail, and produces dense error maps and slice-level risk scores in the same inference pass. We evaluate SA-RDM-DC on multi-coil fastMRI knee data at acceleration factors of 4, 8, and 12, with fastMRI+ pathology annotations for region-level and classifier-based task preservation, and on SKM-TEA for zero-shot and fine-tuned protocol-shift evaluation. Compared with zero-filled reconstruction, UNet-image-SENSE, DC-UNet, Score-Diffusion, ELF-Diff, SENSE-VarNet, and MoDL baselines, SA-RDM-DC achieves the highest SSIM across fastMRI acceleration factors while retaining subsecond per-slice inference and avoiding the long sampling time of iterative diffusion baselines. In pathology-aware analysis, SA-RDM-DC preserves lesion-region structural fidelity and reduces meniscus prediction instability. Its self-auditing scores strongly identify high-error reconstructions on fastMRI and partially transfer as a selective-review signal under SKM-TEA protocol shift. These results support reconstruction evaluation that jointly considers image fidelity, pathology preservation, runtime, and case-specific reliability.
cs.LG 2026-07-03

Quantum circuit fuses sensors with 72 params in federated learning

by Quoc Bao Phan, Tuy Tan Nguyen

QFedAgent: Quantum-Enhanced Personalized Federated Learning for Multi-Agent Activity Recognition

Replaces 33K classical parameters yet reaches 97.7% accuracy on non-IID wearable activity data.

Federated learning (FL) enables collaborative model training across distributed devices without sharing raw data, making it suitable for privacy-sensitive robotic sensing applications. However, multi-agent systems generate heterogeneous and non-independent and identically distributed (non-IID) multimodal sensor streams that degrade conventional FL algorithms, while classical fusion modules introduce substantial parameter overhead and communication cost. This paper proposes QFedAgent, a hybrid quantum-classical personalized FL framework for multi-agent activity recognition. The approach integrates a variational quantum circuit fusion module that models accelerometer--gyroscope interactions through quantum state encoding and entanglement, requiring only 72 quantum rotation parameters versus 33K in classical multi-layer perceptron-based fusion, achieving approximately 10x total parameter reduction. Experiments on the OPPORTUNITY dataset under subject-based non-IID partitions demonstrate 97.7% mean test accuracy, confirming that parameter-efficient quantum fusion remains competitive with conventional federated baselines.
cs.CV 2026-07-03

Scene graphs forecast how actions reshape environments

by Francesca Pistilli, Simone Alberto Peirone +1 more

Learning to Evolve Scenes: Reasoning about Human Activities with Scene Graphs

A graph model on evolving scene graphs from first-person videos beats video baselines on retrieval and long-horizon reasoning tasks.

Understanding human behavior while interacting with the surrounding world is crucial for many applications of embodied AI. First-person videos are particularly informative for this problem, as they capture well how activities reshape the scene over time. However, existing approaches often rely on implicit visual or language-aligned representations, disregarding structured reasoning over the scene dynamics. We argue that explicit, compositional and editable representations of human-environment interactions can play a crucial role for rich grounded activity understanding. To this end, we introduce SG-Ego, a large-scale annotation set extending Ego4D with spatio-temporal scene graphs, where relation triplets are consolidated over time into explicit time-evolving descriptions of the scene state. To reason over this representation, we propose GLEN, a graph-based model that operates over scene graph sequences to both align them with textual actions and model their temporal evolution. In addition, we formulate the activity-driven graph-edit forecasting (A-GEF) problem, a novel task that casts scene dynamics as a sequence of structured transformations conditioned on ongoing actions, enabling explicit reasoning about how scenes change over time. We validate our approach across multiple downstream tasks, spanning retrieval benchmarks such as EgoMCQ and EgoCVR, as well as long-horizon reasoning benchmarks such as EXPLORE-Bench and the newly introduced A-GEF. GLEN achieves strong results compared to raw video baselines and it excels in reasoning settings, typically addressed only with MLLMs, while enabling controllable and structured predictions of scene dynamics driven by human activities. We believe our results establish spatio-temporal scene graphs, together with models that reason over them, as strong compositional and interpretable representations for video understanding and potentially beyond.
cs.LG 2026-07-03

Neuron activations select stronger few-shot samples for LLMs

by Zhuowei Chen, Liwei Chen +3 more

Neuron-Aware Active Few-Shot Learning for LLMs

By tracking internal patterns for diversity and low consensus, NeuFS cuts annotation cost while beating output-entropy and embedding baselin

Active Few-Shot Learning (AFSL) adapts LLMs to specialized domains by identifying the most valuable unlabeled samples for annotation and use as few-shot demonstrations, effectively reducing human annotation costs while promoting high performance. However, existing methods typically rely on output-level signals for sample identification, such as predictive entropy or semantic similarity to test-time data based on external embeddings, and often overlook models' internal dynamics, which could pinpoint specific knowledge gaps. To bridge this gap, we propose NeuFS, a Neuron-Aware Active Few-Shot Learning framework that shifts the selection paradigm from output-level proxies to models' internal dynamics. NeuFS uses neuron activation patterns to represent samples directly, and includes a dual-criteria selection strategy that (1) ensures few-shot sample diversity with neuron patterns for broader example coverage, and (2) prioritizes informative and challenging few-shot samples on which LLMs tend to hallucinate, identified by quantifying neuron consensus. Experiments on three datasets demonstrate that NeuFS excels in both reasoning and text classification tasks, outperforming existing AFSL baselines. Ablation studies further highlight that internal neuron activations provide a more principled and effective selection signal than external embeddings, validating the superiority of the proposed NeuFS.
cond-mat.dis-nn 2026-07-03

Learning fails to converge in most large random games

by Desmond Chan, Tobias Galla

Complex dynamics in the Sherrington-Kirkpatrick game

Memory-loss rate and competitiveness set whether dynamics settle to one point, many points, or stay volatile.

We study the outcome of adaptive learning of a large number of players engaging in sets of two-strategy two-player games. We are interested in typical games, and generate the payoff matrices at random at the beginning. The payoff matrices then remain fixed during the learning process. This provides a game theoretic foundation for the Sherrington-Kirkpatrick (SK) game, recently introduced by Garnier-Brun, Benzaquen and Bouchaud. The original model by these authors is a special case, with no bias towards any strategy. Here we determine the stability of learning for SK games with general random bias, and find that the nature of the stable state is affected by random fields. We also introduce a grand-canonical version of the SK game, in which players can choose to abstain. We determine the stability of learning for this game. Our analysis confirms that complex situations involving many players are frequently unlearnable, even if each player only chooses between two different actions. The rate at which players lose memory of past payoffs and the competitiveness of the game emerge as key parameters determining whether learning converges to a unique fixed point, whether there are many fixed points, or whether the dynamics remain persistently volatile.
cs.CV 2026-07-03

Wavelet compensation strengthens global edits in inversion-free image editing

by Anqi Tang, Wenhao Sun +1 more

Wavelet-Guided Semantic Signal Compensation for Inversion-Free Image Editing

The frequency-aware strategy boosts early semantic signals while keeping background structures intact.

Text-guided image editing aims to modify visual content according to a target prompt while preserving the background. Recent inversion-free image editing frameworks such as FlowEdit have demonstrated strong editing capability without requiring inversion. Empirically, FlowEdit can achieve substantial semantic changes under appropriate hyperparameter settings. However, we observe that under certain global attribute shifts, the editing trajectory may not effectively move away from the source distribution in the early timesteps. Our analysis suggests that in the high-noise regime, the dominant manifold-seeking flow toward the data manifold can reduce the influence of the text-conditioned direction, leading to limited global modification while background structures remain only moderately preserved. Inspired by this observation, we propose an inversion-free, frequency-aware semantic compensation strategy that strengthens the effective signal in the early stage of generation, while maintaining structural consistency in the background. The proposed method improves global editing capacity without sacrificing background fidelity.
cs.RO 2026-07-03

Model learns intent-driven camera poses from passive video

by Boyang Sun, Jiajie Li +7 more

LIME: Learning Intent-aware Camera Motion from Egocentric Video

LIME mines language intents and view gains from egocentric recordings to train robots on choosing next viewpoints.

Autonomous robots often need to move their camera before they can act: to inspect an object, reveal an occluded region, or obtain a view that responds to a user's intent. While vision-language navigation translates instructions to base motion and vision-language-action policies map instructions to manipulation actions, language-conditioned camera motion remains comparatively underexplored as a first-class action. We formulate language-conditioned camera motion generation: given a current RGB observation and a free-form natural-language intent, predict a relative target camera pose for the next observation. This task is inherently non-trivial: viewpoint changes are driven by latent perceptual intentions, and a valid motion may operate at different semantic granularity, from entering a room to looking around a corner, inspecting a visible object, or revealing an occluded detail. To model this structure, we mine multi-intention camera-motion supervision from egocentric video, pairing plausible intents and observation-gain descriptions with relative SE(3) target poses. We propose LIME, a vision-language camera-motion generator that combines an auto-regressive observation-gain output with a continuous flow-matching pose head. This design lets the model jointly predict what the next view should reveal while representing multi-hypothesis target views. Across experiments and downstream robotic tasks, we show that LIME can learn to actively choose camera poses from passive human video, turning ordinary egocentric recordings into supervision for intent-aware active perception.
cs.CL 2026-07-03

NLP authors shift from core conferences to ML venues

by David Jurgens

The Future of NLP may not be at NLP Conferences: Scholarly Migration Patterns in Natural Language Processing

Established authors lose 19pp at main ACL tracks; new authors raise ML share from 5% to 21% due to citation premiums.

Natural Language Processing (NLP) has traditionally been published in its core disciplinary venues like ACL. However, advances in Large Language Models (LLMs) have led to a blurring of the disciplinary lines between NLP and general Machine Learning (ML), with authors regularly publishing in venues from both fields. Here, we ask whether the disciplinary center of gravity is shifting. Using NLP research published from 2010 to 2026 and studies of both established and new authors, we find that a migration is taking place. First, comparing the pre- and post-LLM eras, established authors lost 19.2pp of share at flagship *ACL main-conference tracks while gaining 14.8pp in the newer Findings tracks, and general ML venues rose 8.6pp, even when adjusting for parallel growth in the fields. Second, among newer authors who debut with at least three first-author NLP-topic papers, the share whose work appears mostly at *ACL venues fell from 84% (2019) to 74% (2024), while the share appearing mostly at general ML venues rose from 5% to 21%. Using causal inference techniques, we estimate that these general ML venues confer a significant citation premium, which influences venue selection. Together, these results point to a significant shift in where NLP research is published.
cond-mat.quant-gas 2026-07-03

Package merges ML detection with BEC image analysis

by M. Doris, S. Guo +6 more

Q-GAIN: A Python Package for Machine Learning and Physically Informed Analysis Applications

Q-GAIN supplies modular tools for classification and object detection, shown on solitons and vortices in cold-atom data.

Here we describe the quantum gas analysis and inference (Q-GAIN) Python package, which enables rapid deployment of machine learning (ML) and physics-informed analysis techniques for cold-atom experiments. Out of the box, Q-GAIN implements classification, object detection, and physics-informed metrics for feature detection in images of atomic Bose-Einstein condensates (BECs). Q-GAIN encourages a natural, module-based workflow: starting with data loading and preprocessing, followed by ML-based feature identification, and ending with conventional analysis techniques. We demonstrate this modularity by configuring Q-GAIN for three ML tasks. First, we demonstrate the basic workflow of the Q-GAIN framework by implementing the standard task of classifying handwritten digits from the MNIST dataset. Then, we re-implement our earlier soliton detection (SolDet) package in the Q-GAIN framework, enabling the detection and analysis of solitonic excitations in time-of-flight data. Finally, we develop an object-detection tool that identifies quantized vortices in images of ring-shaped BECs.
cs.AI 2026-07-03

SPG-Layout generates plausible 3D rooms in angled spaces

by Xianhui Meng, Zirui Song +11 more

Text-Driven 3D Indoor Scene Synthesis in Non-Manhattan Environments

Statistical priors and large-object-first ordering reduce violations when text drives non-rectangular indoor scenes.

Large Language Models (LLMs) have demonstrated remarkable capabilities in 3D indoor synthesis for Manhattan environments. However, existing methods often fail to capture plausible object layout patterns in non-Manhattan settings, primarily because they struggle to model non-orthogonal spatial relationships, leading to high geometric violations and low physical fidelity. To address this challenge, we propose SPG-Layout, a novel text-driven framework designed to generate physically plausible indoor scenes within complex non-Manhattan environments. Specifically, we first utilize statistical priors of object distributions to guide the training process, enhancing environmental understanding and fidelity. Furthermore, mirroring human design workflows, we adopt a hierarchical layout strategy that prioritizes the placement of large objects, thereby substantially minimizing layout violations. By synergizing these components, SPG-Layout achieves a balanced optimization of semantic realism and physical plausibility. To evaluate performance in these complex settings, we constructed a new benchmark comprising 500 diverse non-Manhattan environments. Extensive experiments demonstrate that SPG-Layout consistently and significantly outperforms existing methods across both Manhattan and non-Manhattan environments. The code will be publicly released.
cs.CV 2026-07-03

Object LeJEPA beats image LeJEPA on four tasks with 10-100% COCO

by Jakob Geusen, Ender Konukoglu

Object-centric LeJEPA

Fixed SAM masks let the distributional objective align objects instead of scenes, raising tracking, classification, segmentation and re-iden

Image encoders trained with LeJEPA can deliver strong features for downstream tasks, but, like other image-level self-supervised methods, typically require large training datasets. Aligning representations at the level of objects rather than whole scenes promises greater data efficiency, but doing this in a completely self-supervised way, effectively jointly partitioning a scene and representing its objects, is unstable: the two are locked in a cyclic dependency, partitioning requires meaningful representations, while meaningful representations require consistent partitioning. We sidestep this instability by taking object masks as given during training, using cheap, off-the-shelf SAM proposals. We extend LeJEPA - whose distributional anti-collapse objective ports naturally from whole images to variable-sized sets of objects - to align object-centric representations rather than whole images. An additional instance-separating loss, which treats other objects in the same scene as negatives, further boosts downstream performance. Across two model scales and 10-100% of COCO, object-level LeJEPA outperforms image-level LeJEPA on tracking (DAVIS), classification (ImageNet-1k), segmentation (ADE20k), and re-identification (NAVI).
cs.RO 2026-07-03

Inverse-dynamics check cuts planning compute in world models

by Gawon Seo, Dongwon Kim +1 more

ACID: Action Consistency via Inverse Dynamics for Planning with World Models

ACID penalizes predicted transitions that cannot recover the conditioned action, matching baseline accuracy with far fewer evaluations acros

Decision-time planning with action-conditioned world models has become a popular paradigm for embodied control. However, the standard planning cost judges a candidate solely by how close its predicted terminal state lies to the goal, leaving the realizability of the intermediate transitions unchecked -- a predicted trajectory can look convincing while the environment rollout drifts away from it. In this paper, we propose ACID, a decision-time planning framework that introduces cycle action consistency: the action inferred backward from a predicted transition by an inverse dynamics model should recover the one that was conditioned on. We fold this per-step residual into the planning cost via a scale-invariant adaptive weight. Across four action-conditioned world models and six tasks spanning rigid and deformable manipulation, articulated control, and visual navigation, ACID consistently improves planning and matches the baseline's accuracy with substantially less planning compute.
cs.CV 2026-07-03

Framework infers visual concepts from image example sets

by Nick Stracke, Kolja Bauer +4 more

Show Me Examples: Inferring Visual Concepts from Image Sets

It generates new images that apply the shared concept to a query, outperforming standard vision-language models on accuracy and diversity.

Vision-language models (VLMs) can follow complex textual instructions, yet they struggle to reason from purely visual context. In particular, current models fail to infer shared concepts from sets of example images and apply them to new inputs. We introduce Visual Concept Inference from Sets (VICIS), a task that evaluates this capability. Given a small context set of images sharing a concept and a query image, the model must generate new images that preserve the context-defined concept while remaining consistent with the query. We show that state-of-the-art VLMs perform poorly on this task, often ignoring the visual context or defaulting to biased generations. To address this gap, we propose a training framework and architecture that learn to infer visual concepts from image sets and extract concept-specific embeddings from queries. Experiments on synthetic data and large-scale ImageNet/WordNet data show that our model generates more accurate and diverse outputs and generalizes to unseen concepts and modalities such as sketches.
cs.DC 2026-07-03

FlintKV reaches 75% higher throughput in durable NVM key-value stores

by Sergey Egorov (1, 2) +9 more

FlintKV: A Fast Durable Storage Engine for Modern Databases

The engine adds atomic batches, snapshots and iterators while preserving durable linearizability.

Byte-addressable non-volatile memory (NVM) offers an opportunity to rethink storage engine architectures. While recent NVM key-value stores achieve high throughput for ingestion and point lookups, they omit or under-specify the support for the richer interface guarantees required by modern databases. Production key-value engines (e.g., RocksDB) provide point-in-time snapshots, consistent iterators, and atomic batches, features essential for implementing transactions and concurrency control. We present FlintKV, an NVM-optimized skiplist-based storage engine that natively supports the full API of production key-value stores. FlintKV supports both atomic batch writes and snapshot-consistent iteration efficiently while guaranteeing durable linearizability. FlintKV can be deployed standalone or its durable skiplist can be integrated into existing NVM stores to enhance their capabilities. Central to FlintKV is a novel flat-combining-based concurrency control algorithm that leverages multi-versioning and carefully co-designed persistence mechanisms to ensure high performance and scalability. Our empirical evaluation shows that FlintKV can achieve up to a 75% improvement in end-to-end throughput over prior work.
cs.CG 2026-07-03

Piecewise rational cap volumes give exact ham-sandwich algorithms

by Marie-Charlotte Brandenburg, Jesús A. De Loera +1 more

From Ham-Sandwich to Centerpoints: Semialgebraic Algorithms for Cutting Polytopal Measures

For polytopal measures the cap-volume function is piecewise rational, turning prescribed-proportion cuts into polynomial-time semialgebraic

We design exact algorithms for the ham-sandwich and centerpoint theorems for polytopal measures. Our key observation is that the cap-volume function of such a measure, i.e., the volume cut off by a halfspace, is piecewise rational on a natural decomposition of the space of oriented hyperplanes. This lets us recast prescribed-proportion cutting problems as semialgebraic feasibility problems. For fixed ambient dimension, this yields polynomial-time algorithms to decide the existence of cuts, describe the full solution set, and sample or enumerate solutions. We extend this framework to the center transversal theorem, showing that spaces of deep affine flats are semialgebraic, which holds for centerpoints. We further show that the set of centerpoints of a convex polytope coincides with its floating body at level $1/(d+1)$, a useful semialgebraic description.
cs.AI 2026-07-03

Adapted RFM finds refusal subspaces in seconds

by Thomas Winninger

Fast Multi-dimensional Refusal Subspaces via RFM-AGOP

The method works on both reasoning and non-reasoning models and beats alternatives on ablation tests.

Steering and monitoring activations in Large Language Models (LLMs) are increasingly used for both safety and interpretability. Early work assumed behaviours are encoded along single linear directions, but recent findings suggest complex behaviours, such as the refusal to answer harmful queries, live in multi-dimensional subspaces. However, existing methods for extracting these subspaces are computationally expensive, which becomes prohibitive on reasoning models that produce long reasoning traces. By adapting the Recursive Feature Machine (RFM) algorithm -- which can be computed efficiently -- with a probe-informed initialization, we are able to identify the multi-dimensional refusal subspace in seconds, on reasoning (Qwen 3) and non-reasoning (Qwen 2.5) models. While RFM allows for faster subspace identification, it also showed better performance on the ablation task than its alternatives. More work is planned to better understand the relations between subspaces found by different methods. If confirmed, RFM could be a cheap and scalable complement to existing subspace-extraction methods in LLMs.
cs.IT 2026-07-03

Fixed sparse connections cut phase-shifter count by up to 62 percent

by Honghao Wang, Qingqing Wu +4 more

Ultra-Low-Cost Hybrid Beamforming: A New Static-Connection Architecture with Sparse Phase-Shifter Sharing

The architecture keeps beamforming performance close to full-PS sub-connected designs while lowering hardware needs in single- and multi-RF

Hybrid beamforming is a promising solution for high-frequency multi-antenna wireless systems, but its implementation is constrained by the cost and complexity of analog phase-shifter (PS) networks. Although sub-connected architectures simplify the analog network, their conventional realization still requires a dedicated PS for each antenna, causing considerable layout area, wiring, calibration, and control overheads. To address this issue, this paper proposes a novel static-connection architecture with sparse PSs for ultra-low-cost sub-connected hybrid beamforming, where antennas within each sub-array share a PS through an optimized fixed PS-to-antenna connection matrix. The proposed architecture preserves static connections while enabling dynamic beam control via adaptive PS phase-shift adjustments and digital precoding. For the single-radio-frequency (RF)-chain scenario, the sparse-PS connection design is transformed into an antenna-grouping problem, with analytically characterized structural properties and an efficient algorithm. For the multi-RF-chain scenario, we develop a quality-of-service (QoS)-majorization-minimization (MM) algorithm to handle the mixed discrete-continuous optimization problem. Numerical results demonstrate that the proposed architecture reduces the PS count while preserving most beamforming capability of the traditional full-PS sub-connected architecture. In particular, the proposed design achieves PS-count reductions of 37.5% and 62.5% in single-RF-chain and multi-RF-chain systems, respectively, while avoiding deep-null and grating-lobe degradations associated with deterministic connection schemes. These results provide engineering insights into static sparse-PS sharing: the key to hardware-efficient hybrid beamforming is not merely reducing the PS count, but also preserving essential analog-domain degrees of freedom through optimized PS connection topologies.
cs.DC 2026-07-03

Models predict LLM power and latency on new GPUs with 3-14% error

by Mauricio Fadel Argerich, Jonathan Fürst +1 more

WattGPU: Predicting Inference Power and Latency on Unseen GPUs and LLMs

Public metadata alone enables matching models to hardware without profiling and halves to quarters baseline errors in server scenarios.

Large Language Model (LLM) inference workloads are a rapidly growing contributor to data center energy consumption. Optimizing these deployments requires matching specific LLMs to the most efficient GPUs, but operators currently lack the tools to do so without exhaustively profiling each combination. While some predictive models exist, they still require profiling data and struggle to generalize to hardware unseen during training. To address this, we introduce WattGPU, featuring two predictive models for mean GPU power draw and Inter-Token Latency (ITL). Our approach leverages only publicly available LLM metadata and GPU specifications, eliminating the need for hardware access or profiling while enabling generalization to unseen NVIDIA server-grade GPUs and LLMs. We evaluate our models using rigorous leave-one-GPU-out and leave-one-LLM-out cross-validation on a dataset of 42 open-source LLMs (0.1B--27B parameters) and 8 GPUs under both offline and server scenarios. The mean power draw model achieves a median absolute percentage error of $\leq3.4\%$ for offline and $\leq13.5\%$ for server scenarios on unseen GPUs, while the latency model achieves $\leq8.5\%$ in server mode, both maintaining strong GPU ranking correlations for server scenarios (Kendall $\tau\geq0.76$). Compared to standard physically grounded baselines -- Load-Scaled Thermal Design Power (TDP) for power draw and roofline for latency -- our models reduce median absolute percentage error by approximately 4$\times$ on unseen LLM-GPU combinations for server scenarios or approximately 2$\times$ for completely unseen GPUs. WattGPU's data and code are publicly available at https://github.com/maufadel/wattgpu.
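The two evaluation metrics named in this abstract, median absolute percentage error and Kendall $\tau$, can be illustrated with a small standard-library Python sketch. The power figures are invented toy values, the function names are hypothetical, and the tau-a variant (which assumes no ties) is an assumption; the abstract does not state which variant the authors use.

```python
from itertools import combinations
from statistics import median

def median_ape(actual, predicted):
    """Median absolute percentage error, in percent."""
    return 100.0 * median(abs(p - a) / abs(a) for a, p in zip(actual, predicted))

def kendall_tau_a(x, y):
    """Kendall tau-a rank correlation; assumes no tied values."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Invented mean power draw (watts) for four hypothetical GPUs.
measured = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 330.0, 310.0]

error_pct = median_ape(measured, predicted)
tau = kendall_tau_a(measured, predicted)
print(f"median APE = {error_pct:.1f}%, Kendall tau = {tau:.2f}")
```

Reporting both is useful because a model can have a sizeable percentage error yet still rank GPUs in nearly the right order, which is what matters when choosing among hardware options.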
cs.LG 2026-07-03

Modular RL recombines code modules to solve problems sampling misses

by Juliette Decugis, Fabian Gloeckle +3 more

DecompRL: Solving Harder Problems by Learning Modular Code Generation

Decomposing into sub-functions creates up to k^n candidates while cutting GPU token cost by about 50 times.

How can Large Language Models (LLMs) solve problems they currently cannot? Repeated sampling scales test-time compute but GPU cost grows linearly with attempts, while reinforcement learning (RL) with verifiable rewards improves single-attempt accuracy at the expense of sample diversity. Both strategies ultimately fail when the base policy has near-zero probability of producing a correct solution: no amount of sampling or gradient signal can overcome a search space that is simply too large. We take a different approach: rather than sampling harder, we make the task easier by decomposing problems into smaller, independently solvable sub-functions whose implementations can be recombined. Since off-the-shelf models are not trained for this modular generation, we introduce DecompRL, an RL algorithm that explicitly learns to decompose and implement hierarchical code structures. Recombining $k$ implementations of $n$ modules yields up to $k^{n}$ candidate solutions, shifting the bottleneck from GPU inference to cheap CPU evaluation and cutting GPU token cost by $\sim$50$\times$. On LiveCodeBench and CodeContests (Qwen 2.5 7B, Code World Model 32B), DecompRL outperforms standard and diversity-optimized RL baselines beyond $10^5$ tokens per problem, solving problems that standard generation cannot reach.
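The $k^{n}$ count comes from taking one implementation per module in every possible combination. The toy sketch below is only an illustration of that counting argument, not the DecompRL algorithm: the task, the module names `ELEMENTWISE` and `REDUCE`, and the test cases are all hypothetical.

```python
from itertools import product

# Hypothetical toy task: return the sum of absolute values of a list.
# Each of the n = 2 modules has k = 3 sampled implementations,
# and only one implementation per module is correct.
ELEMENTWISE = [lambda x: x, lambda x: -x, abs]
REDUCE = [max, min, sum]

def candidates(*modules):
    """Return every combination of one implementation per module (k**n in total)."""
    return product(*modules)

def passes(elementwise, reduce_fn, tests):
    """Check one recombined program against input/output tests (cheap CPU work)."""
    return all(reduce_fn([elementwise(v) for v in xs]) == want for xs, want in tests)

TESTS = [([-1, 2, -3], 6), ([4, -4], 8)]
all_candidates = list(candidates(ELEMENTWISE, REDUCE))
solutions = [c for c in all_candidates if passes(*c, TESTS)]
print(len(all_candidates), len(solutions))  # 9 candidates from 6 sampled functions, 1 passes
```

Only six functions were "generated" here, yet nine full programs were tested, which is the shift from generation cost to evaluation cost that the abstract describes.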
cs.AI 2026-07-03

Constraints lift coding-agent backdoor recall from 54.5% to 90.9%

by Thomas Winninger

Steerability via constraints: a substrate for scalable oversight of coding agents

Access controls and enforced conventions let a small reviewer model catch most inserted backdoors while cutting token cost.

Coding agents are capable; human oversight is the bottleneck. Unconstrained agents introduce security risks, erode codebase scalability, and make human review increasingly costly. We argue that the methods used for decades to manage large human engineering teams (access control, network policies, and strict coding conventions enforced by tooling) transfer directly to coding agents and are cheaper in tokens than recent agentic scaffolding. We sketch an end-to-end system on this principle and report a controlled experiment in scalable oversight: a small reviewer (Gemma 4 e4b) inspects a Python codebase containing 11 inserted backdoors. Recall rises from 54.5% (unconstrained, no tools) to 90.9% (constrained substrate plus a ~200-LoC `docs` CLI), with substrate and tools contributing independently. We choose Python deliberately: substrate-level oversight gains are largest where the language gives the fewest guarantees by default; the principles extend to languages like Rust.
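With 11 inserted backdoors, the two recall figures correspond to whole-number counts. The counts 6 and 10 below are inferred from the reported percentages rather than stated in the abstract, so treat them as an assumption.

```python
def recall(caught, total):
    """Fraction of inserted backdoors that the reviewer flagged."""
    return caught / total

TOTAL_BACKDOORS = 11  # stated in the abstract

# Caught counts are inferred from the reported percentages, not quoted from the paper.
for label, caught in [("unconstrained, no tools", 6),
                      ("constrained substrate + docs CLI", 10)]:
    print(f"{label}: {caught}/{TOTAL_BACKDOORS} = {recall(caught, TOTAL_BACKDOORS):.1%}")
```

At this sample size each additional backdoor caught moves recall by about nine percentage points.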
cs.IR 2026-07-03

Agentic reranking lifts Earth data search MRR by 28%

by Minghan Yu, Youran Sun +3 more

Bringing Agentic Search to Earth Observation Data Discovery

Zero-shot LLM stage added to neural-BM25 fusion improves retrieval without extra training on NASA EO queries.

NASA and its data centers hold thousands of geoscience datasets and tools like Worldview, Giovanni, the Science Discovery Engine, and Harmony. Finding the right one is hard even for domain experts. We present an agentic search system, deployed as a public service for the geoscience community, that takes a natural-language research query and returns the matching datasets and tools. We demonstrate that, in the era of large language models, the latent value of knowledge graphs (KGs) can be substantially amplified through agentic search. From the NASA Earth Observation Knowledge Graph (NASA EO-KG) we derive NASA-EO-Bench, an open benchmark of 47k query-dataset pairs (21k task-based queries). A neural scorer fine-tuned on NASA-EO-Bench beats cosine and BM25 baselines. Further combining it with BM25 via score fusion raises both Recall@10 (R@10) and MRR by over 5x. On top of this supervised pipeline, we add a zero-shot agentic reranking stage that, without any additional training, lifts MRR by 28% on a stratified N=200 subset, showing that LLM reasoning is complementary to supervised retrieval.
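For readers unfamiliar with the metrics, here is a minimal sketch of MRR, Recall@k, and one simple form of score fusion. The abstract does not specify the fusion rule, so the min-max normalization and equal weighting below are assumptions, and the document ids and scores are hypothetical.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mrr(runs):
    """Mean reciprocal rank over (ranked list, relevant set) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def recall_at_k(ranked, relevant, k=10):
    """Share of relevant items that appear in the top k results."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def min_max(scores):
    """Rescale scores to [0, 1] so different scorers are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def fuse(neural, bm25, weight=0.5):
    """Assumed fusion rule: weighted sum of normalized scores, ties broken by id."""
    n, b = min_max(neural), min_max(bm25)
    fused = {d: weight * n.get(d, 0.0) + (1 - weight) * b.get(d, 0.0)
             for d in set(n) | set(b)}
    return sorted(fused, key=lambda d: (-fused[d], d))

# Hypothetical scores for three datasets; "c" is the relevant one.
neural = {"a": 0.9, "b": 0.2, "c": 0.5}
bm25 = {"a": 1.0, "b": 12.0, "c": 9.0}
neural_only = sorted(neural, key=lambda d: -neural[d])
fused = fuse(neural, bm25)
print(neural_only, reciprocal_rank(neural_only, {"c"}))
print(fused, reciprocal_rank(fused, {"c"}))
```

In this toy case neither scorer ranks "c" first on its own, but it is second-best under both, so the fused ranking promotes it to the top.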
cs.CV 2026-07-03

ViTs gain complexity by specializing layers while keeping tokens linked

by Kaustubh Kapil, Kishor P. Upla

Transformer Geometry Observatory TGO-II: Representational Similarity Observatory

CKA, SVCCA and dimensionality measures show manifold expansion without loss of token interactions across training.

While Vision Transformers have achieved remarkable success across computer vision and language applications, the geometric evolution of their internal representations throughout training remains insufficiently understood. Existing analyses primarily focus on attention mechanisms and downstream performance, leaving the evolution of representation geometry largely unexplored. In this work, we present Transformer Geometry Observatory-II (TGO-II), a representation geometry analysis framework designed to investigate how Transformer representations evolve during supervised training. TGO-II analyzes Vision Transformer (ViT-Small/16) representations using Centered Kernel Alignment (CKA), Singular Vector Canonical Correlation Analysis (SVCCA), Two-Nearest Neighbor Intrinsic Dimensionality (TwoNN-ID), and token covariance analysis. Our experiments reveal three key observations. First, both CKA and SVCCA progressively decrease throughout training, indicating increasing representational specialization across Transformer layers. Second, intrinsic dimensionality consistently increases before stabilizing, suggesting progressive expansion of the representation manifold into a larger set of locally accessible degrees of freedom. Third, token covariance and coupling analyses demonstrate that strong token interaction structure persists throughout training, challenging the hypothesis that increasing representational complexity arises primarily from progressive token independence. These findings suggest that representation complexity and layer specialization emerge simultaneously during training. Manifold expansion appears to occur without token decoupling. Together, these observations motivate a new hypothesis in which Vision Transformers increase representational complexity through progressively richer transformations while preserving strong token interaction structure during learning.
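As background on the first similarity measure, the sketch below computes linear CKA between two activation matrices (rows are examples, columns are features) in plain Python. Whether TGO-II uses the linear or a kernel variant is not stated in the abstract, so the linear form is an assumption, and the small matrices are hypothetical.

```python
import math

def center(m):
    """Subtract each column's mean so features are zero-centered."""
    means = [sum(col) / len(col) for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]

def cross_frob_sq(a, b):
    """Squared Frobenius norm of a^T b for matrices with the same number of rows."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total

def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered inputs."""
    x, y = center(x), center(y)
    return cross_frob_sq(x, y) / math.sqrt(cross_frob_sq(x, x) * cross_frob_sq(y, y))

# Hypothetical activations: 4 examples, 2 features.
layer_a = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
rotated = [[-b, a] for a, b in layer_a]           # 90-degree rotation of the features
other = [[1.0], [0.0], [0.0], [1.0]]              # an unrelated 1-feature representation

print(linear_cka(layer_a, rotated))  # rotation leaves the similarity at its maximum
print(linear_cka(layer_a, other))    # a different representation scores lower
```

Because the score is unchanged by rotating or uniformly rescaling the features, a falling CKA between layers reflects a change in representational structure rather than a change of basis.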