cs.RO — Pith

Top Pith

1

cs.RO 2026-06-29

Physics models cut table tennis ball prediction error by 59%

by Christian Conti (1), Bilan Yang (1) +10 more

Physics Models for Sim-to-Real Transfer in Professional-Level Robot Table Tennis

Aerodynamic, buckling, and residual-contact models enable RL policies that compete against professional players after sim-to-real transfer.

abstract click to expand

At competitive speeds and spins, a table tennis ball follows complex, counterintuitive trajectories that a robot must track and precisely counter within fractions of a second. Training a reinforcement learning policy capable of these skills is prohibitively expensive and dangerous in the real world, making high-fidelity simulation essential. Transferability of such policies, however, critically depends on how faithfully the simulation captures real-world dynamics - a requirement made even more stringent by the adversarial nature of the game, where any modeling inaccuracy becomes an exploitable weakness for the opponent. Prior state-of-the-art in robot table tennis generally focuses on a limited range of velocities and spins and fails to capture the richness of ball behaviors encountered in professional-level play. In this work, we present physics models for aerodynamic ball flight, ball-table contact, and ball-racket contact. that accurately capture the ball behavior over a vast range of speeds and spins relevant to the game. Specifically, we model drag and Magnus force coefficients as functions of Reynolds number and spin ratio in the aerodynamics equations. For the table contact model we model effects of ball buckling on the coefficient of restitution and incorporate residuals into the instantaneous point-contact models. For the racket contact model, we introduce a residual neural network component to complement coefficients related to normal and tangential coefficients of restitution as well as torsional spin damping. Evaluated on an unprecedentedly large dataset of competitive matches (277 games), the proposed models significantly reduces prediction errors (e.g., 59% median landing-position error reduction). The resulting models were used to train the RL policies for the first real-world robot table tennis AI agent capable of competing against professional players.

1 0

Top Pith

5

cond-mat.mes-hall 2026-05-19 2 theorems

AI robotic lab creates graphene and atomically thin transistors

by Lihan Shi, Zhaoyi Joy Zheng +15 more

Qumus: Realization of An Embodied AI Quantum Material Experimentalist

Qumus completes the first autonomous AI fabrication of vdW-stacked nanodevices with error correction in a closed physical loop.

abstract click to expand

While modern Large Language Models (LLMs) and agentic artificial intelligence (AI) have demonstrated transformative capabilities in digital domains, the realization of embodied AI capable of real-world scientific discovery remains a difficult frontier. The advancements are hindered by the inherent complexity of integrating high-level reasoning, multimodal information processing and real-time physical execution. Here we introduce Qumus, the first AI quantum materials experimentalist. Physically embodied within a robotic mini-laboratory, Qumus is an intelligent, multimodal, and multi-agent system designed for the creation and nano-processing of atomically thin two-dimensional (2D) materials and stacked van der Waals (vdW) structures. Qumus autonomously navigates the full scientific cycle, from hypothesis generation and protocol planning to multi-step experimental execution, result analysis and reporting, acting as an experimentalist. Markedly, the system has achieved, for the first time, the AI-creation of graphene, as well as the first AI-fabrication of complex nanodevices including atomically thin field-effect transistors via vdW stacking. Qumus excels at these tasks by demonstrating autonomous error correction and closed-loop experimentation. Our results establish a generalizable framework for self-improving embodied AI systems that learn directly from the quantum world, opening a pathway toward accelerated discovery in quantum materials, electronics and beyond.

0

Top Pith

2

cs.CV 2026-05-15 2 theorems

Diffusion models uncover semantic attacks on vehicle maps

by Chenyi Wang, Ruoyu Song +6 more

Systematic Discovery of Semantic Attacks in Online Map Construction through Conditional Diffusion

Attacks using shadows or wet roads suppress 57% of detections and inject false boundaries where pixel methods fail entirely.

abstract click to expand

Autonomous vehicles depend on online HD map construction to perceive lane boundaries, dividers, and pedestrian crossings -- safety-critical road elements that directly govern motion planning. While existing pixel perturbation attacks can disrupt the mapping, they can be neutralized by standard adversarial defenses. We present MIRAGE, a framework for systematic discovery of semantic attacks that bypass adversarial defenses and degrade mapping predictions by finding plausible environmental variation (e.g. shadows, wet roads). MIRAGE exploits the latent manifold of real-world data learned by diffusion models, and searches for semantically mutated scenes neighboring the ground truth with the same road topology yet mislead the mapping predictions. We evaluate MIRAGE on nuScenes and demonstrate two attacks: (1) boundary removal, suppressing 57.7% of detections and corrupting 96% of planned trajectories; and (2) boundary injection, the only method that successfully injects fictitious boundaries, while pixel PGD and AdvPatch fail entirely. Both attacks remain potent under various adversarial defenses. We use two independent VLM judges to quantify realism, where MIRAGE passes as realistic 80--84% of the time (vs. 97--99% for clean nuScenes), while AdvPatch only 0--9%. Our findings expose a categorical gap in current adversarial defenses: semantic-level perturbations that manifest as legitimate environmental variation are substantially harder to mitigate than pixel-level perturbations.

0

cs.RO 2026-07-03

Tactile deformation prediction lifts contact task success to 71.67%

by Shuai Tian, Yupeng Zheng +8 more

VT-WAM: Visual-Tactile World Action Model for Contact-Rich Manipulation

A flow-matching model with asymmetric attention and contact gating outperforms prior visual-tactile methods across six real manipulation tas

abstract click to expand

Contact-rich manipulation requires policies to react to local deformation, pressure, slip, and friction, yet these cues are temporally sparse and often invisible in visual observations. Existing visual-tactile policies usually feed tactile observations directly into action prediction, but rarely model tactile deformation dynamics during action generation. In this paper, we introduce VT-WAM, a Visual-Tactile World Action Model that jointly learns future visual prediction, tactile deformation prediction, and action prediction within a unified flow matching framework. In particular, VT-WAM introduces (1) Asymmetric Mixture-of-Transformers (MoT) attention to bridge a first-frame visual anchor with temporal tactile dynamics, and (2) contact-gated Action-Visual-Tactile Attention Guidance (AVTAG) to encourage action queries to rely on tactile evidence during contact phases. Across six real-world contact-rich manipulation tasks, VT-WAM achieves a 71.67% average success rate, outperforming Fast-WAM by 26.67% and OmniVTLA by 35.84%. Ablations demonstrate that modeling tactile deformation dynamics and guiding contact-phase tactile attention are both important for contact-rich tasks. Project website: https://vt-wam.github.io/.

0

cs.RO 2026-07-03

C++ runtime runs embodied models on any robot hardware

by Ling Xu, Chuyu Han +7 more

Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots

Five-layer abstraction replaces per-model Python stacks while preserving 91-100 percent task success and cutting memory use by more than two

abstract click to expand

Embodied AI models now span vision-language-action (VLA) models and world-action models (WAMs), but practical deployment remains fragmented across model-specific Python stacks, backend assumptions, and robot-side glue code, especially on heterogeneous edge devices. Existing inference runtimes are designed mainly for request-response serving and therefore do not satisfy the runtime contract of embodied deployment: multi-rate execution inside closed-loop control, latency-first batch-1 inference on heterogeneous hardware, and extensible embodied interfaces beyond fixed token I/O. We present Embodied.cpp, a portable C++ inference runtime for embodied models. Based on an architectural analysis of representative VLA models and WAMs, Embodied.cpp captures a shared execution path and organizes it into five layers: input adapters, sequence builders, backbone execution, head plugins, and deployment adapters. The runtime provides modular multi-rate execution, latency-first fused inference, and extensible operator and I/O support, enabling deployment across heterogeneous devices, robots, and simulators through one backend abstraction. We evaluate Embodied.cpp on two VLA models, HY-VLA and pi0.5, and on a preliminary WAM benchmark using a LingBot-VA Transformer block. The VLA deployments achieve successful closed-loop execution with 100.0% and 91.0% task success rates, respectively. The WAM benchmark reduces block memory from 312.2 MiB to 88.1 MiB. These results show that Embodied.cpp improves deployment efficiency while preserving high accuracy across diverse embodied model architectures.

0

cs.RO 2026-07-03

Behavior latents add independent speed and safety control to traffic sims

by Juanwu Lu, Junyu Zhu +1 more

Controllable Sim Agents with Behavior Latents

CNeVA matches top imitation models on Waymo data while enabling monotone per-channel steering without reward hacks.

abstract click to expand

Realistic traffic simulation requires agents that imitate logged behavior and can also be steered along interpretable axes. Such controllability enables engineers to isolate variables, reproduce specific edge cases, and test autonomous systems without real-world risk. We introduce Controllable Neural Variational Agents (CNeVA), a controllable simulated-agent framework that learns to infer a per-agent Gaussian behavior latent from per-channel discounted returns via a closed-form conjugate variational update, conditioning a rectified-flow trajectory generator trained on a mixed channel-mask curriculum for classifier-free guidance. To tackle scarcity in reward signals, we propose soft eligibility gates that replace hard binary thresholds with smooth exponential decay, preserving the gradient signal for near-threshold agents. On the Waymo Open Motion Dataset, CNeVA attains competitive realism on the benchmark while exposing per-channel controllability that the higher-ranked imitation models lack. Speed- and acceleration-based steering produces monotone responses without stall-induced reward hacking. Safety controllability is monotone and substantial with the introduction of soft eligibility. We manage to achieve steerable map compliance under a context-residual return measure. Furthermore, our experiment demonstrates that steering metrics must be read alongside physical-plausibility guardrails to avoid reward-hacking confounds.

0

cs.RO 2026-07-03

QuadRocket achieves almost global trajectory tracking with adaptive control

by Pedro Santos, Joel Reis +2 more

QuadRocket: An Aerial Robotic Testbed for Adaptive Thrust-Vector Control of Rocket-Like Vehicles

The quadrotor-based rocket prototype models the vehicle as an axisymmetric body to enable disturbance rejection in thrust-vector control.

abstract click to expand

This paper presents QuadRocket, a quadrotor-based rocket prototype that provides a low-cost, low-risk platform for validating advanced thrust-vector control strategies for launch vehicle-type systems. The prototype consists of a cylindrical main body mounted on top of a quadrotor through a universal joint, forming a flying inverted pendulum with non-negligible inertia. For control design, the coupled system is modeled as a single axisymmetric rigid body actuated by a vectored force applied along its longitudinal axis. A reduced-attitude representation on the two sphere is adopted to explicitly exploit the vehicle's axial symmetry and to decouple yaw from the thrust-vector direction. On this model, we derive an adaptive backstepping controller that achieves almost global trajectory tracking in the presence of unknown constant disturbances, while a control-point transformation mitigates non minimum-phase behavior. The quadrotor is then treated as a thrust vector actuator, and a dynamic-surface-based attitude controller is designed to track the desired thrust-vector, accounting for actuation dynamics and avoiding explicit differentiation of virtual control signals. The complete architecture is evaluated in simulation and validated experimentally in an indoor motion-capture arena. Results demonstrate accurate trajectory tracking, effective disturbance compensation, and confirm the suitability of the QuadRocket as a versatile testbed for thrust-vector-controlled robotic vehicles.

0

cs.RO 2026-07-03

Quadrotor learns to intercept using only camera direction vectors

by Michael Anoruo, Xiaoyu Tian +6 more

Learning Agile Intruder Interception using Differentiable Quadrotor Dynamics

Differentiable dynamics replace point-mass models and raise success rate by 30 percent at speeds up to 10 m/s without needing distance or po

abstract click to expand

This paper presents a methodology for learning a control policy to intercept an intruder using the 3D direction unit vector to the intruder and the interceptor state. Prior deep reinforcement learning approaches assume either relative position or distance to the intruder is available, but this information is not readily accessible in real-world applications that employ passive, monocular camera sensors. Instead, we propose a solution that leverages an analytical policy gradient method using differentiable quadrotor dynamics to learn agile interception at speeds up to 10 m/s. The proposed approach outperforms baseline methods that utilize simplified point mass dynamics by an average of 30%.

0

cs.RO 2026-07-03

Unlabeled robot play pretraining matches million-expert VLA models

by Junhao Shi, Siyin Wang +4 more

Learning to Move Before Learning to Do: Task-Agnostic pretraining for VLAs

By learning movement skills first from cheap interactions then aligning to language with minimal labels, the method cuts labeled data needs

abstract click to expand

Vision-Language-Action (VLA) models are fundamentally bottlenecked by the scarcity of expert demonstrations -- triplets of observations, instructions, and actions that are costly to collect at scale. We argue that this bottleneck stems from conflating two distinct learning objectives: acquiring physical competence (how to move) and acquiring semantic alignment (what to do). Crucially, only the latter requires language supervision. Building on this Decomposition Hypothesis, we propose Task-Agnostic Pretraining (TAP), a two-stage framework that first learns transferable motor priors from cheap, unlabeled interaction data -- including discarded off-task trajectories and autonomous robot play -- via a self-supervised Inverse Dynamics objective. A lightweight second stage then grounds these priors in language using minimal expert data. On the SIMPLER benchmark, TAP matches models trained on over 1M expert trajectories while using orders of magnitude less labeled data, yielding a 10% absolute gain over standard behavior cloning. On a real-world WidowX platform, TAP retains 25% success under camera perturbations where internet-scale baselines collapse to 0%, demonstrating that task-agnostic pretraining produces robust, transferable physical representations and offers a scalable path forward for Embodied AI.

0

cs.RO 2026-07-03

Robot RL success rises 28% with closed world-model loop

by Yuquan Xue, Le Xu +6 more

WorldSample: Closed-loop Real-robot RL with World Modelling

Post-trained world model supplies synthetic transitions that Policy-Paced Learning uses to cut training steps by 59% on contact-rich tasks.

abstract click to expand

Reinforcement learning (RL) can overcome the demonstration-coverage limitation of imitation learning (IL) by allowing robots to improve through trial-and-error interaction beyond the states observed in demonstrations. However, deploying RL on real robots remains constrained by high interaction costs, since each physical rollout is costly and reflects only one realized action-outcome path. To address this challenge, we propose WorldSample, a physically grounded data augmentation framework for real-robot RL that closes a real-synthetic loop between physical rollouts, world-model generation, and policy improvement. Grounded on real rollouts, WorldSample generates high-fidelity synthetic transitions through a post-trained world model, which greatly lowers the visual hallucination. Specifically, rather than simply using these transitions as real-world experience, WorldSample introduces Policy-Paced Learning (PPL) to regulate the training process through sample selection and scheduling, balancing useful augmentation against value overestimation and mitigating the hallucination-induced noise. Experiments on robot manipulation tasks involving contact-rich and precise tasks show that WorldSample improves policy success rate by 28% while reducing training steps by 59% compared with baselines. Furthermore, WorldSample improves world model visual fidelity by 19.4dB in PSNR and 0.47 in SSIM over demonstration-only post-training, validating the effectiveness of the real-synthetic loop for both policy and world model performance.

0

cs.RO 2026-07-03

Model learns intent-driven camera poses from passive video

by Boyang Sun, Jiajie Li +7 more

LIME: Learning Intent-aware Camera Motion from Egocentric Video

LIME mines language intents and view gains from egocentric recordings to train robots on choosing next viewpoints.

abstract click to expand

Autonomous robots often need to move their camera before they can act: to inspect an object, reveal an occluded region, or obtain a view that responds to a user's intent. While vision-language navigation translates instructions to base motion and vision-language-action policies map instructions to manipulation actions, language-conditioned camera motion remains comparatively underexplored as a first-class action. We formulate language-conditioned camera motion generation: given a current RGB observation and a free-form natural-language intent, predict a relative target camera pose for the next observation. This task is inherently non-trivial: viewpoint changes are driven by latent perceptual intentions, and a valid motion may operate at different semantic granularity, from entering a room to looking around a corner, inspecting a visible object, or revealing an occluded detail. To model this structure, we mine multi-intention camera-motion supervision from egocentric video, pairing plausible intents and observation-gain descriptions with relative SE(3) target poses. We propose LIME, a vision-language camera-motion generator that combines an auto-regressive observation-gain output with a continuous flow-matching pose head. This design lets the model jointly predict what the next view should reveal while representing multi-hypothesis target views. Across experiments and downstream robotic tasks, we show that LIME can learn to actively choose camera poses from passive human video, turning ordinary egocentric recordings into supervision for intent-aware active perception.

0

cs.RO 2026-07-03

Inverse-dynamics check cuts planning compute in world models

by Gawon Seo, Dongwon Kim +1 more

ACID: Action Consistency via Inverse Dynamics for Planning with World Models

ACID penalizes predicted transitions that cannot recover the conditioned action, matching baseline accuracy with far fewer evaluations acros

abstract click to expand

Decision-time planning with action-conditioned world models has become a popular paradigm for embodied control. However, the standard planning cost judges a candidate solely by how close its predicted terminal state lies to the goal, leaving the realizability of the intermediate transitions unchecked -- a predicted trajectory can look convincing while the environment rollout drifts away from it. In this paper, we propose ACID, a decision-time planning framework that introduces cycle action consistency: the action inferred backward from a predicted transition by an inverse dynamics model should recover the one that was conditioned on. We fold this per-step residual into the planning cost via a scale-invariant adaptive weight. Across four action-conditioned world models and six tasks spanning rigid and deformable manipulation, articulated control, and visual navigation, ACID consistently improves planning and matches the baseline's accuracy with substantially less planning compute.

0

cs.RO 2026-07-03

Humanoid carries 24kg while tracking VR commands

by Chenxin Liu, Qingzhou Lu +5 more

HEFT: Heavy-Payload Full-size Humanoid Teleoperation with Privileged Motion Guidance and Windowed Payload Curriculum

Privileged guidance cleans noisy tracker data and a payload curriculum builds stability step by step on a full-size robot.

abstract click to expand

General motion tracking and teleoperation offer a promising path to scalable humanoid skill acquisition, yet most existing frameworks are validated on compact platforms or without real payload interaction, leaving full-size humanoids with real payloads largely unexplored. Scaling to full-size humanoids introduces two compounding challenges: their larger inertia and tighter balance margins make tracking highly sensitive to noise, drift, and retargeting errors from commodity VR trackers, while their payload potential remains largely underutilized. We present HEFT, a heavy-payload full-size humanoid teleoperation framework that addresses both challenges. HEFT learns from deployable noisy VR references with physically plausible reconstructed references through Privileged Motion Guidance (PMG), and uses a Windowed Payload Curriculum (WPC) with expert-guided payload caps to acquire robust heavy-payload tracking. We deploy HEFT on L7, a 175cm, 65kg humanoid. The robot tracks motions including turns, forward/backward locomotion, and squats under payloads up to 24kg.

0

cs.RO 2026-07-03

Hybrid moving-static camera data cuts VLA shortcut learning

by Jincheng Tang, Yilong Zhu +3 more

The Moving Eye: Enhancing VLA Spatial Generalization via Hybrid Dynamic Data Collection

Continuous motion plus diverse static views enables generalization to new poses where extra fixed viewpoints alone fail, across ACT, Diffusi

abstract click to expand

Vision-Language-Action (VLA) models have shown remarkable promise in generalized robotic manipulation. However, their spatial generalization remains fragile. We argue that simply increasing the number of viewpoints is insufficient. Models often fall into the trap of Shortcut Learning, latching onto spurious correlations (e.g., fixed relative poses between objects or between the camera and robot base) rather than learning true spatial relationships. In this work, we propose a data-centric solution to enhance VLA spatial generalization. We utilize a dual-arm setup where one arm performs manipulation while the other serves as a mobile environmental camera. We systematically evaluate three data distribution patterns: Fixed, Multi-Fixed, and Moving Views. Our findings reveal that a hybrid strategy, combining continuous camera motion with diverse static viewpoints, yields the best performance by substantially reducing spurious correlations while maintaining training stability. Our experiments demonstrate that this strategy mitigates spurious correlations, enabling VLAs to generalize to unseen camera poses and object configurations where simply adding more static viewpoints fails. Crucially, we reveal that the susceptibility to shortcut learning and the struggle with spatial generalization are universal characteristics shared across diverse architectures. Consequently, all evaluated models (ACT, Diffusion, and VLA models including Pi0 and Gr00t) benefit significantly from our mixed data strategy.

0

cs.RO 2026-07-03

Low-cost drone runs real-time face tracking and line following

by Andrei-Marian Ungureanu, Stelian Spînu

Real-Time Visual Intelligence on Low-Cost UAVs: A Modular Approach for Tracking, Scanning, and Navigation

Modular AI on DJI Tello combines detection, recognition and depth estimation for tracking, scanning and navigation tasks.

abstract click to expand

Autonomous drones are rapidly transforming modern warfare and civil applications alike. This paper presents the development of an integrated intelligent drone system designed to serve as a personal assistant. Leveraging the DJI Tello drone platform, we implemented a modular architecture that integrates three core artificial intelligence functionalities: facial detection, facial recognition, and depth estimation from monocular vision. A web-based interface enables seamless drone control and real-time video monitoring, while a Python-based server processes visual data and executes inference pipelines using lightweight neural models optimized for embedded systems. Unlike existing commercial solutions, this system emphasizes accessibility, low-cost hardware, and open-source technologies. The system demonstrates robust performance in real-world conditions, including person tracking, indoor scanning, and autonomous line following using virtual sensors. This project validates the applicability of advanced AI techniques in real-time robotic systems and illustrates the feasibility of deploying them on constrained hardware, providing a foundation for future research in autonomous UAVs for military, rescue, and surveillance missions.

0

cs.RO 2026-07-03

Neuro-symbolic UAV succeeds in 61 of 72 landing tests

by Weixian Qian, Tianyi Yang +6 more

NEUROSYMLAND: Neuro-Symbolic Landing-Site Assessment for Robust and Edge-Deployable UAV Autonomy

Probabilistic scene graph plus symbolic rules beat vision baselines while fitting edge hardware limits.

abstract click to expand

Safe landing-site assessment in unstructured environments remains a key challenge for autonomous UAV deployment, as vision-only learning approaches often degrade under terrain variability and provide limited transparency in safety decisions. We present NEUROSYMLAND, a neuro-symbolic landing-site assessment system that integrates lightweight perception with explicit safety reasoning. The framework constructs a probabilistic semantic scene graph from onboard visual input and evaluates candidate landing regions using symbolic constraints capturing terrain flatness, obstacle clearance, and spatial consistency, enabling structured reasoning under perceptual uncertainty while maintaining edge-feasible execution. Across 72 simulated landing scenarios spanning diverse terrains, NEUROSYMLAND achieves 61 successful assessments, outperforming four competitive baselines (37-57 successes). To evaluate deployability, we further conduct 100 hardware-in-the-loop trials with randomized initial poses, profiling end-to-end latency, stage-wise execution time, and system-level metrics including CPU/GPU utilization, memory footprint, and power consumption. Results demonstrate improved robustness and interpretability with bounded edge-resource usage. Profiling shows that symbolic reasoning contributes only a small fraction of end-to-end latency, while the main computational cost arises from perception and PSSG construction. These results demonstrate the feasibility of deploying the landing-site assessment stack on edge-constrained UAV hardware, and all source code, datasets, prompts, and symbolic rule refinement examples are released in an open-source repository

0

cs.RO 2026-07-03

Flow fields outperform action baselines in continuous robot navigation

by Haokun Liu, Zhaoqi Ma +6 more

CoFL-S: Spatially Queryable Sector Flow Fields for Local Language-Conditioned Navigation

Sector flow fields conditioned on local instructions generate trajectories that beat token and chunk methods across frequencies in Habitat a

abstract click to expand

Vision-Language Navigation has increasingly emphasized high-level instruction reasoning, memory, global map construction, and instruction decomposition, while the low-level action representation remains comparatively underexplored. We propose CoFL-S, a low-level vision-language-action framework that predicts a language-conditioned flow field over the robot's local visible sector and generates continuous trajectories by rolling out the predicted field. To train this low-level representation, we convert each VLN-CE episode, originally a whole-episode instruction paired with an action sequence, into frame-level local supervision with aligned sub-instructions and matched action, trajectory, and dense flow-field targets. For evaluation, we introduce a continuous-time Habitat benchmark that isolates low-level action interfaces from instruction decomposition and executes all methods through a shared velocity-command controller, enabling decomposition-independent closed-loop comparison across different planner frequencies rather than fixed discrete forward-and-turn transitions in VLN-CE. Under matched encoders and training settings, CoFL-S consistently outperforms action-token and action-chunk baselines across planner frequencies in the continuous-time Habitat benchmark, and zero-shot real-world closed-loop deployment further shows its advantage over both baselines beyond simulation.

0

cs.RO 2026-07-03

Shaping real actuators to sim model enables zero-shot policy transfer

by Satoshi Yamamori, Koji Ishihara +5 more

Actuator Reality Shaping for Zero-Shot Sim-to-Real Robot Learning

Policies trained only with idealized reference dynamics deploy directly on hardware without fine-tuning or actuator models.

abstract click to expand

Sim-to-real transfer in robot learning is often limited by discrepancies between the ideal actuator dynamics assumed during policy training and the nonlinear, hardware-dependent behavior of physical motors. While conventional approaches attempt to bridge this gap by increasing simulator fidelity through system identification, domain randomization, or learned actuator models, we introduce an alternative paradigm: actuator reality shaping. Instead of modifying the simulator to match the real world, our method shapes the closed-loop behavior of physical actuators to match the idealized second-order reference dynamics used in simulation. By equipping each joint with a two-degree-of-freedom feedforward--feedback controller, we decouple reference-response shaping from robust stabilization, thereby providing a standardized actuator interface for reinforcement learning policies. As a result, policies trained only with the prescribed reference model can be deployed zero-shot on real hardware without task-level fine-tuning or learned actuator models. We validate the approach on a single-joint high-gear-ratio servo under external loads and a 7-DOF robotic arm reaching task, where actuator reality shaping substantially reduces sim-to-real tracking error and improves zero-shot task performance compared with standard servo-control and representative real-to-sim-to-real baselines. We further demonstrate zero-shot transfer on a wheeled-legged robot driving over a slope and a humanoid robot walking, suggesting that actuator reality shaping can serve as a reusable interface for robot learning across diverse hardware platforms.

0

cs.RO 2026-07-03

Change priors replace full future images for robot actions

by Yongjie Bai, Hanting Wang +4 more

Bridge-WA: Predicting Where and How the World Changes for Robotic Action

Distilling where and how scenes will change lets policies ignore backgrounds and lighting while raising success under new visuals.

abstract click to expand

General-purpose vision-language-action models benefit from large vision-language priors, but effective manipulation also requires anticipating action-relevant scene changes. Existing world-action models often rely on large generative world models or dense future rollouts, which are expensive and spend capacity on visual details weakly coupled to control. We present Bridge-WA, a lightweight world-action framework that distills a frozen future-change teacher into three compact priors: future tokens for intended outcomes, change maps for intervention support, and motion-flow maps for local transition direction. A WorldBridge conditions the action transformer on these priors through multi-source attention memories and spatial-temporal biases, while the teacher model is removed at inference. Across VLABench, RoboTwin2.0, LIBERO-Plus and real-robot evaluations, Bridge-WA improves task success, progress, and robustness, with particularly clear gains under out-of-distribution visual shifts. By focusing action generation on where and how the scene will change, Bridge-WA suppresses nuisance appearance factors such as background, lighting, and distractors, leading to better generalization without deployment-time dense future-image generation. Code and visualizations are available at: https://hcplab-sysu.github.io/BRIDGE-WA .

0

cs.RO 2026-07-03

Browser studio turns robot boats into music choreographers

by Aswin Ramachandran, Christopher Golling +5 more

Choreographing the Way of Water: A Computational Framework for Aquatic Robotic Art

Timeline tool lets artists direct fleets on water without writing code for each move

abstract click to expand

Robotic choreography in open water is governed by nonlinear fluid dynamics, which impose significant challenges due to environmental disturbances and nonlinear system dynamics. This paper presents the cyber-physical architecture of Way of Water, a vertically integrated framework that orchestrates a fleet of autonomous surface vessels as a distributed choreographic platform. Moving beyond the surface-pixel paradigm, these vessels use laminar nozzles and multi-zone lighting to extend their expressive range from the 2D water plane into the 3D volumetric domain. Our primary contribution is the Way of Water Studio, a browser-based, timeline-compositing authoring paradigm that treats the fleet as a DAW-like instrument for music-responsive choreography. The Studio encapsulates Sequential Convex Programming for trajectory generation and Model Predictive Control for disturbance rejection presented through a visual timeline, broadening access to high-performance aquatic robotics for non-programmer artists. Grounding the Studio is the full cyber-physical stack: a custom holonomic chassis, a state-estimation and control stack tuned for the aquatic domain, and an LTE/MQTT fleet link with RTK-GPS time synchronization. We report on the system's validation across two distinct deployments: an 18-vessel Swan Lake interpretation at Lake Zurich and an 8-vessel Time Space Existence 2025 Venice Biennale demonstration at Forte Marghera, establishing a foundational reference for the design and deployment of fluidic robotic swarms.

0

eess.SY 2026-07-03

RBF activation function shapes robotic tracking performance

by Kimmo Paldanius, Gabriel da Silva Lima +1 more

Influence of Radial Basis Activation Functions on Intelligent Controller for Robotic Manipulators

Experiments on a manipulator show stability for all kernels but clear differences in adaptation and accuracy

abstract click to expand

This paper presents an intelligent control framework for trajectory tracking of robotic manipulators using radial basis function (RBF) neural networks for online disturbance estimation. The proposed control structure combines model-based nonlinear control with an adaptive neural approximator that compensates for parametric uncertainties, friction, and unmodeled dynamics. A Lyapunov-based adaptation law with projection guarantees boundedness of the closed-loop signals and convergence of the tracking error to a compact region. The primary objective of this work is to investigate how the choice of activation function within the RBF network influences transient behavior, steady-state accuracy, and control smoothness. The controller is implemented on a robotic manipulator. Experimental results demonstrate that although stability is preserved for all kernels, activation function selection significantly affects adaptation dynamics and practical tracking performance. These findings demonstrate that activation function selection acts as a structural design parameter in intelligent control, directly shaping adaptation dynamics and practical closed-loop performance.

0

cs.RO 2026-07-03

Critic guidance raises frozen VLA success from 68% to 82%

by Liuhaichen Yang, Zhuang Jiang +2 more

Guided Action Flow: Q-Guided Inference for Flow-Matching Vision-Language-Action Policies

Action gradients from a rollout-trained critic steer the sampler of an unchanged flow-matching policy on LIBERO tasks.

abstract click to expand

Flow-matching vision-language-action policies generate robot action chunks through an iterative transport process, creating an opportunity for test-time guidance without retraining the base policy. We study this opportunity in Guided Action Flow, an inference-time framework that keeps a pretrained SmolVLA policy frozen and uses a learned action-chunk critic to guide its reverse-time flow sampler. The critic is trained from real success and failure rollouts, can condition on task-description features from the frozen SmolVLA language pathway, and is used only through action gradients during sampling. We evaluate the approach on LIBERO manipulation tasks. A single-task critic improves success from 68.0% to 82.0% on one seed window and from 82.0% to 86.0% on another. A multi-family task-description critic improves validation success from 46.0% to 56.0%, while the locked held-out test gain is positive but modest, from 65.0% to 67.5%. These results support the feasibility of Q-guided inference for frozen flow-matching VLA policies, while showing that critic generalization and uncertainty-aware guidance remain the central bottlenecks.

0

cs.RO 2026-07-03

One RL policy tracks trajectories on two different boats after simulation training

by Ruiheng Jiang, Thomas Bi +2 more

Cross-Platform Control for Autonomous Surface Vehicles via Adaptive Reinforcement Learning

Adaptive method with latent dynamics inference cuts position error up to 58% versus non-adaptive baselines on real platforms.

abstract click to expand

Autonomous surface vehicles vary widely in hydrodynamic and actuation characteristics, yet most controllers are designed for single-platform deployment. We present an adaptive reinforcement learning approach for trajectory tracking that enables zero-shot cross-platform deployment using a single policy. Since the deployment platform's dynamics are unknown to the policy, we address cross-platform generalization with the standard partial-observability approach of conditioning on interaction history, employing a teacher-student architecture in which a learned module infers a latent representation of the platform dynamics. The policy is trained in simulation under randomized vessel dynamics and is deployed zero-shot to two real-world platforms without any fine-tuning, despite relying on a simple analytical dynamics model rather than a high-fidelity hydrodynamic simulator. In real-world experiments on two different platforms, the adaptive policy outperforms non-adaptive learning-based baselines by up to 58% in position mean absolute error while approaching the tracking accuracy of a platform-specific tuned controller.

0

cs.RO 2026-07-03

Cross disparity flags moving points for accurate SLAM

by Sujan Kumar Dhali, Bhaskar DasGupta

A Stereo Visual SLAM System Using Object-Level Motion Estimation and Geometric Filtering Based on Cross Disparity

OCD SLAM adds geometric filtering and object tracking to ORB-SLAM2, cutting trajectory error on KITTI sequences with traffic.

abstract click to expand

This paper presents OCD SLAM, a dynamic stereo visual SLAM framework that extends ORB-SLAM2 by jointly addressing dynamic objects and dynamic features in the scene. Usual visual SLAM systems operating in dynamic environments often fail in the presence of moving objects, due to the static-world assumption used in pose estimation and mapping. To address this predicament, we introduce a novel geometric approach based on the discrepancy between disparity and a newly proposed notion called ``cross disparity'', which exploits both temporal and stereo inconsistency to identify dynamic feature points. Complementary to this feature-level motion analysis, OCD SLAM integrates a 3D object detection module (SMOKE) with Kalman filter-based object tracking to perform object-level motion classification, enabling robust separation of static and dynamic scene elements for accurate pose estimation. The proposed approach has been evaluated on various sequences from the KITTI Odometry and KITTI Raw datasets. Results demonstrate that OCD SLAM achieves significant improvement in trajectory accuracy compared to ORB-SLAM2 and several state-of-the-art dynamic SLAM methods. Ablation studies further demonstrate the effectiveness of the cross disparity module in the KITTI Raw dataset and show that this method is able to detect dynamic features that are missed by the 3D object detection scheme alone.

0

cs.AI 2026-07-03

Consolidation updates knowledge while identity hash stays byte-equal

by Xue Qin, Simin Luan +2 more

Episodic-to-Semantic Consolidation Without Identity Drift

Semantic layer is excluded from the manifest, so certified identity remains unchanged across consolidation steps.

abstract click to expand

Long-running adaptive intelligent agents face a structural tension between knowledge consolidation and information integrity. Memory consolidation is conventionally treated as an agent-changing operation: a model is fine-tuned, a prompt rewritten, a policy distilled, or a reflection appended to the context that governs future behaviour. In regulated autonomic deployment this is a liability because the agent operates under commitments and audit contracts that bind to a specific, cryptographically certified identity. We propose to treat consolidation not as a mutation of the planner or the identity manifest, but as a deterministic function f: M^ep -> M^sem over episodic memory whose output is a separately addressable semantic knowledge layer; the identity hash does not read M^sem, so consolidation updates knowledge without changing the agent's certified identity. We give a formal account of the agent representation, prove identity invariance through a structural lemma on the manifest's hash-input set, specify a deterministic aggregation algorithm whose outputs are auditable database rows with explicit confidence and supporting-event provenance, and validate the construction with synthetic experiments demonstrating per-field correctness, byte-equal identity across consolidation passes, and a mean 79.82% reduction in unproductive planner attempts (95% BCa CI [78.02%, 81.49%] across 10 seeds) against a calibrated Bayesian-shrunk baseline. The construction is a knowledge-update discipline for autonomic agents in which lessons accumulate as queryable facts while the agent's certified identity remains byte-equal across its operational lifetime, with an embodied service agent as the running case study.

0

cs.CV 2026-07-03

Manifold projections extract novel views from any video model

by Jinxi Li, Tianyi Zhang +5 more

NeoMap: Training-free Novel-View Synthesis from Single Images and Videos

Training-free iterations locate consistent viewpoints already present in the data manifold of pre-trained generators.

abstract click to expand

We study the challenging problem of novel view video synthesis from single images or monocular videos. Existing methods, which operate under the assumption that pre-trained video models lack native novel view synthesis capability and enforce view alignment via camera conditioning, task-specific fine-tuning, or stepwise hard denoising guidance, often suffer from artifacts and compromised global scene consistency. In this paper, we introduce NeoMap, a novel training-free framework designed to locate high-fidelity, view-consistent novel view solutions from general pre-trained video models. The key to our approach is the core insight that promising novel view solutions are inherently encoded within the natural video data manifold learned by pre-trained models, and the core challenge is simply to locate this optimal solution. We solve this via our core mechanism: convergent manifold alternating projection iterations that optimize the initial noise. Extensive experiments demonstrate that NeoMap significantly outperforms all existing methods across 3 standard novel view synthesis benchmarks, including the challenging Tanks-and-Temples, LLFF and DAVIS datasets, achieving state-of-the-art generation fidelity and top-tier view consistency.

0

cs.RO 2026-07-03

Divergence-free 3D Gaussian field improves robot dynamic manipulation

by Peng Yun, Shouwang Huang +4 more

PhysMani: Physics-principled 3D World Model for Dynamic Object Manipulation

Online optimization of the velocity field supplies physically grounded futures to a cross-attention policy on a new 16-task benchmark.

abstract click to expand

Manipulating fast and dynamically moving targets in unstructured 3D environments remains challenging for embodied AI. Existing visual-language-action models and world models struggle with accurate 3D geometry and physically meaningful forecasting. We propose PhysMani, a framework that couples a physics-principled 3D Gaussian world model with a future-aware action policy model. The world model learns a divergence-free Gaussian velocity field via online optimization for fast and physically grounded future dynamics prediction. The policy model integrates the predicted 3D scene future dynamics through a learnable token based cross-attention module. We introduce PhysMani-Bench, a dynamic manipulation benchmark with 16 tasks, and demonstrate a superior success rate over strong baselines in both simulation and real-world robot experiments.

0

cs.RO 2026-07-03

Social preference data replaces manual rewards for crowd robot navigation

by Zixuan Chen, Hao Fu +5 more

SPLC: Social Preference Learning for Crowd Robot Navigation

Automatic preference generation from pedestrian dynamics improves offline RL compliance without hand-crafted reward functions.

abstract click to expand

Offline reinforcement learning (RL) holds significant potential for crowd robot navigation in human-robot coexistence applications. However, the inherent complexity of pedestrian motion renders the design of effective reward functions for promoting socially compliant robot behaviors a persistent challenge. This paper proposes a Social Preference Learning for Crowd Robot Navigation (SPLC) algorithm to eliminate the need for detailed reward design. Its core innovation lies in the introduction of a social preference feedback mechanism to automatically generate preference data through principled preference evaluation criteria. By explicitly accounting for the intricacies of pedestrian dynamics, the pipeline mitigates the reward bias and facilitates the systematic quantification of broad social norms, thereby fostering socially compliant behaviors. Extensive experiments integrating SPLC with offline RL methods demonstrate consistent improvements over state-of-the-art baselines across standard performance metrics. Furthermore, real-world experiments on the TurtleBot4 further validate the effectiveness of SPLC in practical human-robot coexistence settings. Our code and video demos are available at https://github.com/sklus949/SPLC.

0

cs.CV 2026-07-03

Staged scattering model narrows underwater vision gap

by Seunghee Yun, Geonmo Yang +4 more

Robust Image Processing Techniques for Construction Environment Monitoring Using Underwater Robots

Retraining an existing network on synthetic data with depth-aware blur and real marine snow patterns raises quality scores on coastal datase

abstract click to expand

This paper proposes a robust image processing framework for underwater robot-based construction environment monitoring, targeting complex degradations observed in real marine environments. Unlike conventional approaches that mainly consider absorption and backscattering, real underwater imagery is strongly affected by depth-dependent forward scattering blur and particle-induced degradations such as marine snow. To address this, we introduce a staged processing pipeline that sequentially models background degradation via depth-aware forward scattering and foreground degradation using realistic marine snow patterns extracted from real images. The resulting synthetic data are used to retrain an existing Joint-ID network without modifying its architecture, enabling an isolated evaluation of dataset realism. In addition, a lightweight post-processing scheme is applied to enhance contrast and structural clarity. Experiments on real underwater datasets collected in Korean coastal environments demonstrate consistent improvements in visual quality and UIQM scores. The results indicate that explicitly modeling forward scattering and realistic particle effects effectively reduces the synthetic-to-real gap and improves practical applicability in real-world underwater robotic operations.

0

cs.RO 2026-07-03

Object-level probabilities prune dynamic Gaussians for cleaner SLAM

by Ziheng Xu, Qingfeng Li +3 more

DL-SLAM: Enabling High-Fidelity Gaussian Splatting SLAM in Dynamic Environments based on Dual-Level Probability

Dual-level fusion of semantic and geometric cues improves tracking by up to 13% while keeping maps free of motion artifacts.

abstract click to expand

Recent advances in 3D Gaussian Splatting (3DGS) have enabled significant progress in dense dynamic Simultaneous Localization And Mapping (SLAM). Prevailing methods typically discard predefined dynamic objects, ignoring that transiently static objects offer valuable geometric constraints for pose estimation. A recent work attempts to leverage this potential by employing per-pixel uncertainty maps to quantify the magnitude of motion. While this approach enables transiently static objects to enhance pose estimation, it erroneously integrates these objects into the static map, resulting in persistent artifacts. Moreover, its reliance on purely geometric information leads to ambiguous object boundaries in the uncertainty maps. To overcome these limitations, we present DL-SLAM, a monocular Gaussian Splatting SLAM system built upon a novel dual-level probabilistic framework. Our method computes dynamic probability maps by combining semantic and geometric information. These pixel-level probabilities are lifted to 3D and aggregated to derive an object-level dynamic probability for each instance. Object-level probability enables the categorical pruning of dynamic Gaussians, resulting in an artifact-free static map. The static map, in turn, provides a geometrically consistent guidance to refine the pixel-wise probabilities, enhancing their reliability. Experimental results demonstrate that DL-SLAM outperforms existing approaches, improving tracking accuracy by up to 13\% while generating high-fidelity semantic maps.

0

cs.RO 2026-07-03

VLA-Corrector turns fixed action chunks into on-demand adaptive horizons

by Yi Pan, Miao Pan +9 more

VLA-Corrector: Lightweight Detect-and-Correct Inference for Adaptive Action Horizon

A lightweight monitor spots visual drift and triggers gradient corrections to limit compounding errors without retraining the base policy.

abstract click to expand

Vision-Language-Action (VLA) foundation models have recently achieved strong progress in embodied intelligence. To reduce policy-call frequency while preserving temporal coherence, most generative policies adopt an action chunk mechanism, executing multiple future actions in an open-loop manner under a fixed action horizon. However, this "predict-then-blindly-execute" paradigm sacrifices closed-loop reactivity: in contact-rich physical interactions, even small local perturbations can rapidly amplify within the open-loop blind spot, leading to compounding errors and ultimately task failure. To address this limitation, we propose VLA-Corrector, a lightweight corrective inference framework for action-chunked VLA policies. Without modifying the backbone policy weights, VLA-Corrector introduces a lightweight Latent-space Vision Monitor (LVM) that continuously compares predicted and actual visual feature evolution, enabling online detection of visual dynamics deviations. Once persistent deviation is detected, the system triggers a truncation event, discards the remaining stale actions, and invokes corrective replanning via Online Gradient Guidance (OGG). The detect-and-correct mechanism of VLA-Corrector naturally induces an event-triggered adaptive action horizon: it preserves long-horizon execution when the current chunk remains reliable, and invokes short-horizon corrective replanning when execution begins to drift. In doing so, VLA-Corrector mitigates the trade-off imposed by static horizons between execution robustness and policy-call frequency. It can be integrated into different VLA models without further retraining the VLA backbone, interrupting compounding errors while preserving much of the efficiency benefit of action chunking and substantially improving robustness in long-horizon, contact-rich robotic manipulation tasks.

0

cs.CV 2026-07-03

Pixel diffusion creates 3D Gaussian splats directly in one pass

by Duy Cao, Phong Nguyen-Ha

PixGS: Pixel-Space Diffusion for Direct 3D Gaussian Splat Generation

Bypassing latent compression produces higher-quality assets with 1-second inference on a single GPU.

abstract click to expand

Recent advances in 3D content generation from text or images have achieved impressive results, yet view inconsistency from 2D generators and the scarcity of high-quality 3D data remain significant bottlenecks. Existing solutions typically adapt large-scale pre-trained text-to-image latent diffusion models to generate 3D Gaussian Splats (3DGS). However, these approaches often rely on training complex cascade pipelines that are computationally expensive and scalability-limited. Most critically, the quality of generated 3D assets is inherently constrained by each component capacity and compressed latent space, leading to decoding artifacts and accumulated errors. To address these limitations, we propose PixGS, a single-stage pipeline for direct high-quality 3DGS generation, which leverages recent advances in pixel-space diffusion to bypass lossy latent compression while still benefiting from the vast 2D generative priors. By directly denoising 3D Gaussian attributes at each timestep, our method enables precise, splat-level regularization of both appearance and geometry. Furthermore, we introduce a comprehensive supervision strategy that incorporates surface normals, depth, and high-frequency structural information, which is often overlooked in prior works. Experiments demonstrate that PixGS outperforms current state-of-the-art methods while maintaining a fast inference speed (1s on a single A100 GPU), offering a robust and efficient alternative to multi-stage generation pipelines.

0

cs.RO 2026-07-03

Lightweight safe RL navigates UAVs through dense obstacles

by Shenghui Zhang, Yuxuan Gao +4 more

Lightweight Safe Reinforcement Learning for End-to-End UAV Navigation

Sparse-depth encoder and Lagrangian PPO raise success and safety over baselines at multiple speeds and densities.

abstract click to expand

With the rapid development of autonomous aerial systems, Unmanned Aerial Vehicles (UAVs) are increasingly deployed in applications such as inspection, environmental monitoring, and rescue, creating growing demand for reliable autonomous navigation. However, autonomous UAV navigation in dense environments remains challenging under sparse perception and dynamic constraints. Most reinforcement learning (RL) methods lack explicit safety mechanisms, leading to unsafe exploration, unstable training, and risky behaviors, especially during high-speed flight. Even in safe RL approaches, safety is often enforced by projecting policy outputs onto a safe action set, which may introduce instability. Meanwhile, many learning-based methods rely on dense inputs or large networks, increasing computational burden and limiting lightweight onboard deployment. Facing the above challenges, we propose a safety-constrained perception-control integrated framework for UAV navigation. A lightweight network encodes sparse observations into collision-risk-aware features using asymmetric and depthwise separable convolutions. We formulate the task as a constrained Markov decision process within a hierarchical control architecture and solve it using a Lagrangian-based safe PPO algorithm. Curriculum learning further improves training stability. Experiments with varying obstacle densities and flight speeds demonstrate higher success rates, improved safety, and better efficiency than existing reinforcement learning baselines.

0

cs.CV 2026-07-03

Learned front-ends cut VI-SLAM error by up to 38 percent on some datasets

by Shoon Kit Lim, Melissa Jia Ying Chong +1 more

DL-VINS-Factory: A Modular Framework for Learned Visual Front-Ends in Visual-Inertial SLAM

Benchmarks show ALIKED+LG gains on EuRoC and NTU-VIRAL while classical tracking wins on Botanic Garden, all running real-time on Jetson hard

abstract click to expand

Deep-learning features excel in visual matching, yet their practical value in tightly coupled visual-inertial SLAM (VI-SLAM) remains insufficiently characterized. We present DL-VINS-Factory, a unified framework that integrates learned feature extractors (ALIKED, RaCo, SuperPoint, XFeat) with either Lucas--Kanade (LK) optical-flow tracking or LightGlue (LG) descriptor matching. All front-ends share a sliding-window Ceres back-end, with optional AnyLoc DINOv2-VLAD loop closure, and 4-DoF pose-graph optimization. We benchmark the system across the four datasets covering indoor, unstructured outdoor, aggressive-motion, and visually degraded conditions. Results show that learned front-ends are viable for real-time embedded VI-SLAM, but are not universally superior to classical tracking. Relative to the corresponding GFTT+LK baseline, ALIKED+LG reduces EuRoC ATE by $5\%$ in monocular odometry and by $7\%$ in stereo with loop-closure. On NTU-VIRAL, where aggressive aerial motion increases inter-frame viewpoint change, ALIKED+LG stereo reduces loop-closed ATE by $12\%$. In Botanic Garden dataset, optical-flow tracking remains preferable, but learned keypoints still improve over the baseline GFTT, in which SuperPoint+LK reduces grayscale camera ATE by $29\%$, while RaCo+LK reduces RGB camera ATE by $38\%$. On SubT-MRS, learned front-ends display varying degree of improvement based on individual cases. With TensorRT acceleration on a Jetson AGX Orin, all valid configurations run in real time between $29$--$47$ FPS in monocular mode and $18$--$33$ FPS in stereo mode for the EuRoC and NTU-VIRAL datasets. AnyLoc further confirms roughly $2$--$7\times$ more valid loops than BRIEF+DBoW2. The implementation is open-sourced at https://github.com/limshoonkit/DL-VINS-Factory-ROS2/.

0

cs.RO 2026-07-03

Vision models build hybrid rewards for aligned robot policies

by Hexian Ni, Tao Lu +1 more

CoRe: Combined Rewards with Vision-Language Model Feedback for Preference-Aligned Reinforcement Learning

CoRe decomposes rewards into formal and residual parts optimized by VLMs, outperforming baselines on ten simulation and five real-world mani

abstract click to expand

Reward design remains a central challenge in reinforcement learning (RL). Hand-crafted rewards are often difficult to specify and may lead to suboptimal policies, while learned rewards from preferences can suffer from inefficiency and unstable training. Inspired by the dual nature of human learning explored in cognitive science, we decompose rewards into two complementary components: Formal Rewards (FR), explicitly designed based on task knowledge, and Residual Rewards (RR), learned from observations to capture implicit and nuanced preferences. Based on this decomposition, we propose CoRe, a hybrid framework that integrates FR and RR with vision-language models (VLMs) feedback to achieve preference-aligned policies without human involvement. Our contributions are twofold: (1) We propose a Formal Reward Module (FRM) that leverages VLMs to iteratively design and optimize FR based on task knowledge and preference feedback, enabling the continual improvement of policy during training; (2) We introduce a Residual Reward Module (RRM) that learns RR from video-level preference by employing VLMs to generate preference labels and capturing nuanced rewards that complement FR, ensuring alignment with human intent. Through the synergy of FRM and RRM, CoRe enables the automatic construction of reliable rewards that are efficient and preference-aligned. Extensive experiments demonstrate that CoRe outperforms existing approaches in terms of policy learning effectiveness and efficiency on ten robotic manipulation tasks in simulation and five real-worlds. Videos can be found on our project website: https://core-2026.github.io/

0

cs.RO 2026-07-03

Imagined tactile signals boost robot tasks up to 44 percent

by Zhiyuan Zhang, Adeesh Desai +8 more

Imagining the Sense of Touch: Touch-Informed Manipulation via Imagined Tactile Representations

Framework predicts touch from vision during training so policies improve contact and texture performance at test time with cameras only.

abstract click to expand

Tactile sensing can substantially improve contact-rich robotic manipulation, yet its practical deployment remains limited by the fragility, calibration requirements, and maintenance burden of tactile hardware. This raises a fundamental question: can robots benefit from tactile knowledge without requiring tactile sensors at deployment? We present TacImag, a tactile imagination framework that predicts tactile observations from vision and proprioception and uses the generated signals to guide manipulation policies. Trained from paired visuotactile demonstrations, TacImag enables touch-informed manipulation using only visual observations at test time. We evaluate TacImag on six simulated and four real-world manipulation tasks. Across simulation and real-world experiments, imagined tactile observations consistently improve manipulation performance without requiring tactile hardware. In real-world experiments, imagined force fields improve contact-sensitive tasks by 44.4% on average, whereas imagined tactile images improve texture-sensitive tasks by 23.3%, revealing that the effectiveness of tactile imagination depends strongly on the relationship between tactile representation and task requirements. Our results further suggest that tactile imagination does not simply recover missing tactile measurements. Instead, it acts as a form of contact-aware supervision that transforms subtle visual interaction cues into representations that are easier for manipulation policies to exploit.

0

cs.RO 2026-07-03

One demonstration automates real-world robot RL

by Yuwan Liu, Hongze Yu +6 more

One Demonstration Is Enough for Real-World Robotic Reinforcement Learning

AutoSERL's sliding-window guidance and recovery points match human-in-loop results on insertion and hanging tasks without further human inpu

abstract click to expand

Learning effective robot control policies on physical hardware is challenging due to costly data collection and the difficulty of reward specification. Prior work has incorporated demonstrations into reinforcement learning (RL), yet existing approaches either require large numbers of demonstrations or depend on continuous human intervention during training. To address these limitations, we present AutoSERL, a framework that leverages a single demonstration to fully automate the intervention process in real-world robot RL. The framework includes three complementary mechanisms to accomplish certain tasks: a sliding window intervention mechanism that continuously guides exploration to prevent local optima and unsafe deviations, a safety recovery mechanism that detects and corrects failure states via predefined trajectory recovery points, and an intervention termination criterion that automatically disables guidance once the policy can independently complete the task, preserving its exploration advantage. We evaluate AutoSERL on six contact-intensive manipulation tasks across two robot platforms, spanning insertion, hanging, and hinge-based tasks. AutoSERL consistently outperforms SERL initialized with 20 demonstrations, behavior cloning, and MILES -- a dedicated one-shot imitation learning baseline -- across all tasks while matching HIL-SERL, achieves 100% success rate on insertion tasks, and demonstrates improved robustness to positional variations, all from a single demonstration. Code and videos are available on our project website: https://autoserl.github.io/.

0

cs.RO 2026-07-03

USV local planner routes behind moving obstacles

by Daniel G. Schwartz

Path planning for unmanned naval surface vehicles

Direction logic relative to global path chooses systematic avoidance instead of running parallel.

abstract click to expand

There nowadays is a myriad of approaches to real-time avoidance of fixed obstacles for unmanned surface vehicles (USVs) and, to a lesser extent, also the task of avoiding moving obstacles such as boats, ships, swimmers, and other USVs, but both topics still present challenges. This paper offers novel approaches to both of these problems. It uses a combination of a global path planner, which finds a path from a start point to a goal point that avoids fixed obstacles (given that their locations are known in advance), and a local path planner, which can circumnavigate a moving obstacle (as well as any previously unknown fixed obstacles). The global planner is novel in that it employs a combination of three path planners, one known in the literature as Grassfire, one that is a new modification of Grassfire, and one that is a new, and arguably more intuitive, version of the well-known Probabilistic Roadmap. The local planner is novel in that it employs a higher-level decision logic based on its observations regarding the direction of movement of the obstacle relative to the USVs global path. This logic enables the USV to determine the best strategy for avoiding the obstacle by systematically routing the vehicle behind the obstacle rather than running parallel to it until the opportunity to pass appears. Simulations are provided that validate these claims. For comparison with other systems, the simulations include an implementation of the well-known D* algorithm, and the discussion covers additional dynamic path planning systems, which, like D*, do not necessarily route the vehicle behind the moving obstacle.

0

cs.CV 2026-07-03

Language and future signals stabilize VLA transfer on mixed robot data

by Guoyang Xia, Fengfa Li +5 more

VLAFlow: A Unified Training Framework for Vision-Language-Action Models via Co-training and Future Latent Alignment

A single framework shows the combination of co-training and latent alignment reduces sensitivity to data heterogeneity across three benchmar

abstract click to expand

Vision-language-action models (VLAs) have recently advanced robotic manipulation, yet the effects of different robot-data pre-training paradigms remain difficult to compare because existing models often differ in architecture, data, action space, and evaluation protocol. We present VLAFlow (Vision-Language-Action Flow), a unified flow-matching framework for controlled comparison of VLA training objectives. Using a heterogeneous robot corpus, OXEMix, containing approximately 5,000 hours of data from DROID, OpenX-Embodiment, OpenX-Augmented, and RoboCOIN, we evaluate four paradigms under the same pi0-style architecture, shared VLM backbone, action expert, and 14-dimensional action space: action-only modeling (MindPI), language-supervised co-training (MindLPI), future latent alignment (MindWPI), and their combination (MindLWPI). Experiments on LIBERO, LIBERO-Plus, and SimplerEnv show that action-only pre-training is sensitive to heterogeneous data. In contrast, language supervision helps preserve vision-language generalization, while future latent alignment improves state-transition and action-outcome modeling. By combining both signals, MindLWPI achieves the most stable overall transfer performance across benchmarks. These results suggest a meta-action space view: language and future latent representations provide complementary intermediate constraints that make heterogeneous action supervision smoother and more transferable.

0

cs.RO 2026-07-03

MR-NMPC raises quadruped success rate 2.9x on rough terrain

by Taizoon Chunawala, Jeeseop Kim +1 more

Multi-Rate Nonlinear Model Predictive Control for Wall-Supported Bipedal Locomotion of Quadrupedal Robots

High-level optimizer plans contacts and CoM motion together; low-level tracker follows full dynamics for wall-assisted bipedal gaits.

abstract click to expand

This paper presents a novel layered planning and control framework based on multi-rate nonlinear model predictive control (MR-NMPC) that enables quadrupedal robots to perform hybrid bipedal locomotion with wall-assisted support in constrained environments. Real-time trajectory optimization for this locomotion presents significant challenges, as the controller must simultaneously plan for both the contact points and the continuous trajectories of the robot's center of mass (CoM) and orientation within the robot's nonlinear dynamics while accounting for unilateral contact constraints, underactuation, and the switching nature of the robot's dynamics. At the high level of the control framework, an MR-NMPC is proposed, which dynamically plans both the discrete-time trajectories of the contact points and the continuous-time trajectories of the CoM and orientation, using a single rigid body (SRB) dynamics model. By incorporating contact-point planning within the multi-rate optimal control framework, this approach enhances dynamic stability compared to heuristic foot placement strategies. At the low level of the control framework, a nonlinear whole-body controller (WBC) based on virtual constraints and a quadratic program enforces full-order dynamics and tracks the MR-NMPC references. The proposed approach is validated through extensive numerical simulations demonstrating the robust wall-assisted bipedal locomotion of a Unitree A1 quadrupedal robot on rough terrains and under external disturbances in a constrained environment. Comparative analysis shows that the proposed MR-NMPC achieves a 2.9 times higher success rate compared to conventional MPC with heuristic-based foot placement strategies in negotiating irregular terrain at high speeds.

0

cs.RO 2026-07-03

Robot turns in place five times faster by dropping to four wheels

by Kento Koizumi, Tomoaki Ohba +3 more

A Reconfigurable Rocker-Bogie Robot for High Step Climbing and Turning

Active bogie motors switch it back to six wheels for 40 cm climbs while cutting turning torque to 17 percent

abstract click to expand

This study proposes a reconfigurable rocker-bogie mechanism that achieves efficient turning motion with a small number of actuators while maintaining high step-climbing capability. By installing motors at the bogie joints and actively swinging up and down bogies, the system enables switching between four-wheel and six-wheel configurations. Omnidirectional wheels are mounted on the rear ends of the rockers, allowing smooth turning in the four-wheel configuration based on a differential-drive model. Experimental evaluation using a prototype robot demonstrated that the proposed mechanism achieves zero-radius turning at a speed more than five times that of a conventional rocker-bogie mechanism equipped with six non-steerable grip wheels, while requiring only approximately 17% of the total average wheel torque. In addition, the robot successfully climbed a 40 cm step with an average climbing time of 6.4 s, confirming its high turning and step-climbing performance.

0

cs.CR 2026-07-02

Scene text forces LVLMs to overthink and slows robots up to 7x

by Qiang Han, Jie Wu +1 more

Overthink-Triggered Slowdown Attacks on LVLM-Based Robotic Systems

Black-box attacks embed readable triggers in scenes to delay decisions, with physical prints still effective

abstract click to expand

Large Vision-Language Models (LVLMs) have been increasingly integrated into robotic systems. However, these models may exhibit overthinking behaviors, where they generate excessively long reasoning traces, incurring an excessive inference time. This overthinking behavior poses a serious risk to robotic systems, as the adversary can deliberately trigger overthinking to slow down the decision making of a victim robotic system, causing a variety of safety issues (i.e., an overthinking-induced slowdown attack). To initiate this attack, an adversary can embed carefully crafted, human-readable scene text into the visual scene observed by a victim robotic agent, causing significant inference delays even under a strict black-box setting. Therefore, the embedded scene text serves as a significant "trigger" for the attack. This work systematically identifies and validates transferable triggers of overthinking in robotic systems by introducing a three-stage framework. First, we construct a diverse corpus of reasoning-intensive scene text and extract overthinking-correlated lexical features from short response prefixes. Second, we perform an efficient black-box search guided by a prefix-based proxy score while selectively confirming a small set of top candidates with full latency measurements. Third, we evaluate black-box transfer using a fixed pool of triggers on unseen images and multiple LVLMs, reporting latency amplification and attack success rates under standard thresholds. Across three representative LVLMs, all triggers yield slowdown ratios greater than 1.0x, with the strongest single-trigger case reaching 6.96x. The physical printing of the text trigger still causes up to 4.74x latency amplification. These results demonstrate that our discovered triggers are transferred between multiple LVLM models and consistently cause significant slowdowns in robotic systems.

0

cs.RO 2026-07-02

SE(2) mesh captures over 50 percent more traversable area

by Shuyang Shi, Kaixian Qu +4 more

SE(2) Navigation Mesh

It layers yaw-specific footprint checks on top of a polygonal map so non-circular robots can use space that heading-invariant meshes discard

abstract click to expand

Global navigation for ground robots in complex multi-level environments requires representations that accurately capture traversable regions while enabling efficient path planning. Current approaches present key limitations: Point clouds and volumetric occupancy maps lack explicit surface structure for traversability estimation, whereas direct pathfinding on dense triangle meshes is computationally prohibitive. Navigation meshes mitigate these challenges through polygonal abstraction of the underlying mesh, but assume yaw-invariant traversability, rendering them unsuitable for non-circular robots in constrained spaces. We propose SE(2) Navigation Mesh (SE(2) NavMesh), a polygonal representation of traversable regions that encodes yaw-dependent traversability. Our method evaluates traversability using footprint masks and constructs a graph over yaw-specific layers with explicit translational and rotational connectivity. Grounded in this representation, we develop an A*-String Pulling-A* (ASA) pathfinding strategy that hierarchically optimizes robot position and heading. We also present an online method that incrementally updates the SE(2) NavMesh from streaming point clouds during concurrent geometry reconstruction. In simulation, the SE(2) NavMesh captures over 50% more traversable area than classical NavMeshes, and the SE(2) NavMesh + ASA pipeline consistently outperforms sampling-based baselines in constrained environments. Extensive real-world experiments on a physical robot validate real-time online generation and successful navigation across multiple environments.

0

cs.RO 2026-07-02

BIFROST creates shared latents for zero-shot sim-to-real robot transfer

by Yunfu Deng, Josiah P. Hanna

BIFROST: Bridging Invariant Feature Representation for Observation-space Sim2Real Transfer

A bisimulation objective on paired observation sequences bridges visual and dynamics gaps so simulation policies run directly in reality.

abstract click to expand

Sim2real transfer for robot policy learning suffers due to mismatch between simulation and reality. Existing methods typically address each gap in isolation through separate adaptation modules, which are composed or layered when both gaps coexist. Yet the basis for attempting sim2real in the first place is that there is shared structure between a task in simulation and reality, where equivalent actions from equivalent configurations produce equivalent long term outcomes regardless of domain specific differences in rendering or physics. In this paper, we study whether we can identify and exploit this shared structure from raw observations to train a policy that enables zero shot transfer. We introduce BIFROST, which learns a shared history encoder on paired cross-domain data via cross-domain bisimulation objective: observation-action sequences leading to equivalent long-term behavior are mapped to nearby latent states, regardless of domain. Policies trained on these latent states in simulation transfer zero-shot to reality. We provide empirical evidence on sim2sim visual navigation and sim2real contact rich manipulation task and visual servoing task that BIFROST achieves effective transfer where domain adaptation and co-training baselines fail under both visual and dynamics domain gaps.

0

cs.RO 2026-07-02

Simulator syncs time to wall clock for live human-AV planner tests

by Yunfei Bi, Youran Wang

CommonRoad-Game: A Human-in-the-Loop Simulation Framework for Autonomous Driving

The multi-threaded design supports scalable simulations and records interactions to create reproducible planner test cases.

abstract click to expand

Motion planning algorithms should be evaluated in human-in-the-loop environments to ensure they produce safe and efficient behaviors during interactions. However, existing simulation platforms often rely on recorded datasets, lack dedicated interfaces for real-time human interaction, or remain weakly integrated with an autonomous driving ecosystem. Moreover, many human-in-the-loop simulators are computationally intensive by design, making them less suitable for rapid prototyping and flexible experimentation in early-stage autonomous driving research. To address these limitations, we present CommonRoad-Game, a lightweight human-in-the-loop simulation framework tightly integrated with the CommonRoad platform, focusing on the systematic testing of motion planners with human participation and the analysis of human driving behaviors in interactive scenarios. We introduce a multi-threaded architecture with a robust synchronization mechanism that aligns simulation time with wall-clock time, enabling deterministic and temporally consistent interaction between autonomous and human-driven vehicles. In addition, the framework provides a scenario generation module that records driving logs, allowing diverse and reproducible test cases to be constructed from human-in-the-loop experiments. Experimental results demonstrate that CommonRoad-Game achieves stable temporal synchronization, supports scalable multi-agent simulation, and seamlessly integrates CommonRoad-compatible motion planners to generate interactive driving scenarios. The source code is publicly available at https://github.com/Yunfei-Bi8/CommonRoad-Game.

0

cs.RO 2026-07-02

Flow matching safety guidance reaches 82.8% collision avoidance

by William English, Hao Zheng +1 more

Neuro-Symbolic Safety Guidance for Vision-Language-Action Models via Constrained Flow Matching

Predictive avoidance via constrained optimization during denoising improves task success by 19.8% over single-step methods.

abstract click to expand

Vision-Language-Action (VLA) models have demonstrated promising generalization capabilities across robotic manipulation tasks, yet their real-world deployment remains limited by the lack of effective safety measures. Specifically, existing safety measures only prevent collisions caused by the robot's next action. In this paper, we propose a neuro-symbolic safety guidance mechanism for flow matching based VLAs that enables predictive collision avoidance. Flow matching based VLAs determine the next actions by predicting a trajectory (a sequence of actions) through an iterative neural flow matching process. Our method formulates safety enforcement as a minimum-norm constrained optimization problem that corrects safety violations during the denoising process of noisy intermediate trajectory predictions. By analyzing predicted trajectories and applying corrections during iterative denoising, our approach anticipates collisions before they become unavoidable. This interleaving of symbolic constraint satisfaction with neural trajectory generation enables predictive collision avoidance rather than reactive intervention. On the SafeLIBERO benchmark, our method achieves 82.8% collision avoidance and 81.6% task success, a 6.3% and 19.8% improvement respectively over single-step methods, with the largest gains on long-horizon tasks where compounding distribution shift is most pronounced. Video demonstrations of our approach are included on our project page at https://willenglish.tech/SafetyGuidedFlowMatching/.

0

cs.MA 2026-07-02

Generalized reward gives orbit agents control over image timing

by Patrick Quinn, Bala Prenith Reddy Gopu +2 more

Simulation Based Reward Function Validation for Multi-Agent On Orbit Inspection

Analysis of 3D reconstructions frees MARL spacecraft from fixed inspection points and lets them evaluate arbitrary image sets.

abstract click to expand

A proposed method for the control of groups of inspection spacecraft is Multi-Agent Reinforcement Learning (MARL). While MARL has already been employed for this purpose in previous work, the reward functions used focus on reaching a finite set of predetermined inspection points around the target. In this work, we study and develop a generalized reward function for the MARL inspection task informed by the analysis of 3D reconstructions of inspected objects in orbit. Because the reward function is generalized such that any number of images at arbitrary locations may evaluated, we also allow trained agents to have complete control over when images are collected. With this approach, we gather insights into best practices for not only the specific MARL inspection task, but also gain key takeaways informative to the broader inspection task outside of a MARL context.

0

cs.RO 2026-07-02

Progress signal raises bimanual assembly success to 80%

by Chenyang Ma, Yue Yang +5 more

FurnitureVLA: Learning Long-Horizon Bimanual Furniture Assembly with Vision-Language-Action Model

Vision-language-action model predicts actions and task progress to manage seven subtasks over 1550 steps.

abstract click to expand

Current work on robot furniture assembly mostly focuses on toy-scale settings or single-arm manipulation. We introduce FurnitureVLA, the first systematic study of real-scale bimanual furniture assembly using Vision-Language-Action models (VLAs). We formalize the task, develop a scalable simulation pipeline for expert data generation and evaluation, and build a VR teleoperation system for single-operator bimanual control to collect high-quality real-world demonstrations. To address extreme long-horizon assembly with up to 7 subtasks and 1550 control steps, we propose a progress-enhanced VLA, finetuned on semantically grounded subtasks, that jointly predicts actions and a continuous progress signal, enabling automatic subtask transitions and reducing compounding errors during inference. We further study perception and control design factors that critically affect precision in real-scale assembly. FurnitureVLA improves average simulation success from 48% to 80% compared to baselines across three furniture types, with an additional 21% gain from our design factor study. We validate on a real Kinova Gen3 platform with only 16% drop on the hardest task.

0

eess.SY 2026-07-02

The paper develops GPU-parallel methods to compute tight linearization error bounds for…

by Jeffrey Fang, Keyi Shen +2 more

GPU-Parallel Linearization Error Bounds for Real-Time Robust Optimal Control of Nonlinear and Neural Network Dynamics

Tight linearization bounds computed in parallel allow online optimization of feedback policies with formal guarantees up to 168 dimensions.

abstract click to expand

This paper studies real-time robust optimal control for uncertain nonlinear systems, where linear time-varying (LTV) approximations make planning tractable but require sound linearization error bounds (LEBs) to guarantee robust constraint satisfaction. We develop tight, differentiable, GPU-parallel LEBs for LTV approximations of nonlinear and neural network (NN) dynamics. For analytic dynamics, we introduce path-based Hessian bounds that are tighter than standard interval methods. For NN dynamics, we derive certified LEBs using NN verifier-generated affine relaxations and local Jacobian corrections. We adapt a GPU-parallel system-level synthesis LTV-based robust control solver to be compatible with these LEBs by extending it to handle right-invertible disturbance matrices and non-zero-centered disturbance sets for tight zonotopic uncertainty propagation. Our method, GPUSLS-LEO, enables online optimization of robust feedback policies that account for linearization error, producing tight, formally verified reachable tubes. On complex nonlinear and NN dynamics up to 168 state dimensions, our method can compute robust control policies on the GPU at rates up to 67 Hz, reducing solve times and conservativeness relative to baselines while preserving formal guarantees and real-time performance.

0

cs.RO 2026-07-02

Inverse dynamics model replaces force sensors in four-channel teleop

by Amir Noohian, Dylan Miller +3 more

Sensorless Four-Channel Control Architecture Using Inverse Dynamics Modeling for Human-Scale Bilateral Teleoperation

WAM experiments show improved tracking and higher impedance without external force sensors.

abstract click to expand

The four-channel teleoperation architecture is a well-established framework for achieving transparency in bilateral systems. However, its performance in human-scale teleoperation is limited by high inertia, modeling challenges, and reliance on noisy and costly force/torque sensors. This paper introduces a sensorless four-channel architecture based on inverse dynamics modeling. The controller is implemented and validated on a customized WAM bilateral teleoperation setup. Experiments demonstrate that the proposed approach outperforms conventional two- and four-channel schemes as well as transparency-enhancement methods, improving position and force tracking, reducing operator effort, and increasing maximum transmittable impedance without external sensors. A door-opening case study involving sustained whole-body contact along the manipulator further demonstrates the effectiveness of the method in realistic human-scale manipulation tasks.

0

cs.RO 2026-07-02

Quadrotor safety filter on 3DGS cuts jerk 47%

by Tscholl Dario, Nakka Yashwanth Kumar +1 more

FastBridge: Closing the Model-Based Realization Gap in Safety Filters on 3D Gaussian Splatting for Fast Quadrotor Flight

Full-dynamics CBF with backup policy respects actuator limits for real-time cluttered flight.

abstract click to expand

Fast quadrotor flight requires safe obstacle avoidance under tight onboard compute limits. While 3D Gaussian Splatting (3DGS) provides a continuous, geometry-aware scene representation for perception-driven navigation, existing 3DGS safety filters use reduced-order models such as single- and double-integrators that ignore actuator limits and assume commanded accelerations are realized instantaneously. Building on an analytic collision cone barrier for 3DGS, we introduce a nonlinear, actuator-aware safety filter enforced through the full quadrotor dynamics. We derive a high-relative-degree collision cone exponential CBF and a backup CBF that preserves QP feasibility under input constraints using a forward-simulated backup policy. Compared with a state-of-the-art 3DGS safety filter, our approach reduces trajectory jerk by 47% and runs 2.25 times faster. We validate the method in simulation and on hardware for real-time navigation in cluttered, perception-derived environments.

0

cs.RO 2026-07-02

4D latent model improves robot planning with 3D consistency

by Zhiyi Li, Peilin Wu +3 more

Structured 4D Latent Predictive Model for Robot Planning

Predicts scene evolution in structured space, outperforming video planners on manipulation and real robots.

abstract click to expand

Video predictive models are emerging as a powerful paradigm in robotics, offering a promising path toward task generalization, long-horizon planning, and flexible decision-making. However, prevailing approaches often operate on 2D video sequences, inherently lacking the 3D geometric understanding necessary for precise spatial reasoning and physical consistency. We introduce a Structured 4D Latent Predictive Model, which predicts the evolution of a scene's 3D structure in a structured latent space conditioned on observations and textual instructions. Our representation encodes the scene holistically and can be decoded into diverse 3D formats, enabling a more complete and 3D consistent scene understanding. This structured 4D latent predictive model serves as a planner, generating future scenes that are translated into executable actions by a goal-conditioned inverse dynamics module. Experiments demonstrate that our model generates futures with strong visual quality, substantially better 3D consistency and multi-view coherence compared to state-of-the-art video-based planners. Consequently, our full planning pipeline achieves superior performance on complex manipulation tasks, exhibits robust generalization to novel visual conditions, and proves effective on real-world robotic platforms. Our website is available at https://structured-4d-model.github.io/.

0

cs.CV 2026-07-02

Metric-free training tops every Waymo forecasting score

by Markus Knoche, Daan de Geus +1 more

Towards Metric-Agnostic Trajectory Forecasting

Optimizing the full trajectory distribution then extracting metric-specific outputs with TraDiE policies sets new records without tailoring

abstract click to expand

Accurate trajectory forecasting of surrounding traffic participants is a core capability for autonomous driving, enabling vehicles to anticipate behavior and plan safe maneuvers. We observe that current state-of-the-art forecasting models on Argoverse 2 and the Waymo Open Motion Dataset tailor their training objectives to the different benchmark metrics. Because these metrics encourage conflicting behavior, we propose a paradigm change for trajectory forecasting: training models with metric-agnostic probabilistic objectives and treating metric optimization as a downstream task applied to the predictive distribution. Concretely, we introduce Trajectory Distribution Evaluation (TraDiE) policies, metric-specific policies that map a predictive distribution to the set of $K$ trajectories and confidences required by trajectory forecasting metrics. We evaluate this framework by introducing DONUT-NLL, which adapts the training objective of the state-of-the-art trajectory forecasting model DONUT to directly optimize the predictive distribution. Using our policies, DONUT-NLL achieves state-of-the-art results on all metrics of the Waymo motion prediction benchmark.

0

cs.RO 2026-07-02

Failure-aware retry raises robot success rates 17.6% in simulation

by Haoran Hao, Shahram Najam Syed +2 more

FAR: Failure-Aware Retry for Test-Time Recovery and Continual Policy Improvement

Turning failures into preference data enables autonomous recovery and continual gains without human resets.

abstract click to expand

Robot policies inevitably encounter failures when deployed in real environments. Naive retries often repeat the same mistakes, while many existing recovery methods rely on human intervention. In this paper, we propose Failure-Aware Retry (FAR), a framework that enables robots to learn from previous failures at test time, adapt their behavior accordingly, and eventually complete the task autonomously. FAR combines Failure-Contrastive Preference Adaptation, which constructs preference learning data from failures to steer the policy away from previously unsuccessful behaviors, with lightweight action perturbations during retries to encourage local exploration. We further incorporate successful recovery trajectories into a training loop for continual policy improvement. Experiments in both simulation and real-world manipulation tasks show that FAR substantially improves success rates and robustness, with average gains of 17.6% over the standard diffusion policy in simulation and 11.7% in the real world. In addition, FAR significantly improves data efficiency under both reset and timestep budgets during continual policy improvement by exploiting informative failure cases.

0

cs.RO 2026-07-02

Async algorithm cuts robot comms by 96.9% with negligible error

by Adam Pooley, Matthew Hale

Technical Report: Asynchronous Distributed Trajectory Estimation of Multi-Robot Systems

Distributed trajectory estimates converge exponentially and reduce error by up to 64% versus prior methods

abstract click to expand

Distributed trajectory estimation arises in many applications across robotics, but existing implementations typically do not consider asynchrony in agents' communications and computations. Therefore, we propose an asynchronous block coordinate descent algorithm for distributed trajectory estimation. We consider a team of agents that observes a team of robots and estimates their states over a sliding window. The agents solve an approximation of the maximum a posteriori estimation problem, which we derive. We show this approximation introduces negligible errors and eliminates up to 96.9% of communications among agents. Next, we prove that agents' iterates converge exponentially fast to the optimal estimate of the robots' states. Simulations show that this approach has up to 64% less error than a comparable state-of-the-art algorithm. Experiments on mobile robots show the robustness of this approach to delays whose lengths span three orders of magnitude.

0

cs.RO 2026-07-02

Shared GPU serving lifts robot factory output 12 times

by Wenqi Jiang, Jason Clemons +6 more

ROSA: A Robotics Foundation Model Serving System for Robot Factories

ROSA pools server GPUs for entire robot fleets and schedules for total tasks completed instead of single-request speed.

abstract click to expand

Robotics foundation models (RFMs) are making general-purpose robots increasingly practical for factory deployments. While RFM serving systems are central to this vision, existing systems are largely shaped by a single-robot, single-model assumption: inference is treated as an edge-computing problem handled by an on-robot or dedicated nearby GPU, and the serving objective is to minimize the latency of a single action model. In this paper, we propose ROSA, an RFM serving system for robot factories designed around three key principles. First, ROSA adopts shared GPU-pool serving, allowing a fleet of robots to access powerful server-class GPUs over the network in order to improve inference performance, battery duration, and GPU utilization. Second, ROSA provides a robotics-aware programming abstraction and system design that supports multi-model pipelines, per-task performance requirements, and failure handling. Third, ROSA uses factory-objective-driven scheduling to maximize SLO-qualified factory productivity rather than minimizing individual request latency. We implement ROSA on top of Ray Serve for distributed orchestration, with vLLM, PyTorch, and JAX as model-serving backends, and evaluate it on both real robots and synthetic large-scale workloads. The results show that ROSA improves factory productivity by up to 12.06x over conventional dedicated serving systems.

0

cs.RO 2026-07-02

Vision-language model predicts robot pose from camera

by Suraj Borate, Aarav Shah +1 more

Where Am I? Semantic Map Grounding via Vision-Language Models for Multi-Modal Localization

It reaches 96.75 percent full pose accuracy indoors and holds performance on unseen objects or partial maps.

abstract click to expand

We address robot localization in GPS-denied indoor environments by reframing it as a semantic reasoning task rather than a geometric estimation problem. Motivated by how humans localize using object-level cues and labeled maps, we ask whether a vision-language model, given a front camera image, a polar LiDAR scan, and a top-down semantic grid map, can infer the robot pose. We fine-tune Qwen2.5-VL-7B with LoRA and attach a lightweight regression head that predicts continuous pose coordinates (x, y, theta) directly from the final hidden state, bypassing text generation. Training uses a composite position-and-direction loss with curriculum learning on a custom Gazebo dataset of 120,112 samples and 527 scenes. On the in-distribution test set of 18,017 samples, the model achieves 98.23 percent position accuracy, 98.00 percent direction accuracy, 96.75 percent full pose accuracy, a mean position error of 0.11 m, and a mean orientation error of 5.7 degrees at 0.62 s per sample. Position accuracy drops by only 7.2 percentage points on seven unseen object categories, reaching 90.99 percent, supporting semantic spatial reasoning rather than appearance memorization. With incomplete maps, fine-tuning recovers performance to 93.72 percent position accuracy, showing adaptability to stale or partial map information. Two ablations highlight cross-modal complementarity. Without LiDAR, using only camera and map inputs, position accuracy remains 95.06 percent, only 3.2 percentage points below the full system. However, when the camera sees no visible objects in a wall-facing view, LiDAR sustains 92.33 percent position accuracy, compared with 70.74 percent when neither LiDAR nor visible objects are available. This shows that LiDAR becomes the primary localization signal when camera semantics are unavailable and provides a reliable fallback under occlusion or sparse layouts.

0

cs.RO 2026-07-02

ROS 2 middleware faces trade-offs in space

by Sanghoon Lee, Taehun Kim +2 more

The Three Dimensions of ROS 2 Middleware

Spatial abstraction hides network issues while state continuity adds overhead that competes with control timing in wireless settings.

abstract click to expand

ROS 2 (Robot Operating System 2) has emerged as the de facto standard for modern robot software development, with middleware implementations such as the Data Distribution Service (DDS) and Zenoh forming the core infrastructure for distributed robotic communication. Despite their architectural flexibility, these middleware systems exhibit structural limitations, particularly under dynamic and resource-constrained wireless environments. This paper presents a systematic survey of ROS 2 middleware and introduces a conceptual framework to examine its architectural limits through three structural dimensions required by distributed robotic systems, namely Space, Time, and State. We first provide a structured analysis of middleware architecture and operational dynamics, including discovery, data exchange, and state management mechanisms. Building on this foundation, we formalize Time as temporal predictability for control loops, Space as spatial abstraction from physical topology to enable modular deployment, and State as contextual continuity despite dynamic node participation and intermittent connectivity. Through a comprehensive review of existing implementations and prior studies, we organize middleware research according to the structural trade-offs that arise among these dimensions. Under constrained wireless conditions, spatial abstraction can obscure network variability and weaken temporal guarantees, while mechanisms that preserve state continuity introduce computational and network overhead that competes with time-critical communication. These interactions reveal structural trade-offs that characterize the practical limits of contemporary robot middleware. By synthesizing architectural patterns and identifying gaps in current modeling and analysis approaches, this survey outlines a principled research roadmap for robust and scalable robotic middleware architectures.

0

cs.RO 2026-07-02

Human tactile videos pre-train robots for precise manipulation

by Chi Zhang, Penglin Cai +7 more

Human-Centric Transferable Tactile Pre-Training for Dexterous Robotic Manipulation

160-hour egocentric dataset and future-tactile prediction enable transfer with unified spaces and no extra alignment.

abstract click to expand

As an essential modality for dexterous and contact-rich tasks, tactile sensing provides precise force feedback that cannot be reliably inferred from vision. However, limited by hardware and data collection systems, existing datasets with tactility remain small in scale and narrow in contact coverage. Meanwhile, Vision-Language-Action (VLA) models with tactile modality are constrained on dynamics-agnostic post-training, which limits the performance ceiling on downstream tasks. In this paper, we present H-Tac, a large-scale tactile-action dataset with 160-hour egocentric human videos containing more than 300 tasks and 135k episodes. Building upon this, we propose Transferable Tactile Pre-Training (TTP), a system of tactile-based pre-training on human data for fine-grained robotic tasks. To bridge the gap between humans and robots, we use unified tactile and action spaces throughout the pre-training and post-training phases, preserving prior knowledge during human-to-robot transfer. By leveraging a tactile expert for future tactile prediction, our framework explicitly models the contact dynamics and precise physical interactions. Extensive experiments in simulation and on real robots demonstrate that our model achieves superior performance, exhibiting robust generalization and fine-grained manipulation capabilities. TTP paves the way for scalable tactile pre-training via human-to-robot transfer.

0

cs.RO 2026-07-02

Neural simulator matches real robot tests at 0.99 correlation

by Byeongguk Jeon, Seonghyeon Ye +5 more

RoboWorld: Fast and Reliable Neural Simulators for Generalist Robot Policy Evaluation

RoboWorld combines autoregressive video models with task-progress scoring to align evaluations with physical results across tasks and enviro

abstract click to expand

Video world models are emerging as a scalable alternative for evaluating generalist robot policies, bypassing the physical constraints and engineering burdens of real-world deployment. However, evaluating policies with video world models remains challenging, as world-model errors can make generated rollouts unreliable and slow inference limits large-scale throughput. We introduce RoboWorld, an automated evaluation pipeline that pairs a fast autoregressive video world model with a task-progress-aware vision-language model scoring. To enable reliable long-horizon autoregressive world-model rollouts, we propose Step Forcing, which combines anchored and one-step self-forwarded contexts to reduce train--test mismatch while preserving action--observation dynamics. Together, these components enable RoboWorld to align strongly with real-world robot evaluation across tasks and environments, achieving Pearson's r = 0.989 and Spearman's \r{ho} = 0.970.

0

cs.RO 2026-07-02

Robot policies learn stage-adaptive speeds without speed labels

by Qingda Hu, Ziheng Qiu +3 more

AutoSpeed: Annotation-Free Stage-Adaptive Motion Speed Learning for Robot Manipulation

A composite cost over speed-varied trajectory candidates shortens total task time and raises success by matching horizon length to stage dif

abstract click to expand

Different stages of manipulation tasks exhibit varying levels of difficulty, suggesting stage-dependent motion speeds and temporal prediction horizons. However, existing IL-based visuomotor policies typically imitate the execution speed of expert demonstrations and operate with a fixed temporal prediction horizon, limiting flexibility and overall task throughput. In this paper, we introduce AutoSpeed, a model-agnostic learning framework that enables existing visuomotor policies to predict trajectories with stage-adaptive motion speeds, without requiring speed or stage annotations. We treat future trajectories at different speeds as candidate optimization targets, evaluate each candidate using a composite cost that trades off prediction error against prediction horizon, and optimize the policy toward the minimum-cost candidate. With a fixed-length action sequence, speed modulation adjusts the effective temporal prediction horizon: simple stages are executed faster with a longer prediction horizon, whereas complex stages are executed more slowly with a shorter prediction horizon. Specifically, we implement speed modulation in the frequency domain via the discrete cosine transform (DCT), which enables smooth, non-integer speed scaling and thus preserves motion continuity. Extensive evaluations show that AutoSpeed substantially reduces task execution time while also improving success rates. Under the AutoSpeed framework, the inferred motion speeds exhibit a strong correspondence with task stages.

0

cs.RO 2026-07-02

Robots gain 10 points navigation success by asking humans

by Valentino Sacco, Luca Scofano +2 more

Robots Ask the Way: Communication-Enabled Social Navigation

A new communication module lets robots query residents about target locations, matching structured-data performance even with casual speech.

abstract click to expand

Assistive autonomous robots operating in multi-agent environments require efficient strategies to locate specific individuals among multiple residents. Current social navigation methods focus on reactive collision avoidance and trajectory adaptation, but lack mechanisms to proactively gather information through human-robot communication. We introduce Communication-enabled Social Navigation (CommNav). In this novel task, robotic agents actively seek assistance from residents to locate target individuals by requesting information about recent sightings, locations, and movements. To evaluate CommNav, we extend Habitat 3.0 to create Habitat 3.0c, a communication-enabled variant supporting multi-human environments with information exchange protocols. Adding our communication module (COMM) to a state-of-the-art social navigation model yields a 10 percentage-point improvement in Episode Success. We further investigate the transition from structured data to natural language by evaluating models trained on LLM-generated instructions and on colloquial instructions collected from a human study. Our experiments reveal that: (i) explicit human-robot communication substantially enhances multi-person navigation performance; (ii) pre-training COMM on a communication pretext task effectively addresses the challenge of occasional interaction signals; and (iii) the navigation policy is highly robust to natural, colloquial human language, achieving an episode success statistically similar to the model using perfect structured data.

0

cs.RO 2026-07-02

Test-time decay and penalties boost frozen VLN agents

by Shaoheng Zhang, Zhichen Li +1 more

DART-VLN: Test-Time Memory Decay and Anti-Loop Regularization for Discrete Vision-Language Navigation

Reweighting stale memory and penalizing reversals yield shorter paths and higher success on R2R and REVERIE without retraining

abstract click to expand

Memory-based discrete vision-language navigation (VLN) agents must act under partial observability, yet even strong frozen backbones remain vulnerable at test time. Two common failure modes are stale historical evidence at memory readout and inefficient local backtracking during action selection. We present DART-VLN, a training-free test-time control framework for discrete VLN. DART-VLN combines Test-Time Memory Decay, a read-side memory reweighting rule that suppresses stale and redundant evidence without rewriting stored content, with Anti-Loop Regularization, a lightweight next-hop penalty that discourages immediate reversals during action selection. The framework introduces no new learnable parameters and leaves the learned backbone unchanged. Experiments on R2R and REVERIE show a consistent pattern: decay-only provides stable read-side gains, while decay+anti-loop achieves the best overall quality-efficiency trade-off, yielding shorter trajectories, lower runtime, and improved navigation performance in key settings. Behavioral analysis further confirms that anti-loop regularization reduces local backtracking and improves path efficiency under frozen backbones. Overall, the results show that modest test-time control can make memory-based discrete VLN more reliable and efficient without retraining.

0

cs.RO 2026-07-02

Ambush lets slower pursuers catch faster evaders in complex spaces

by Junfeng Chen, Yinhang Luo +3 more

AMBUSH: Collaborative Capture in Complex Environments with Neural Acceleration

Parameterized positions based on topology and visibility enable capture without speed parity, accelerated by neural ranking of search option

abstract click to expand

Collaborative capture of dynamic targets is common in nature as an essential strategy for weaker species against the strong. Similar concepts have shown to be useful for numerous robotic applications, such as security and surveillance, search and rescue. However, most existing works focus on analytical and geometric solutions or end-to-end reinforcement learning methods, which are largely constrained to obstacle-free environments or scenarios with sparse, regularly distributed obstacles. This work tackles the problem from a unique perspective: the renowned strategy of``ambush'' alone would suffice for multiple slower pursuers to capture one faster evader with different levels of intelligence efficiently in complex environments. A parameterized strategy of ambush (including discrete and continuous parameters) is designed first, which takes into account the topological properties of the workspace, the truncated line-of-sight visibility, the relative speed ratio and the limited capture range. Then, a Hybrid Monte Carlo Tree Search (H-MCTS) algorithm is proposed to optimize the associated parameters through long-term planning, enabling the identification of highly promising parameters for future capture. Lastly, the neural acceleration is trained offline to learn the ranking of different choices of parameters across various environments, and to directly predict scores, replacing the rollout process in H-MCTS. The neural acceleration is adopted during online H-MCTS to accelerate the planning procedure while guaranteeing the planning quality. Its efficiency and effectiveness are validated in extensive simulations and hardware experiments, against evaders with different capabilities and intelligence levels, including two-times higher velocity and human-controlled behavior.

0

eess.IV 2026-07-02

Image tilt observations reduce UAV prediction error by 60 percent

by Minxing Sun, Yao Mao

Image-Domain Tilt Constrained Distributed Fusion for Maneuvering UAV Tracking with Multi-Camera Electro-Optical Observations

Apparent roll and pitch from rotorcraft images act as acceleration constraints in asynchronous multi-camera fusion

abstract click to expand

Short-horizon prediction is essential for electro-optical UAV tracking, especially when the target is small, maneuvering, or intermittently observed. Image center, line-of-sight, and range measurements provide direct constraints on target position, but their constraints on acceleration are weak. As a result, prediction can lag during aggressive maneuvers. This paper proposes an image-domain tilt constrained distributed fusion method for maneuvering UAV tracking. The method uses the apparent roll and pitch of a rotorcraft target in the image as low-level maneuver cues. A weak-prior auto-labeling pipeline first generates oriented bounding box and image-domain tilt labels from synchronized video, gimbal IMU, and UAV IMU data. A YOLO-OBB detector is then trained to provide online target position and tilt measurements. The front-end Python implementation is publicly available at github.com/ShineMinxing/PythonYOLO. In the fusion stage, the UAV state is modeled by position, velocity, and acceleration. Image-domain roll and pitch are introduced as acceleration-related pseudo-observations. For distributed tracking, one mobile gimbal camera and two fixed ground cameras are fused asynchronously. Camera attitude error states are augmented into the filter to absorb extrinsic drift and cross-camera systematic inconsistency. A Mahalanobis gate with time-since-last-valid covariance widening is used to reject false detections and handle dropouts. In simulation, adding roll/pitch observations reduces the prediction RMSE from 1.991 m to 0.821 m and decreases the cumulative prediction error by 60.75\%. In real distributed experiments, a self-consistency evaluation shows an 18.10\% reduction in cumulative prediction error. The results show that image-domain tilt can provide useful acceleration constraints for robust short-horizon UAV prediction.

0

cs.CV 2026-07-02

Test-time optimization lifts depth-only 3D segmentation

by Xuying Huang, Sicong Pan +1 more

Privacy-Preserving Depth-Only Open-Vocabulary 3D Semantic Segmentation Via Uncertainty-Guided Test-Time Optimization

UTTO turns uncertainty into a refinement signal and applies foundation model priors for open-vocabulary gains without training or RGB input.

abstract click to expand

Privacy-preserving perception is a critical requirement for deploying 3D scene understanding systems in real-world indoor environments, yet it remains underexplored in open-vocabulary 3D semantic segmentation. Existing methods typically rely on obtaining rich semantic cues from RGB images, which may expose privacy-sensitive visual information. Depth-only 3D geometry provides a privacy-preserving alternative, but the absence of appearance-based semantic cues makes open-vocabulary predictions highly uncertain and less reliable. Under this setting, we propose to convert uncertainty into a guidance signal to identify unreliable semantic responses and use semantic priors from foundation models to regularize their refinement. We present UTTO, an uncertainty-guided test-time optimization framework for depth-only open-vocabulary 3D semantic segmentation. Without additional training, experiments on ScanNet20, ScanNet40, and ScanNet200 demonstrate that UTTO consistently improves depth-only open-vocabulary 3D segmentation and outperforms representative baselines under privacy-preserving conditions.

0

cs.RO 2026-07-02

Bayesian fusion boosts roundabout field of view by 260 percent

by Markos Antonopoulos, Anastasia Bolovinou +3 more

Beyond Line of Sight: Hybrid Validation of V2X Collective Perception in Complex Scenarios

Multi-agent probabilistic grids improve occupied cell recall from 0.82 to 0.94 with six vehicles

abstract click to expand

This paper introduces a probabilistic framework and hybrid validation methodology for V2X-enabled Collective Perception (CP) in complex traffic scenarios. The proposed Bayesian fusion algorithm extends the perceptual horizon of connected and autonomous vehicles by integrating heterogeneous sensor observations from multiple agents into a shared probabilistic occupancy grid. Each cell of this grid encapsulates both occupancy likelihood and uncertainty, enabling explainable and trustworthy situational awareness beyond the ego vehicle's field of view. To bridge the gap between simulation and real-world evaluation, a hybrid testing framework is developed, combining CARLA-based virtual environments with vehicle-in-the-loop experimentation. Experimental results in a roundabout scenario demonstrate a 260 percent increase in field-of-view coverage and a rise in occupied-cell recall from 0.82 (ego-only) to 0.94 (six-agent CP) under nominal localization conditions. Overall, the proposed approach provides a reproducible and interpretable foundation for validating CP systems, supporting the safe and certifiable deployment of cooperative autonomous vehicles.

0

cs.RO 2026-07-02

World models split into two spaces and link to robot actions in four ways

by Xiaoxiong Zhang, Xiong Zeng +1 more

From World Models to World Action Models: A Concise Tutorial for Robotics

Tutorial organizes predictive models by representation type and how forecasts become commands.

abstract click to expand

World models are increasingly used in embodied intelligence and generative simulation, yet their scope remains ambiguous across communities. This tutorial presents a design-space view of world models as action-conditioned predictive models that estimate the future evolution of task-relevant observations or states. We categorize existing methods into observation-space and state-space world models, comparing their trade-offs in visual fidelity, spatial structure, physical interpretability, and control usability. We further introduce world action models, which connect predicted futures with executable robot actions, and summarize four representative paradigms: imagine-then-execute, video-feature-conditioned action prediction, joint video-action modeling, and auxiliary video prediction for policy learning. The goal of this tutorial is to clarify the conceptual scope of world (action) models and provide a structured taxonomy for embodied prediction and control.

0

cs.RO 2026-07-02

Distance-field conformal bound certifies trajectory safety uniformly

by Jaeuk Shin, Yoonseok Ra +1 more

From Prediction Uncertainty to Conformalized Distance Fields for Safe Motion Planning

Functional conformal prediction on the full field keeps safety guarantees independent of obstacle count and sampling method.

abstract click to expand

Safe motion planning in dynamic environments requires reasoning about the uncertainty in predicted obstacle motion without sacrificing real-time performance. Existing conformal approaches conformalize a scalar score that aggregates per-obstacle prediction errors, losing spatial coherence and scaling poorly with scene density. We instead conformalize the entire predicted distance field at once. This functional conformal prediction (FCP) framework yields a distribution-free, field-level lower bound, from which safety follows uniformly: any trajectory satisfying the resulting constraint is certified safe, independent of how the control space is sampled. The key enabler is that the residual distance field is empirically low-rank and approximately time-invariant, which makes the bound decomposable in coefficient space. An envelope is fitted offline via functional PCA and a Gaussian-mixture inductive conformal procedure, then refined online by a lightweight adaptive functional conformal (AFCP) update on a low-dimensional vector. This keeps the per-step cost largely insensitive to obstacle count and retains long-run field coverage under distribution shift. We embed the envelope as a tightened safety constraint in a sampling-based model predictive controller, FCP-MPC. On the ETH--UCY pedestrian benchmarks and a dense 3D quadrotor task with up to 280 dynamic obstacles, FCP-MPC attains a favorable balance of safety, feasibility, and efficiency, reaching goals where pointwise and egocentric conformal baselines become too conservative or too expensive, while keeping per-step computation far below online uncertainty-reasoning baselines.

0

cs.RO 2026-07-02

Vision-language models adapt robot positions for changing human groups

by Cong-Thanh Vu, Yen-Chen Liu

Adaptive Companionship for Group-Following Robots: Handling Dynamically Changing Group Formations

Tests across five scenarios report 15 percent higher success and 25 percent fewer collisions than prior methods

abstract click to expand

Accompanying a group of humans is an essential aspect of developing human-like social cognition in robots. However, human groups typically do not follow fixed formations, which poses significant challenges for robots in maintaining natural companionship behaviors. In this paper, we propose an adaptive group-accompaniment method for social robots based on Vision-Language Models (VLMs), leveraging their semantic reasoning capabilities to infer companion positions, maintain social distances, and understand group dynamics. The members of the group are first detected, and a perceptual module generates visual representations of the interaction group space as input to the VLM, which is then combined with a Model Predictive Path Integral (MPPI) controller to ensure stability and safety. Experimental evaluations across five scenarios show that the proposed method enables robots to accompany the group effectively, demonstrating a 15\% improvement in success rate and a 25\% reduction in collision rate compared to baseline approaches. Additionally, a user study indicates that the generated companionship behaviors are perceived as natural and socially appropriate.

0

cs.CV 2026-07-02

Diagnose data or evaluation gap before recording new driving data

by Richard Schwarzkopf, Jonas Merkert +23 more

Creating Impactful Autonomous Driving Datasets: A Strategic Guide from Research Gap to Benchmark

A framework that selects the cheapest operators to close the identified blockage reduces unnecessary collection for labs with limited resour

abstract click to expand

Well-designed autonomous driving datasets have fundamentally shaped research progress, yet existing literature primarily describes what datasets contain rather than how to strategically design impactful ones. This is especially limiting for small and medium-sized labs and startups that cannot afford to misallocate scarce resources. We argue that impactful dataset creation begins with a diagnosis: whether a research question is blocked by a data problem or an evaluation problem, and proceeds by selecting the minimal data operator(s) that closes the resulting gap, recording new data only when no cheaper operator(s) suffices. We analyze the evolution of major autonomous driving (AD) datasets through this lens and distill a strategic framework spanning gap identification, operator choice, sensor suite design, and annotation strategy. We ground the framework in a running case study of our KITScenes dataset family. The datasets are available at: https://kitscenes.com/

0

cs.CV 2026-07-02

Three alignments lift robot mobile manipulation performance

by Ronghan Chen, Yandan Yang +19 more

ABot-M0.5: Unified Mobility-and-Manipulation World Action Model

Latent actions, dual transformers, and dream-forcing training reduce contact gaps and rollout errors on long tasks.

abstract click to expand

Mobile manipulation is a key capability for general-purpose robots, yet remains challenging for current embodied learning methods. VLA policies are typically reactive and lack explicit world modeling, while existing World Action Models (WAMs) are still poorly aligned with the structure of mobile manipulation: they operate on coarse video chunks, model entangled navigation-manipulation actions, and train inverse dynamics under supervision that does not match autoregressive inference. As a result, they often miss fine-grained contact dynamics, suffer from action-distribution conflicts, and accumulate errors over long-horizon rollouts. We propose ABot-M0.5, a new WAM built on the insight that mobile manipulation requires alignment at three levels: temporal granularity, action space, and train-test consistency. To align temporal granularity, we introduce intermediate latent actions that capture local visual state transitions and serve as an bridging action space between video latents and embodiment-specific controls. To align action space, we design a dual-level Mixture-of-Transformers architecture that disentangles both modality representations and heterogeneous action subspaces such as base movement and arm manipulation. To align inference conditions, we propose the dream-forcing training strategy that progressively trains inverse dynamics on model-predicted videos, improving train-test alignment and robustness during autoregressive prediction. Experiments on challenging mobile and fine-grained manipulation benchmarks demonstrate that ABot-M0.5 achieves state-of-the-art performance in both long-horizon task success and finegrained control accuracy. These results highlight the critical importance of granularity-aligned, action-disentangled, and inference-consistent world-action modeling.

0

cs.RO 2026-07-02

Simulated floods reveal route failures missed by static robot maps

by Su Ann Low, Cheng-Hsi Hsiao +4 more

Path Planning in Physically Viable World Models

Augmenting 3D reconstructions with terrain change models lets planners test long-horizon viability before deployment.

abstract click to expand

Robots deployed in unstructured outdoor environments often plan from scene reconstructions collected before deployment because operators cannot remap large or remote sites before every mission. As a result, robots must make long-horizon planning decisions using stale maps that assume the terrain remains unchanged, even though physical changes to the environment may render previously feasible routes unsafe or unreachable at execution time. We present a physically viable world model for evaluating what-if queries for robot navigation under future terrain change. The system augments reconstructed 3D Gaussian splat scenes with physics-based simulation to generate physically modified versions of the same environment without recollecting sensor data or rebuilding the map. We then implement a terrain-aware planner that accounts for physical events, obstacles, and deformations that are simulated by the world model. This allows robots and human operators to evaluate whether planned routes remain feasible before committing to a planned route, particularly in constrained environments where retreat or recovery may become impossible once conditions change. We evaluate the system on a real outdoor field site in Central Texas using simulated flooding across multiple severity levels. We measure route and mission feasibility as terrain conditions deteriorate under physically simulated interventions. Our results show that physically viable world models expose long-horizon route failures and rerouting behavior that are not apparent when planning only on the original reconstructed environment, allowing robots to evaluate how future terrain changes may affect route feasibility before deployment.

0

cs.RO 2026-07-02

One demonstration adapts VLA models to new cameras or arms

by Taewook Kang, Taeheon Kim +2 more

Domain Arithmetic: One-Shot VLA Adaptation under Environmental Shifts

DART isolates domain details via subspace alignment then adds them through weight arithmetic.

abstract click to expand

Vision-Language-Action (VLA) models often fail to perform the same learned tasks under environmental shifts, such as changes in camera pose and shifts to a different but similar robot (e.g., from Panda to UR5e). Adapting these models to the shifted environment (i.e., target domain) often requires training on multiple demonstrations for each task, which are costly to collect. To reduce the burden of data curation and training, we propose an analogy-based method that adapts VLA models under environmental shifts through weight vector arithmetic with domain-specific information addition, named Domain ARiThmetic (DART). Unlike prior approaches, DART requires collecting only a single demonstration, enabling efficient adaptation. To accurately isolate domain-specific information for addition, DART performs subspace alignment between singular components in weight vectors to filter out noisy components. In both simulated and real-world experiments, DART outperforms existing VLA adaptation methods in one-shot scenarios across diverse visual and embodiment shifts. Code is available at https://github.com/snumprlab/dart.

0

cs.RO 2026-07-02

Hierarchical RL controls only vertical speed for UAV landing on wave platforms

by Chun-Kit Li, Iok Long Sit +5 more

WaveLander: A Generalizable Hierarchical Control Framework for UAV Landing on Wave-Disturbed Platforms via Reinforcement Learning

Policy generalizes across unseen wave motions in simulation while low-level controller handles stability.

abstract click to expand

Autonomous landing of unmanned aerial vehicles (UAVs) on wave-disturbed marine platforms remains challenging due to stochastic platform motion, time-varying platform attitude, and uncertain touchdown conditions. Existing model-based methods often require accurate motion prediction and online optimization, while end-to-end learning approaches may suffer from high training complexity and limited interpretability. This paper presents WaveLander, a hierarchical control framework via reinforcement learning (RL) that decouples vertical landing decision-making from low-level flight stabilization. The RL policy maps a compact platform-relative observation to a scalar vertical velocity reference, while a conventional low-level flight controller maintains attitude stability and lateral tracking. This formulation reduces dynamic platform landing to a low-dimensional, timing-aware control problem and enables smooth landing behavior without explicit switching rules. Simulation results under randomized wave-induced platform motions show that WaveLander achieves robust landing performance and generalizes to unseen disturbance conditions, demonstrating the potential of hierarchical learning-based control for marine UAV recovery.

0

cs.RO 2026-07-02

Framework coordinates robot fleets in real time despite delays

by Bo Cao, Zhe Liu +1 more

From Real-Time Planning to Reliable Execution:Scalable Coordination for Heterogeneous Multi-Robot Fleets in Industrial Environments

SCALE reduces motion conflicts and adapts precedence graphs to limit waiting in dense industrial settings.

abstract click to expand

With the increasing deployment of heterogeneous robot fleets in industrial environments, efficient coordination remains a critical challenge. Real-time path planning must simultaneously accommodate high robot densities and heterogeneous motion capabilities, while communication delays, execution uncertainties, and other disturbances may cause robots to deviate from the temporal assumptions underlying planned paths. Such deviations can lead to excessive waiting and congestion propagation across the fleet. This paper presents SCALE, a reactive online coordination framework that enables real-time planning while maintaining robust execution. Within this framework, we introduce a motion-induced conflict reduction mechanism to support the online generation of feasible paths for online conflict resolution. To mitigate the effects of disturbances, we further design a generalized Conjugate Action-Precedence Hypergraph (CAPH) that adaptively adjusts precedence relations among robots. Extensive validation experiments, together with a three-day deployment in a warehouse, demonstrate the

0

cs.RO 2026-07-02

Spring-damper end-effector cuts contact force errors 25-41%

by Rahman Ardakanian, Iman Kardan +2 more

Enhancing Robustness in Robot-Environment Interactions through Passive Compliant Degrees of Freedom: A Hybrid Position-Force Control Approach with Feedback Linearization

Simulations of a 2-DOF arm show lower error variance and smoother torques when passive compliance is added to feedback-linearized hybrid con

abstract click to expand

Robot-environment interactions in dynamic or unstructured settings are often degraded by impact shocks, vibrations, and uncertainties in contact geometry and mechanical properties. This paper proposes an interaction architecture that combines feedback-linearized hybrid position-force control with a passive compliant degree of freedom embedded at the end-effector. Unlike conventional hybrid position-force control, which relies mainly on active feedback, force sensing, and gain tuning, the proposed architecture uses a physical spring-damper interface to store and dissipate impact energy at the contact point before high-frequency shocks propagate to the actuated joints and force-control loop. The approach is evaluated in MATLAB/Simulink on a 2-DOF planar manipulator with three end-effector configurations: rigid, spring-only, and spring-damper. Results under fixed and time-varying interaction conditions show that the spring-damper configuration provides stronger attenuation of contact-induced oscillations, lower force and velocity error variance, and smoother joint-torque response. Representative reductions include 36.5% in fixed-environment tangential force-error standard deviation, 25.4% in variable-environment normal force-error standard deviation, and 41.1% in variable-environment normal velocity-error standard deviation.

0

cs.RO 2026-07-02

Robot advances 30 mm in soil after three worm-like gait cycles

by Lina van Brügge, Shruti Kotpalliwar +3 more

[Preprint] Dynamic Modeling, Gait Synthesis, and Control of a Novel Subsurface Bore Propagator

Modular design with anchor and drill modules achieves incremental propulsion in physics simulation for subsurface tasks.

abstract click to expand

In this article, we present dynamic modeling, gait synthesis, and feedback control design for a modular novel subsurface robot, designed for human-free subsurface exploration and excavation. The subsurface propagator design is based on two major aspects: 1) anchor and propel movement like an earthworm and 2) excavation similar to tunnel boring machines. This design is decoupled into five separate modules: one drill head to excavate and create cavity for propagation, two modules to anchor the robot, and two modules to enable propagation of the body. In order to design a controller for each of the modules, dynamic models using the Euler-Lagrange framework are developed. These mathematical models are used as a baseline to design controlled decoupled operation of the different joint movements. The operation of robotic assembly is constructed via a centralized state machine for gait synthesis with integration of the designed feedback controller. The controllers are tested on the real robot geometry to aid sim-to-real integration: A physics-based Unity simulation using a CAD model of the robot and integration of the trained controller via ROS verifies the performance of the robot. The experimental results demonstrate that the proposed design, controllers and the gait synthesis strategy together are capable of anchoring the robot in place and creating an total advancement of 30\,mm into the soil after completing 3 gait cycles.

0

cs.RO 2026-07-02

Spatiotemporal tubes let robots follow demos on unknown dynamics

by Ratnangshu Das, Puneeth Shankar +3 more

Learning from Demonstration via Spatiotemporal Tubes for Unknown Euler-Lagrange Systems

Heteroscedastic GPs turn demonstrations into time-varying safety envelopes enforced by closed-form control without system identification.

abstract click to expand

We present STT-LfD, a unified Learning from Demonstration (LfD) framework that integrates motion learning with control for unknown Euler-Lagrange systems. Unlike traditional decoupled approaches that track a fixed reference, the proposed method treats demonstrations as a data-driven safety specification. Using heteroscedastic Gaussian Processes, STT-LfD learns Spatiotemporal Tubes (STTs) as an intent envelope that capture time-varying precision requirements of a task. A closed-form feedback controller then enforces these learned constraints while respecting actuator limits, without requiring explicit system identification. The approach preserves the temporal structure of demonstrations, remains computationally efficient, and avoids explicit system identification. Hardware experiments on a mobile robot and a 7-DOF manipulator show that it outperforms baselines in robustness to disturbances and computational speed.

0

cs.RO 2026-07-02

15-point robot success gain registers with users

by Jian Song, Tian Zi +1 more

From Technical Metrics to User Perception: A User Study of a Multimodal Human-Robot Interaction System for Object Detection and Grasping

70 percent of participants preferred the version with upgraded detection and language modules

abstract click to expand

Improvements in the technical performance of human--robot interaction (HRI) systems do not automatically translate into differences that human users can detect during live interaction. This paper investigates whether a 15 percentage point gain in end-to-end task success (from 75% in a multimodal baseline system to 90% in an improved configuration identified through a prior ablation study) is sufficient to produce consistent and measurable differences in user perception. The baseline system combines Whisper for speech recognition, Florence-2 for open-vocabulary object detection, LLaMA 3.1 for action extraction, and an interval Type-2 fuzzy logic controller for motion execution. The improved configuration replaces the perception and language modules with Grounding DINO + SAM and Qwen 3.5 9B, respectively, while retaining the same controller. A within-subject user study with 24 participants compared both systems on the same tabletop object-grasping task. After interacting with each configuration, participants rated perceived speed, reliability, and overall competence and fluency on a 7-point Likert scale. Results show that 17 out of 24 participants (70.83%) preferred the improved system (exact binomial test, p = 0.043, h = 0.43), and all three perceptual constructs were rated significantly higher for the improved configuration after Holm correction, with large to very large effect sizes (p < 0.001). These findings confirm that the identified technical improvements are perceptible to users in direct interaction and underscore the importance of complementing benchmark evaluation with user-centred evidence when assessing robotic manipulation pipelines.

0