cs.CV — Pith

Top Pith

1

cs.GR 2026-05-29

3D scene graph plans portraits before shutter click

by Ruixiang Jiang, Chang Wen Chen

Before the Shutter: Aesthetic and Actionable Portrait Photography Planning in 3D Scenes

Generates human pose, camera, and lighting setups that raters prefer over post-capture baselines while staying physically valid.

abstract click to expand

Portrait photography is largely decided before the shutter opens: the subject's pose, the camera configuration, and the lighting devices must be coordinated within the surrounding 3D scene. In contrast, most existing computational methods focus on post-production in 2D image space, such as retouching, relighting, or editing images that already exist; pre-capture photographic planning remains largely unexplored. We introduce 3D aesthetic portrait planning, the task of generating human pose, camera, lighting, and exposure plans that produce visually compelling portraits while satisfying geometric and photometric feasibility in a 3D scene. Our approach builds a Photographic Scene Graph that represents scene affordances, subject-scene relations, and portrait-relevant lighting structure. Built on this representation, we perform aesthetic-guided comparative planning over previous attempts and current viewfinder observations. Experiments across diverse indoor and outdoor scenes show that our method produces portraits preferred by human raters and MLLM evaluators over competitive baselines, while maintaining high physical plausibility. Together, our results suggest a path from post-capture correction toward pre-capture computational portrait planning. Project repository: https://github.com/songrise/Before-the-Shutter

0

Top Pith

4

cs.CV 2026-05-25 2 theorems

Transformer predicts saliency from event camera streams

by Romaric Mazna, Jean Martinet +1 more

Exploring deep learning for Event-Based Saliency Prediction with a Transformer-based model

Trained on synthetic data from RGB benchmarks, SEST beats prior event methods and works on real streams without retraining.

abstract click to expand

Saliency prediction has been extensively studied in RGB images and videos as a computational model of human visual attention. In contrast, predicting saliency from event-based data remains largely unexplored, despite the biological inspiration and favorable sensing properties of event cameras. Two obstacles have held this direction back: the absence of large-scale event saliency datasets, and the lack of a strong baseline. In this paper, we introduce SEST (Swin Event-based Saliency Transformer), a transformer-based model for saliency prediction from event data, bridging the data scarcity barrier through event-native pretraining and synthetic supervision. SEST leverages a self-supervised pretrained event-based Swin Transformer backbone combined with a lightweight CNN decoder to produce dynamic saliency maps. To address the scarcity of annotated event-based saliency data, we introduce two new benchmark datasets, N-DHF1K and N-UCF Sports, generated from large-scale RGB saliency benchmarks. Experimental results show that SEST clearly outperforms existing event-based saliency methods and narrows the performance gap with state-of-the-art RGB models. Zero-shot evaluation on a real event camera dataset further demonstrates that our model trained on synthetic data remains transferable on real event streams. To the best of our knowledge, this work is the first to apply deep learning to event-based saliency prediction, opening a new research direction at the intersection of event-based vision and neuromorphic visual attention.

0

Top Pith

3

cs.CV 2026-05-21 2 theorems

Context rewrite lifts 3D grounding accuracy by up to 22 points

by Anna Deichler, Jim O'Regan +5 more

MM-Conv: A Multimodal Dataset and Benchmark for Context-Aware Grounding in 3D Dialogue

Two-stage pipeline resolves dialogue ambiguity before detecting objects in dynamic VR scenes

abstract click to expand

Grounding language in the physical world requires AI systems to interpret references that emerge dynamically during conversation. While current vision-language models (VLMs) excel at static image tasks, they struggle to resolve ambiguous expressions in spontaneous, multi-turn dialogue. We address this gap by introducing (1) a benchmark for referential communication in dynamic 3D environments, built from 6.7 hours of egocentric VR interaction with synchronized speech, motion, gaze, and 3D scene geometry, and (2) a two-stage grounding pipeline that explicitly resolves conversational ambiguity before visual localization. The benchmark includes over 4,200 manually verified referring expressions spanning full, partitive, and pronominal types. Our contextual rewriting approach improves grounding performance by 11-22 percentage points on average, with a pure detector (GroundingDINO) reaching 56.7% on pronominals after rewriting, nearly double the best end-to-end baseline. Results demonstrate that decoupling linguistic reasoning from visual perception is more effective than end-to-end approaches for conversational grounding.

1 0

0

cs.CV 2026-07-03

LLM-planned 3D paths keep object identities stable in long videos

by Hanlin Wang, Hao Ouyang +11 more

WorldDirector: Building Controllable World Simulators with Persistent Dynamic Memory

By separating motion planning from rendering, the system maintains exact visual consistency even after objects leave the frame for extended

abstract click to expand

We present WorldDirector, a highly controllable video world model framework designed for persistent dynamic object memory and unrestricted viewpoint exploration. Unlike existing world models that entangle physical dynamics with pixel rendering and rely on continuous visual observation to sustain motion, our framework explicitly decouples semantic motion orchestration from visual generation. By leveraging an LLM to coordinate 3D trajectories with camera movements and subsequently employing these orchestrated trajectories as control signals for video generation, our approach ensures strict physical logic and appearance stability, successfully preserving the exact visual identities of dynamic entities even when they re-enter the scene after prolonged periods out of view. Experimental results demonstrate that our method supports the synthesis of complex and extended events with unprecedented controllability and persistent dynamic object memory. Project Page: https://worlddirector.github.io/

0

cs.CV 2026-07-03

Three alignments convert any input into consistent 4D

by Qiaowei Miao, Kehan Li +2 more

Alignment Is All You Need For X-to-4D Generation

Align4D matches video and multiview priors through distance search, joint constraints, and decoupled training.

abstract click to expand

Generative diffusion models excel at synthesizing high-quality images, videos, and 3D content under multimodal control. However, arbitrary user-defined modality-to-4D (X-to-4D) generation remains challenging due to the high cost of constructing diverse datasets and the limited scalability of existing methods. This paper presents Align4D, a flexible framework that translates any-modal input into coherent video-3D pairs, using video to guide 4D motion and 3D data to shape 4D geometry. Align4D introduces three key techniques: (1) Object Distance Alignment, which searches Video-Aligned and Multiview-Aligned Object Distances (VAOD/MAOD), respectively, to reconcile 4D renderings with video and the priors of multiview diffusion models; (2) Motion-Geometry Joint Alignment, which constrains known and unknown views through synchronized video and 3D inputs, ensuring consistent 4D generation; and (3) Asynchronous Optimization, which decouples Gaussian attribute and deformation network training to enhance motion and geometry fidelity. We further propose the X4D dataset, which integrates prompt, image, video, and 3D data for benchmarking. Experiments on X4D and Consistent4D demonstrate that Align4D achieves state-of-the-art quality and consistency in X-to-4D generation. Project page: https://miaoqiaowei.github.io/Align4D/.

0

cs.CV 2026-07-03

Plain pixel diffusion on point maps beats latent models for 3D geometry

by Haofei Xu, Rundi Wu +8 more

PointDiT: Pixel-Space Diffusion for Monocular Geometry Estimation

A ViT trained from scratch directly on raw patches yields sharper results without tokenizers or hybrid losses.

abstract click to expand

State-of-the-art single-image 3D reconstruction methods often rely on complex hybrid architectures and loss functions, or compress geometry into latent spaces in order to leverage pre-trained latent diffusion models. In this work, we show that such architectural overhead and intricate loss formulations are unnecessary. We introduce a minimalist pixel-space Diffusion Transformer, built on a plain ViT, that operates directly on raw 3D point map patches and is conditioned on image tokens from a pre-trained DINOv3. Unlike existing latent diffusion approaches, we train our diffusion backbone entirely from scratch, eliminating the need for point map tokenizers. Despite its simplicity, our approach surpasses complex latent-based diffusion models while remaining significantly simpler than hybrid alternatives. Notably, it produces sharper geometric structure and is more robust in highly ambiguous regions, such as transparent objects.

0

cs.CV 2026-07-03

Blocking cross-noise attention leaves or raises diffusion model performance

by Dengyang Jiang, Mengmeng Wang +2 more

From SRA to Self-Flow: Data Augmentation or Self-Supervision?

Dual-timestep inputs without interactions between noise levels match or beat full Self-Flow, pointing to augmentation as the driver.

abstract click to expand

Representation alignment has become an effective way to accelerate diffusion transformer training and improve generation quality. Recent self-alignment methods, such as SRA and Self-Flow, further remove the dependency on external pretrained encoders by constructing alignment within the diffusion model itself. However, the mechanism behind the improvement from SRA to Self-Flow, dual-time scheduling, remains under-examined: Self-Flow attributes its gain to interactions between tokens at different noise levels, where cleaner tokens help infer noisier ones. In this work, we revisit this explanation and ask whether the gain instead comes from data augmentation along the noise dimension. To disentangle these factors, we introduce Attention Separation, which preserves the same dual-timestep input as Self-Flow while blocking attention between tokens assigned to different noise levels. Surprisingly, removing such interaction does not degrade performance and can even improve it, suggesting that the improvement from SRA to Self-Flow mainly comes from data augmentation. Furthermore,We show that Attention Separation itself provides an augmentation effect by splitting a single image into multiple effective training parts to expand the training data. Based on these observations, we combine self-representation alignment with dual-timestep and attention-separation augmentation, and demonstrate the effectiveness of this design on ImageNet.

0

cs.CL 2026-07-03

Reasoning model boosts speaker ID accuracy in TV dramas

by Yuxuan Li, Lingxi Xie +7 more

Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas

Multimodal tool-use lets it handle short lines where audio alone fails, on a new 532K-line benchmark.

abstract click to expand

Long-form TV dramas present a formidable challenge for comprehensive video understanding, where deciphering complex storyline often relies on \textbf{speaker recognition}, the task of accurately attributing each spoken utterance to its respective character. In this paper, we advance this field through two primary contributions. (1) We introduce \textbf{DramaSR-532K}, a large-scale benchmark comprising 532K annotated dialogue lines across more than 900 unique characters, necessitating the integration of auditory, linguistic, and visual cues for speaker recognition. (2) We propose \textbf{DramaSR-LRM}, a robust approach built upon a large reasoning model (LRM). DramaSR-LRM is designed to autonomously aggregate contextual evidence via multimodal tool-use, synthesizing diverse inputs to achieve high-fidelity attribution. Experimental results demonstrate that DramaSR-LRM significantly outperforms existing baselines, particularly on short utterances where acoustic biometrics are inherently unreliable. \textit{All the data and code will be made publicly available at the project page: https://www.github.com/198808xc/DramaSR-LRM.}

0

cs.RO 2026-07-03

C++ runtime runs embodied models on any robot hardware

by Ling Xu, Chuyu Han +7 more

Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots

Five-layer abstraction replaces per-model Python stacks while preserving 91-100 percent task success and cutting memory use by more than two

abstract click to expand

Embodied AI models now span vision-language-action (VLA) models and world-action models (WAMs), but practical deployment remains fragmented across model-specific Python stacks, backend assumptions, and robot-side glue code, especially on heterogeneous edge devices. Existing inference runtimes are designed mainly for request-response serving and therefore do not satisfy the runtime contract of embodied deployment: multi-rate execution inside closed-loop control, latency-first batch-1 inference on heterogeneous hardware, and extensible embodied interfaces beyond fixed token I/O. We present Embodied.cpp, a portable C++ inference runtime for embodied models. Based on an architectural analysis of representative VLA models and WAMs, Embodied.cpp captures a shared execution path and organizes it into five layers: input adapters, sequence builders, backbone execution, head plugins, and deployment adapters. The runtime provides modular multi-rate execution, latency-first fused inference, and extensible operator and I/O support, enabling deployment across heterogeneous devices, robots, and simulators through one backend abstraction. We evaluate Embodied.cpp on two VLA models, HY-VLA and pi0.5, and on a preliminary WAM benchmark using a LingBot-VA Transformer block. The VLA deployments achieve successful closed-loop execution with 100.0% and 91.0% task success rates, respectively. The WAM benchmark reduces block memory from 312.2 MiB to 88.1 MiB. These results show that Embodied.cpp improves deployment efficiency while preserving high accuracy across diverse embodied model architectures.

0

cs.CV 2026-07-03

Spatial memory guides efficient 360° object search and segmentation

by Song Tang, Shuming Hu +3 more

Seek to Segment: Active Perception for Panoramic Referring Segmentation

PanoSeeker folds narrow views into one panorama so an agent can avoid repeat looks and align for an accurate mask.

abstract click to expand

Existing referring segmentation models passively process static images captured from fixed perspectives, limiting their applicability in Embodied AI, where agents must perform active perception in the continuous 360$^\circ$ environments. To bridge this gap, we introduce a novel task: Active Panoramic Referring Segmentation (APRS). In this setting, an agent is required to adjust its viewing direction ($\Delta\theta, \Delta\phi$) to explore the 360$^\circ$ environment, seeking the object specified by a user instruction for segmentation. To tackle this challenging task, we propose PanoSeeker, a memory-augmented agent for efficient APRS. Rather than relying on heuristic scanning, PanoSeeker integrates a Vision-Language Model (VLM) with EgoSphere, an explicit spatial visual memory. By progressively integrating sequential local observations into a unified 360$^\circ$ representation, EgoSphere enables the agent to plan efficient and non-redundant search trajectories. Once the target is found, the agent performs active viewpoint alignment and outputs the segmentation mask. Furthermore, we curate an expert-annotated search trajectory dataset with memory timelines for Supervised Fine-Tuning, followed by Reinforcement Learning post-training to explicitly optimize PanoSeeker's exploration efficiency. Extensive experiments on our newly established APRS benchmark demonstrate that PanoSeeker achieves superior search efficiency and segmentation accuracy, significantly outperforming adapted state-of-the-art baselines.

0

cs.CV 2026-07-03

Tuning a few attention heads blocks text attacks on vision models

by Bohan Liu, Wenqian Ye +4 more

Towards Robustness against Typographic Attack with Training-free Concept Localization

Sampling attribution finds lexical-encoding circuits in ViT; direct weight adjustments raise accuracy on attacked images without retraining.

abstract click to expand

Models trained via Contrastive Language-Image Pretraining (CLIP) serve as the foundational vision encoders for most modern Large Vision Language Models (LVLMs). Despite their widespread adoption, CLIP models exhibit a critical yet underexplored failure mode: irrelevant text appearing within images confounds visual representations, biasing them toward lexical meaning rather than true visual semantics. This robustness issue, commonly described as a Typographic Attack (TA), exposes a vulnerability that poses a significant risk to safety-critical applications such as autonomous driving. To achieve interpretable and effective robustness against TA, we propose a novel, training-free mechanistic interpretability method. Our method provides sampling-based interpretations of hidden state representations and quantitatively attributes semantic versus lexical focus to individual attention heads. Through probabilistic analysis and circuit mining, we isolate specific Vision Transformer (ViT) components that disproportionately encode lexical information, thereby identifying the mechanistic source of TA. We further show that simple interventions applied directly to the identified circuits, without any additional training, can substantially improve robustness against Typographic Attacks in object classification. These interventions, such as selective adjustment of attention weights, also outperform both supervised and training-free defense methods. Our experiments demonstrate that applying the proposed intervention to the vision encoders of several state-of-the-art LVLMs yields substantial gains in Visual Question Answering accuracy under Typographic Attack interference on RIO-Bench. These results confirm both the efficacy and the generalizability of our mechanistic approach. Code is released at https://github.com/Liu-524/SamplingTAR.

0

cs.CL 2026-07-03

Masked prefixes and replay boost VLM accuracy on new images

by Liyan Tang, Fangcong Yin +1 more

Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning

RL training forces models to correct errors using visual evidence instead of text patterns alone.

abstract click to expand

Large vision-language models can reason over multimodal inputs by generating textual chains of thought (CoT). A key capability exhibited in CoT reasoning is self-reflection: revisiting earlier decisions and correcting previous errors. However, existing LVLMs often fail to properly attend to visual inputs during reflection, limiting their ability to translate feedback into grounded corrections, especially for out-of-distribution images. To address this issue, we propose a novel reinforcement learning training framework VRRL, with two components explicitly designed to elicit visually grounded self-reflection. First, we randomly mask trajectory prefixes during training to emphasize recovery from incorrect intermediate predictions rather than making early mistakes. Second, we introduce buffered roll-ins from an experience replay buffer to expose the model to diverse failure states that it must learn to correct. We evaluate our approach on visual grounding tasks involving tables and charts, as well as spatial navigation benchmarks. While off-the-shelf and conventionally fine-tuned models degrade substantially under distribution shift, our method substantially improves average out-of-distribution accuracy over standard RL and reflection-oriented fine-tuning baselines by using self-reflection effectively.

0

cs.CV 2026-07-03

Descriptor-free localization cuts rotation error by 89%

by Yejun Zhang, Xinjue Wang +3 more

GeoMix: Descriptor-Free Visual Localization via Global Context and Multi-Detector Training

GeoMix adds global context nodes and trains across detectors to narrow the gap with appearance-based methods while preserving privacy and st

abstract click to expand

Descriptor-free visual localization eliminates high-dimensional descriptor storage, preserves scene privacy, and simplifies map maintenance, yet its accuracy still lags far behind descriptor-based pipelines. We identify this gap to insufficient geometric discriminability in geometry-only matching. Without visual appearance, current methods underutilize local geometry cues, lack the global context among keypoints, and overfit to a single keypoint detector. We further observe that descriptor-free matching naturally enables multi-detector training, as heterogeneous keypoints can be optimized in a shared geometry-only space without aligning descriptor spaces. Building on these insights, we propose GeoMix, a descriptor-free 2D-3D matching framework that strengthens geometric discriminability at three levels. Locally, directional and distance-aware embeddings enrich neighborhood aggregation with fine-grained spatial structure. Globally, learnable context nodes aggregate and redistribute scene-wide information via cross-attention to resolve ambiguities beyond local receptive fields. At the training level, Mix-Training exploits this detector-agnostic geometry space to learn representations across multiple keypoint detectors. Extensive experiments on MegaDepth, Cambridge Landmarks, 7Scenes, and Aachen Day-Night show that GeoMix sets a new state of the art among descriptor-free methods, reducing 75th-percentile rotation error by 89\% and translation error by up to 90\% over the previous best, while generalizing zero-shot to unseen detectors and narrowing the gap to descriptor-based pipelines. Code is available at $\href{https://github.com/YejunZhang/Geomix}{\text{this links}}$.

0

cs.CV 2026-07-03

Entropy filtering removes noise so VLMs keep detail with fewer tokens

by Xuehui Wang, Xuankun Yang +1 more

Combating Textual Noise and Redundancy: Entropy-Aware Dense Visual Token Pruning

Vision-language models preserve fine-grained cues under strict token budgets by cleaning scores first then applying spatially-aware submodul

abstract click to expand

Visual token pruning is a crucial strategy for accelerating VLMs by compressing redundant image patches, yet existing methods often fail to preserve critical cues under dense instructions and fine-grained queries. In this paper, we investigate this failure and identify two underlying bottlenecks: the widespread dispersion of textual noise that corrupts dense cross-modal scoring, and the feature fragmentation inherent to standard token selection. To address these issues, we propose Entropy-Aware Dense Pruning (EADP), a framework that reformulates pruning as a structured compression problem. EADP first leverages statistical entropy to quantify and filter out textual noise, yielding a robust, fine-grained instruction relevance score. Subsequently, instead of naive Top-K selection, EADP casts token selection as a submodular maximization problem with a spatial prior, explicitly ensuring a holistic and non-redundant visual representation. Extensive experiments demonstrate that EADP improves the accuracy-efficiency trade-off of VLMs, robustly preserving fine-grained visual cues under strict token budgets while achieving SoTA performance on challenging multimodal benchmarks.

0

cs.CV 2026-07-03

Global priors raise 360° search accuracy nearly eightfold

by Jingtao Xu, Zizhuo Lin +3 more

EAGLE-360: Embodied Active Global-to-Local Exploration in 360^circ

Starting from a full panoramic view and narrowing locally replaces myopic scans and improves error recovery in wrapped environments.

abstract click to expand

While Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in standard visual understanding, adapting them for active visual search in 360$^\circ$ panoramic environments exposes fundamental limitations. Specifically, standard MLLMs struggle to effectively model inherent panoramic properties, such as severe polar distortion and continuous cylindrical topologies, which significantly degrades target detection accuracy. Consequently, existing panoramic search methods attempt to compensate by relying heavily on fragmented local viewpoints. Burdened by rigid initialization and a lack of global panoramic priors, these approaches suffer from myopic, inefficient exploration and struggle with robust error recovery when targets are out of view. To overcome these challenges, we propose EAGLE-360, a novel Embodied Active Global-to-Local Exploration framework. Rather than performing exhaustive local searches, EAGLE-360 leverages global priors to establish an initial holistic perspective, iteratively reasoning and progressively narrowing the search space. Architecturally, we adapt RoPE Rolling, a coordinate-shifting positional encoding mechanism, to seamlessly model the continuous topologies of panoramas. To facilitate this paradigm, we construct the large-scale EAGLE-360 dataset, comprising 14,000+ 4K panoramas and 70,000+ rounds of high-quality VQA dialogues. By employing a training pipeline that integrates Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), we effectively elicit complex spatial reasoning and tool-calling capabilities. Extensive experiments demonstrate that EAGLE-360 establishes a new state-of-the-art for 360$^\circ$ visual search, achieving nearly an 8-fold increase in accuracy over the base model while significantly enhancing exploration efficiency.

0

cs.CV 2026-07-03

Anchored flow removes clouds while keeping semantics intact

by Ziyao Wang, Maonan Wang +6 more

Interpretation-Oriented Cloud Removal via Observation-Anchored Residual Flow with Geo-Contextual Alignment

By tying reconstruction to the original observation and a vision-model manifold, the method raises both image quality and accuracy on segmen

abstract click to expand

Cloud removal (CR) is essential for optical remote sensing, serving as a prerequisite for reliable downstream interpretation, such as semantic segmentation and change detection. However, existing CR approaches often prioritize visual realism while overlooking their impact on subsequent analytical tasks, leading to semantic drift and degraded downstream performance. To address this issue, we propose Geo-Anchored Cloud Removal (GACR), a unified framework that jointly ensures faithful reconstruction and robust interpretability. At its core, GACR incorporates Observation-Anchored Residual Flow (OAR-Flow), which reformulates CR as a physically grounded residual inversion process. By anchoring the generative trajectory to the cloudy observation rather than pure noise, OAR-Flow enables fast, stable, and faithful reconstruction. To further preserve semantic structures critical for downstream interpretation, GACR integrates Geo-Contextual Prior Alignment (GCPA) to constrain the reconstruction within a semantic manifold induced by a Vision Foundation Model (VFM). Consequently, GACR strictly maintains the spatial-semantic integrity of complex landscapes. Extensive experiments across six CR datasets and twelve downstream tasks demonstrate that GACR produces superior reconstruction quality while consistently improving downstream task accuracy. The code is available at https://github.com/wzy6055/GACR.

0

cs.CV 2026-07-03

Rotation fixes one codebook for all DiT steps

by Donghyun Lee, Jitesh Chavan +6 more

OrbitQuant: Data-Agnostic Quantization for Image and Video Diffusion Transformers

A randomized permuted block-Hadamard transform creates a data-independent basis that serves every timestep and modality.

abstract click to expand

Diffusion transformers (DiTs) achieve state-of-the-art image and video generation, but their multi-step sampling and growing parameter count make inference expensive. Post-training quantization (PTQ) is the natural remedy, yet DiT activations shift across timesteps, prompts, and guidance branches, forcing prior methods to re-fit calibration data for every new checkpoint or modality. We present OrbitQuant, a data-agnostic weight-activation quantizer that bypasses range estimation by quantizing in a normalized, rotated basis. In this basis, a randomized permuted block-Hadamard (RPBH) rotation concentrates each coordinate around one fixed, known marginal regardless of the input, so a single Lloyd-Max codebook serves all timesteps, prompts, and layers of a given input dimension. We extend the same quantizer to weight rows offline, absorbing the rotation into the weights so that it cancels inside each linear layer and only a forward rotation on the activations remains at runtime. The same recipe transfers from image to video with no per-modality tuning. Across FLUX.1, Z-Image-Turbo, Wan 2.1, and CogVideoX, it sets the state of the art for PTQ at several low-bit settings. It also pushes PTQ of image diffusion transformers to W2A4 with usable generation quality.

0

cs.CV 2026-07-03

Multi-expert model cuts OOD false positives on medical scans

by A.S. Anudeep, Vaanathi Sundaresan

MARVEL: Margin-Aware Robust von Mises-Fischer Expert Learning for Long-Tailed Out-of-Distribution Detection

Margin-aware nonlinear von Mises-Fisher experts plus outlier specialist achieve up to 37 percent lower error rates on three datasets.

abstract click to expand

For clinical deployment, it is essential that automated diagnostic systems remain reliable when confronted with previously unseen cases, yet deep models routinely misclassify out-of-distribution (OOD) inputs with high confidence, underscoring the need for more robust OOD detection methods. Although substantial effort has been devoted to improving model robustness, most of the existing literature assumes balanced datasets, evaluates OOD detection on coarse or non-clinical OOD sources, or lacks comprehensive assessment across diverse OOD scenarios. To address the gaps, we propose a novel methodology trained on diverse and imbalanced medical datasets and evaluated across a clinically reflective OOD spectrum. Our framework comprises three key components: (1) a Nonlinear von Mises-Fisher (NvMF) classifier capable of learning non-linear decision boundaries, with theoretical proof of its asymptotic connection to cosine classifiers; (2) a multi-expert framework in which margin-aware NvMF classifiers specialise in different regions of label distribution to better handle imbalance; and (3) an outlier expert trained explicitly to distinguish inlier from outlier data, thereby strengthening OOD detection. Evaluation on RFMiD, ISIC2019, and NCTCRC datasets demonstrates consistent improvements over state-of-the-art methods, achieving mean FPR95 reductions of 8.45%, 13.02%, and 36.90% respectively. These gains are further supported by comprehensive ablations that validated the contributions of each component. This enables reliable identification of unfamiliar cases for deferral to clinicians, supporting safer AI-assisted diagnosis in real-world workflows. Our code is available at https://github.com/redboxup/MARVEL.

0

eess.IV 2026-07-03

Self-auditing drift model leads SSIM in accelerated knee MRI

by Qing Lyu, Jianxu Wang +3 more

Self-Auditing Residual Drifting for Pathology-Preserving Accelerated Knee MRI

It adds per-slice risk scores that flag unreliable outputs while preserving lesion detail at high acceleration.

abstract click to expand

Accelerated magnetic resonance imaging reduces acquisition time, but reconstruction from undersampled k-space can blur diagnostically relevant structures or introduce failures that are not captured by global image metrics. We propose SA-RDM-DC, a Self-Auditing Residual generative Drifting Model with Data Consistency for accelerated knee MRI. The method adapts the newly proposed generative drifting paradigm to accelerated MRI by training a physics-conditioned drift field from the zero-filled reconstruction toward the fully sampled residual correction. It predicts image- and missing-k-space residual corrections, enforces data consistency with acquired k-space, uses frequency-aware and residual drifting supervision to recover fine detail, and produces dense error maps and slice-level risk scores in the same inference pass. We evaluate SA-RDM-DC on multi-coil fastMRI knee data at acceleration factors of 4, 8, and 12, with fastMRI+ pathology annotations for region-level and classifier-based task preservation, and on SKM-TEA for zero-shot and fine-tuned protocol-shift evaluation. Compared with zero-filled reconstruction, UNet-image-SENSE, DC-UNet, Score-Diffusion, ELF-Diff, SENSE-VarNet, and MoDL baselines, SA-RDM-DC achieves the highest SSIM across fastMRI acceleration factors while retaining subsecond per-slice inference and avoiding the long sampling time of iterative diffusion baselines. In pathology-aware analysis, SA-RDM-DC preserves lesion-region structural fidelity and reduces meniscus prediction instability. Its self-auditing scores strongly identify high-error reconstructions on fastMRI and partially transfer as a selective-review signal under SKM-TEA protocol shift. These results support reconstruction evaluation that jointly considers image fidelity, pathology preservation, runtime, and case-specific reliability.

0

cs.CV 2026-07-03

Scene graphs forecast how actions reshape environments

by Francesca Pistilli, Simone Alberto Peirone +1 more

Learning to Evolve Scenes: Reasoning about Human Activities with Scene Graphs

A graph model on evolving scene graphs from first-person videos beats video baselines on retrieval and long-horizon reasoning tasks.

abstract click to expand

Understanding human behavior while interacting with the surrounding world is crucial for many applications of embodied AI. First-person videos are particularly informative for this problem, as they well capture how activities reshape the scene over time. However, existing approaches often rely on implicit visual or language-aligned representations, disregarding structured reasoning over the scene dynamic. We argue that explicit, compositional and editable representations of human-environment interactions can play a crucial role for rich grounded activity understanding. To this end, we introduce SG-Ego, a large scale annotation set extending Ego4D with spatio-temporal scene graphs, where relations triplets are consolidated over time into explicit time-evolving descriptions of the scene state. To reason over this representation, we propose GLEN, a graph-based model that operates over scene graph sequences to both align them with textual actions and model their temporal evolution. In addition, we formulate the activity-driven graph-edit forecasting (A-GEF) problem, a novel task that casts scene dynamics as a sequence of structured transformations conditioned on ongoing actions, enabling explicit reasoning about how scenes change over time. We validate our approach across multiple downstream tasks, spanning retrieval benchmarks as EgoMCQ and EgoCVR, as well as long-horizon reasoning benchmarks as EXPLORE-Bench and the newly introduced A-GEF. GLEN achieves strong results compared to raw video baselines and it excels in reasoning settings, typically addressed only with MLLMs, while enabling controllable and structured predictions of scene dynamics driven by human activities. We believe our results establish spatio-temporal scene graphs, together with models that reason over them, as strong compositional and interpretable representations for video understanding and potentially beyond.

0

cs.CV 2026-07-03

Wavelet compensation strengthens global edits in inversion-free image editing

by Anqi Tang, Wenhao Sun +1 more

Wavelet-Guided Semantic Signal Compensation for Inversion-Free Image Editing

The frequency-aware strategy boosts early semantic signals while keeping background structures intact.

abstract click to expand

Text-guided image editing aims to modify visual content according to a target prompt while preserving the background. Recent inversion-free image editing frameworks such as FlowEdit have demonstrated strong editing capability without requiring inversion. Empirically, FlowEdit can achieve substantial semantic changes under appropriate hyperparameter settings. However, we observe that under certain global attribute shifts, the editing trajectory may not effectively move away from the source distribution in the early timesteps. Our analysis suggests that in the high-noise regime, the dominant manifold-seeking flow toward the data manifold can reduce the influence of the text-conditioned direction, leading to limited global modification while background structures remain only moderately preserved. Inspired by this observation, we propose an inversion-free, frequency-aware semantic compensation strategy that strengthens the effective signal in the early stage of generation, while maintaining structural consistency in the background. The proposed method improves global editing capacity without sacrificing background fidelity.

0

cs.RO 2026-07-03

Model learns intent-driven camera poses from passive video

by Boyang Sun, Jiajie Li +7 more

LIME: Learning Intent-aware Camera Motion from Egocentric Video

LIME mines language intents and view gains from egocentric recordings to train robots on choosing next viewpoints.

abstract click to expand

Autonomous robots often need to move their camera before they can act: to inspect an object, reveal an occluded region, or obtain a view that responds to a user's intent. While vision-language navigation translates instructions to base motion and vision-language-action policies map instructions to manipulation actions, language-conditioned camera motion remains comparatively underexplored as a first-class action. We formulate language-conditioned camera motion generation: given a current RGB observation and a free-form natural-language intent, predict a relative target camera pose for the next observation. This task is inherently non-trivial: viewpoint changes are driven by latent perceptual intentions, and a valid motion may operate at different semantic granularity, from entering a room to looking around a corner, inspecting a visible object, or revealing an occluded detail. To model this structure, we mine multi-intention camera-motion supervision from egocentric video, pairing plausible intents and observation-gain descriptions with relative SE(3) target poses. We propose LIME, a vision-language camera-motion generator that combines an auto-regressive observation-gain output with a continuous flow-matching pose head. This design lets the model jointly predict what the next view should reveal while representing multi-hypothesis target views. Across experiments and downstream robotic tasks, we show that LIME can learn to actively choose camera poses from passive human video, turning ordinary egocentric recordings into supervision for intent-aware active perception.

0

cs.AI 2026-07-03

SPG-Layout generates plausible 3D rooms in angled spaces

by Xianhui Meng, Zirui Song +11 more

Text-Driven 3D Indoor Scene Synthesis in Non-Manhattan Environments

Statistical priors and large-object-first ordering reduce violations when text drives non-rectangular indoor scenes.

abstract click to expand

Large Language Models (LLMs) have demonstrated remarkable capabilities in 3D indoor synthesis for Manhattan environments. However, existing methods often fail to capture plausible object layout patterns in non-Manhattan settings, primarily because they struggle to model non-orthogonal spatial relationships, leading to high geometric violations and low physical fidelity. To address this challenge, we propose SPG-Layout, a novel text-driven framework designed to generate physically plausible indoor scenes within complex non-Manhattan environments. Specifically, we first utilize statistical priors of object distributions to guide the training process, enhancing environmental understanding and fidelity. Furthermore, mirroring human design workflows, we adopt a hierarchical layout strategy that prioritizes the placement of large objects, thereby substantially minimizing layout violations. By synergizing these components, SPG-Layout achieves a balanced optimization of semantic realism and physical plausibility. To evaluate performance in these complex settings, we constructed a new benchmark comprising 500 diverse non-Manhattan environments. Extensive experiments demonstrate that SPG-Layout consistently and significantly outperforms existing methods across both Manhattan and non-Manhattan environments. The code will be publicly released.

0

cs.CV 2026-07-03

Object LeJEPA beats image LeJEPA on four tasks with 10-100% COCO

by Jakob Geusen, Ender Konukoglu

Object-centric LeJEPA

Fixed SAM masks let the distributional objective align objects instead of scenes, raising tracking, classification, segmentation and re-iden

abstract click to expand

Image encoders trained with LeJEPA can deliver strong features for downstream tasks, but, like other image-level self-supervised methods, typically require large training datasets. Aligning representations at the level of objects rather than whole scenes promises greater data efficiency, but doing this in a completely self-supervised way, effectively jointly partitioning a scene and representing its objects, is unstable: the two are locked in a cyclic dependency, partitioning requires meaningful representations, while meaningful representations require consistent partitioning. We sidestep this instability by taking object masks as given during training, using cheap, off-the-shelf SAM proposals. We extend LeJEPA - whose distributional anti-collapse objective ports naturally from whole images to variable-sized sets of objects - to align object-centric representations rather than whole images. An additional instance-separating loss, which treats other objects in the same scene as negatives, further boosts downstream performance. Across two model scales and 10-100% of COCO, object-level LeJEPA outperforms image-level LeJEPA on tracking (DAVIS), classification (ImageNet-1k), segmentation (ADE20k), and re-identification (NAVI).

0

cs.RO 2026-07-03

Inverse-dynamics check cuts planning compute in world models

by Gawon Seo, Dongwon Kim +1 more

ACID: Action Consistency via Inverse Dynamics for Planning with World Models

ACID penalizes predicted transitions that cannot recover the conditioned action, matching baseline accuracy with far fewer evaluations acros

abstract click to expand

Decision-time planning with action-conditioned world models has become a popular paradigm for embodied control. However, the standard planning cost judges a candidate solely by how close its predicted terminal state lies to the goal, leaving the realizability of the intermediate transitions unchecked -- a predicted trajectory can look convincing while the environment rollout drifts away from it. In this paper, we propose ACID, a decision-time planning framework that introduces cycle action consistency: the action inferred backward from a predicted transition by an inverse dynamics model should recover the one that was conditioned on. We fold this per-step residual into the planning cost via a scale-invariant adaptive weight. Across four action-conditioned world models and six tasks spanning rigid and deformable manipulation, articulated control, and visual navigation, ACID consistently improves planning and matches the baseline's accuracy with substantially less planning compute.

0

cs.CV 2026-07-03

Framework infers visual concepts from image example sets

by Nick Stracke, Kolja Bauer +4 more

Show Me Examples: Inferring Visual Concepts from Image Sets

It generates new images that apply the shared concept to a query, outperforming standard vision-language models on accuracy and diversity.

abstract click to expand

Vision-language models (VLMs) can follow complex textual instructions, yet they struggle to reason from purely visual context. In particular, current models fail to infer shared concepts from sets of example images and apply them to new inputs. We introduce Visual Concept Inference from Sets (VICIS), a task that evaluates this capability. Given a small context set of images sharing a concept and a query image, the model must generate new images that preserve the context-defined concept while remaining consistent with the query. We show that state-of-the-art VLMs perform poorly on this task, often ignoring the visual context or defaulting to biased generations. To address this gap, we propose a training framework and architecture that learn to infer visual concepts from image sets and extract concept-specific embeddings from queries. Experiments on synthetic data and large-scale ImageNet/WordNet data show that our model generates more accurate and diverse outputs and generalizes to unseen concepts and modalities such as sketches.

0

cs.CV 2026-07-03

ViTs gain complexity by specializing layers while keeping tokens linked

by Kaustubh Kapil, Kishor P. Upla

Transformer Geometry Observatory TGO-II: Representational Similarity Observatory

CKA, SVCCA and dimensionality measures show manifold expansion without loss of token interactions across training.

abstract click to expand

While Vision Transformers have achieved remarkable success across computer vision and language applications, the geometric evolution of their internal representations throughout training remains insufficiently understood. Existing analyses primarily focus on attention mechanisms and downstream performance, leaving the evolution of representation geometry largely unexplored. In this work, we present Transformer Geometry Observatory-II (TGO-II), a representation geometry analysis framework designed to investigate how Transformer representations evolve during supervised training. TGO-II analyzes Vision Transformer (ViT-Small/16) representations using Centered Kernel Alignment (CKA), Singular Vector Canonical Correlation Analysis (SVCCA), Two-Nearest Neighbor Intrinsic Dimensionality (TwoNN-ID), and token covariance analysis. Our experiments reveal three key observations. First, both CKA and SVCCA progressively decrease throughout training, indicating increasing representational specialization across Transformer layers. Second, intrinsic dimensionality consistently increases before stabilizing, suggesting progressive expansion of the representation manifold into a larger set of locally accessible degrees of freedom. Third, token covariance and coupling analyses demonstrate that strong token interaction structure persists throughout training, challenging the hypothesis that increasing representational complexity arises primarily from progressive token independence. These findings suggest that representation complexity and layer specialization emerge simultaneously during training. Manifold expansion appears to occur without token decoupling. Together, these observations motivate a new hypothesis in which Vision Transformers increase representational complexity through progressively richer transformations while preserving strong token interaction structure during learning.

0

cs.CV 2026-07-03

Multi-encoder matching sets new one-step ImageNet record

by Lan Feng, Wuyang Li +3 more

Representation Distribution Matching for One-Step Visual Generation

Improved RDM reaches SW_r14 of 1.30 and is preferred over prior best on 71 percent of human-preference comparisons.

abstract click to expand

We elucidate the design space of Representation Distribution Matching (RDM), our name for the paradigm that trains a one-step image generator by matching generated and reference feature distributions under frozen pretrained encoders. We identify two design axes, how the distributions are compared and the representations they are compared in, and controlled studies along them yield three findings. First, the classical MMD, which could not train convincing generators a decade ago, becomes a strong and scalable objective once estimated right. Second, the generated batch is then the operative variable, with an optimum above 2048, far beyond customary batch sizes. Third, any single representation can be gamed, driven below the real score while images stay visibly fake, so we match against a balanced battery of encoders and evaluate with SW_r14, a Sliced-Wasserstein distance over 14 encoders that is independent of the training loss and resists gaming. Combining the preferred choices yields improved RDM (iRDM): it sets the one-step state of the art on ImageNet at SW_r14 1.30, corroborated by PickScore, a human-preference proxy our objective never optimizes, which prefers it over the prior best one-step generator on 71.2% of matched samples. The same recipe post-trains the four-step FLUX.2 [klein] into a one-step generator, surpassing the four-step version on GenEval, 0.826 to 0.794, and on PickScore, 22.76 to 22.58, in 90 H200 GPU-hours. Project page: https://alan-lanfeng.github.io/rdm/.

0

cs.CV 2026-07-03

RGB-only training produces infrared and polarimetric novel views

by Federico Lincetto, Gianluca Agresti +3 more

Learning Spectral and Polarimetric Clues for One-to-Multimodal Novel View Synthesis

Multimodal pre-training teaches correlations that let the model output other modalities when fine-tuned on RGB alone

abstract click to expand

Neural rendering techniques allow for accurate reconstruction of the geometry and color appearance of 3D scenes. Some methods have extended their use to additional imaging modalities, such as multispectral, infrared, or polarimetric data. However, all of these approaches require expensive sensors and calibrated setups to capture new multimodal frames for each new scene. We propose Spectral and Polarimetric Implicit Learned Representation (SPoILeR), a novel method to obtain multi-view consistent renderings of unconventional modalities for scenes where either only RGB frames or very few of the additional modalities are available. Thanks to a multimodal pre-training phase, the model learns the mutual correlation between different modalities. This step allows predicting accurate renderings of unconventional modalities during a fine-tuning phase supervised only by RGB images. Experimental results show that the approach can accurately render infrared, polarimetric, and multispectral frames for scenes where no input sample captured by these types of sensors is provided.

0

cs.CV 2026-07-03

On-device AI app locates personal items for visually impaired users

by Cristian-Gabriel Florea, Stelian Spînu

VisionAId: An Offline-First Multimodal Android Assistant for People with Visual Impairment, Featuring Personalized Object Retrieval

VisionAId registers user objects via few-shot photos and guides retrieval with AR markers, spatial audio and haptics on standard Android pho

abstract click to expand

Over 285 million people worldwide live with a visual impairment, for whom everyday tasks such as avoiding obstacles, locating personal belongings, recognizing familiar faces, or handling cash remain persistent obstacles to personal autonomy. Existing assistive applications are typically limited to recognizing predefined categories, depend heavily on cloud connectivity, or require dedicated hardware. We present VisionAId, an Android application that turns a commodity smartphone into a real-time visual assistant. The system integrates six on-device deep learning models (metric monocular depth estimation, instance segmentation, visual and facial embeddings, face detection, and a custom banknote detector) running entirely through ONNX Runtime, with an optional cloud large language model (Google Gemini Flash) used only for narrative scene description and automatic object labeling. A distinctive contribution is a few-shot pipeline for personal objects: the user photographs an object from several angles, and the system later locates that specific instance in the environment, guiding the user toward it with augmented-reality markers, spatial audio, and distance-proportional haptics. All feedback is multimodal (Romanian speech synthesis, voice commands, vibration). On a reference device (Samsung Galaxy S21 Ultra), INT8 quantization reduces depth latency from ~1200 ms to ~491 ms, the custom banknote detector reaches an mAP@50 of 0.986, and metric depth is calibrated to below 1 cm of error within 3 m.

0

cs.CV 2026-07-03

Attention modules boost spacecraft 6D pose sensing from RGB

by Yonglong Zhang, Yang Liu

GAP-GDRNet: Geometry-Aware Monocular Visual Pose Sensing on a Single-Target Synthetic Spacecraft Dataset

Placing refinement before geometric prediction and self-attention in Patch-PnP handles weak textures on synthetic data.

abstract click to expand

Monocular relative pose sensing is a central perception problem in non-cooperative rendezvous and on-orbit servicing. In spacecraft images, however, weak surface texture, thin appendages, illumination changes, and partial occlusion often leave only sparse and unstable geometric evidence. This article presents GAP-GDRNet, a geometry-aware attention-enhanced framework for monocular RGB-based 6D pose sensing. The method follows the geometry-guided direct regression paradigm of GDR-Net and modifies two points in the pipeline: an attention-based feature refinement (AFR) module is placed before dense geometric prediction, and a patch-level geometric self-attention (PGSA) module is inserted into Patch-PnP. AFR reinforces global spacecraft structure together with local weak-texture cues; PGSA then relates downsampled geometric patches before final pose regression. A Blender-based annotation process supplies target masks, visible-region masks, dense model-coordinate maps, camera intrinsics, and 6D pose labels for supervised training.

0

cs.RO 2026-07-03

Hybrid moving-static camera data cuts VLA shortcut learning

by Jincheng Tang, Yilong Zhu +3 more

The Moving Eye: Enhancing VLA Spatial Generalization via Hybrid Dynamic Data Collection

Continuous motion plus diverse static views enables generalization to new poses where extra fixed viewpoints alone fail, across ACT, Diffusi

abstract click to expand

Vision-Language-Action (VLA) models have shown remarkable promise in generalized robotic manipulation. However, their spatial generalization remains fragile. We argue that simply increasing the number of viewpoints is insufficient. Models often fall into the trap of Shortcut Learning, latching onto spurious correlations (e.g., fixed relative poses between objects or between the camera and robot base) rather than learning true spatial relationships. In this work, we propose a data-centric solution to enhance VLA spatial generalization. We utilize a dual-arm setup where one arm performs manipulation while the other serves as a mobile environmental camera. We systematically evaluate three data distribution patterns: Fixed, Multi-Fixed, and Moving Views. Our findings reveal that a hybrid strategy, combining continuous camera motion with diverse static viewpoints, yields the best performance by substantially reducing spurious correlations while maintaining training stability. Our experiments demonstrate that this strategy mitigates spurious correlations, enabling VLAs to generalize to unseen camera poses and object configurations where simply adding more static viewpoints fails. Crucially, we reveal that the susceptibility to shortcut learning and the struggle with spatial generalization are universal characteristics shared across diverse architectures. Consequently, all evaluated models (ACT, Diffusion, and VLA models including Pi0 and Gr00t) benefit significantly from our mixed data strategy.

0

cs.CV 2026-07-03

Evolutionary search creates videos that best activate brain regions

by Yingtian Tang, Sogand Salehi +4 more

NEvo: Neural-Guided Evolutionary Video Synthesis for Dynamic Visual Selectivity

Guided by encoding model predictions, the stimuli outperform handcrafted localizers and expose temporal sensitivity differences across pathw

abstract click to expand

The human brain processes dynamic visual input through hierarchically organized, functionally specialized regions. While recent in silico brain encoding models can synthesize optimal stimuli to probe selectivity in different brain regions, prior work has been largely limited to static images, leaving dynamic visual processing underexplored. We introduce a novel neural-guided video synthesis framework that generates stimuli optimized for target brain regions across visual cortex. Our method performs evolutionary search over a structured prompt space, guided by a dynamic encoding model that predicts voxel-level responses to video inputs. By maximizing predicted activity for a target ROI, the framework efficiently discovers hyper-activating dynamic stimuli that consistently surpass handcrafted localizer videos. The synthesized videos recover known selectivities across ventral, dorsal, and lateral pathways, and further reveal systematic differences in sensitivity to temporal dynamics. A searchlight analysis provides new insight into the progression toward increasingly complex social-dynamic features along the lateral stream, further supported by probing with synthesized abstract, non-naturalistic stimuli. Taken together, our framework enables in silico exploration of dynamic visual selectivity, with new predictions for in vivo experiments

0

cs.CV 2026-07-03

Feed-forward model predicts 3D Gaussians with materials from images

by Polina Karpikova, Wenjing Bian +3 more

InvSplat: Inverse Feed-Forward Scene Splatting

Geometry and reflectance parameters are recovered jointly in one pass, supporting relighting without per-scene optimization.

abstract click to expand

Inverse rendering aims to recover both 3D geometry and physically meaningful material properties from images, enabling applications such as relighting and novel view synthesis. Optimization-based methods achieve high fidelity but require costly per-scene fitting, while image-space learning-based approaches often suffer from multi-view inconsistencies and lack an explicit 3D representation for stable novel view rendering. We present a feed-forward multi-view reconstruction framework for inverse rendering that directly predicts a structured 3D Gaussian representation with intrinsic material attributes. Each Gaussian primitive is parameterized by mean, normal, opacity, rotation, scale, albedo, metallic, and roughness, enabling a disentangled and physically grounded scene representation. Our model integrates priors from a material estimation network with a multi-view 3D reconstruction backbone, allowing joint prediction of geometry and reflectance parameters in a single forward pass. Experiments on synthetic and real-world datasets demonstrate improved multi-view consistency compared to 2D baselines, accurate material recovery, and stable novel view rendering. Our representation further supports physically-based relighting and more faithful modeling of view-dependent effects compared to existing RGB-based feed-forward reconstruction methods. Our project webpage is: $\href{https://poliik.github.io/invsplat/}{\text{https://poliik.github.io/invsplat/}}$.

0

cs.CV 2026-07-03

Optimized synthetic scenes expose 10x more VLM errors in cars

by Lev Sorokin, Chen Yang +2 more

Search-based Testing of Vision Language Models for In-Car Scene Understanding

ISU-Test searches scene parameters to generate test cases that reveal incomplete or wrong outputs from vision-language models.

abstract click to expand

In the automotive domain, in-car scene understanding (ISU) enables the detection of safety-critical events, such as driver distraction, and supports drivers or passengers by analyzing the in-car scene and adapting the environment (e.g., ambient lighting). The industry is increasingly exploring vision-language models (VLMs) to interpret camera-recorded in-car scenes and extract information for downstream reasoning tasks. However, VLMs may generate incomplete, erroneous, or misleading scene descriptions, highlighting the need for systematic testing. Collecting real in-vehicle data is costly, difficult to scale, and often infeasible, particularly in early design stages. In this paper, we present ISU-Test, an automated testing approach that combines rendering-based scene generation with search-based testing to evaluate ISU systems. By framing testing as an optimization problem and systematically modifying scene parameters, our method generates diverse in-car scenarios and explores a wide range of configurations. We evaluate ISU-Test on both an industrial prototype and open-source VLMs across two case studies: question answering and captioning, comparing against randomized scenario generation. Results show that ISU-Test significantly outperforms the baseline, achieving up to 10 times higher failure rates and up to 3.6 times higher failure coverage.

0

cs.CV 2026-07-03

Selective state units keep change maps stable across new domains

by Yuzhi He, Junxi Huang +2 more

Dual-Selective Network for Domain-Incremental Change Detection

DSINet filters domain noise in Mamba-based models to retain prior spatial structures at linear cost.

abstract click to expand

Domain-incremental change detection (DICD) continuously adapts models to new geographic domains while preserving prior knowledge. However, a structural mismatch exists: the label space remains fixed while domain characteristics vary drastically. Consequently, incremental models struggle to maintain stable spatial change representations across domains. Existing strategies, such as replay-based or regularization-based methods, often fail to scale to long domain sequences, leading to knowledge degradation or increased computational cost. We propose Dual-Selective Incremental Network (DSINet), a unified framework built on visual state space models. DSINet leverages Mamba's input-dependent selective mechanism through a selective spatial state unit (S3U). This unit preserves stable spatial change structures while filtering domain-specific variations during feature propagation. As a result, spatial representations remain stable across domains, preventing the accumulation of feature confusion over incremental steps. Additionally, we employ a concentration-balanced distillation (CBD) strategy to stabilize knowledge transfer across domains. It balances hardness and confidence concentration effects during incremental updates. This ensures reliable probability mass allocation and prevents over-smoothing or mode collapse during distillation. Together, these mechanisms maintain stable learning dynamics throughout incremental stages. Experimental results demonstrate that DSINet mitigates knowledge degradation across long domain sequences while maintaining the linear computational efficiency of state space models.

0

cs.RO 2026-07-03

Low-cost drone runs real-time face tracking and line following

by Andrei-Marian Ungureanu, Stelian Spînu

Real-Time Visual Intelligence on Low-Cost UAVs: A Modular Approach for Tracking, Scanning, and Navigation

Modular AI on DJI Tello combines detection, recognition and depth estimation for tracking, scanning and navigation tasks.

abstract click to expand

Autonomous drones are rapidly transforming modern warfare and civil applications alike. This paper presents the development of an integrated intelligent drone system designed to serve as a personal assistant. Leveraging the DJI Tello drone platform, we implemented a modular architecture that integrates three core artificial intelligence functionalities: facial detection, facial recognition, and depth estimation from monocular vision. A web-based interface enables seamless drone control and real-time video monitoring, while a Python-based server processes visual data and executes inference pipelines using lightweight neural models optimized for embedded systems. Unlike existing commercial solutions, this system emphasizes accessibility, low-cost hardware, and open-source technologies. The system demonstrates robust performance in real-world conditions, including person tracking, indoor scanning, and autonomous line following using virtual sensors. This project validates the applicability of advanced AI techniques in real-time robotic systems and illustrates the feasibility of deploying them on constrained hardware, providing a foundation for future research in autonomous UAVs for military, rescue, and surveillance missions.

0

cs.LG 2026-07-03

Distribution rewards improve FID scores on generative models

by Ruihang Li, Mengde Xu +5 more

Optimizing Visual Generative Models via Distribution-wise Rewards

Evaluating full sample distributions rather than individuals cuts FID-50K from 8.30 to 5.77 on SiT while preserving diversity.

abstract click to expand

Conventional reinforcement learning strategies for visual generation typically employ sample-wise reward functions, yet this practice frequently results in reward hacking that degrades image diversity and introduces visual anomalies. To address these limitations, we present a novel framework that finetunes generative models using distribution-wise rewards, ensuring better alignment with real-world data distributions. Unlike rewards that evaluate samples individually, distribution-wise reward accounts for the data distribution of the samples, mitigating the mode collapse problem that occurs when all samples optimize towards the same direction independently. To overcome the prohibitive computational cost of estimating these rewards, we introduce a subset-replace strategy that efficiently provides reward signals by updating only a small subset of a generated reference set. Additionally, we apply RL to optimize post-hoc model merging coefficients, potentially mitigating the train-inference inconsistency caused by introducing stochastic differential equation (SDE) in regular RL practices. Extensive experiments show our approach significantly improves FID-50K across various base models, from 8.30 to 5.77 for SiT and from 3.74 to 3.52 for EDM2. Qualitative evaluation also confirms that our method enhances perceptual quality while preserving sample diversity.

0

cs.CV 2026-07-03

1.2M multidisciplinary samples train AI on accurate diagrams

by Zhaokai Wang, Mingxin Liu +12 more

DisciplineGen-1M: A Large-Scale Dataset for Multidisciplinary Visual Generation and Editing

Dataset from math to sports supplies paired images with controllable differences to improve knowledge-grounded generation and editing.

abstract click to expand

Recent image generation and editing models can produce visually appealing natural images, yet they remain unreliable when the target image is a knowledge-intensive diagram whose correctness depends on disciplinary concepts, symbolic structure, and precise spatial relations. We introduce DisciplineGen-1M, a million-scale multidisciplinary dataset that supports text-to-image generation and image editing. It contains 1.2M samples spanning mathematics, physics, chemistry, biology, geography, computer science, economics, history, music, and sports. To construct the dataset, we design a scalable framework that combines vector-graphics rendering, OCR-based editing, curated programmatic synthesis, and large-scale text-to-image filtering. These pipelines produce captions, editing instructions, structured annotations, and paired images with controllable semantic differences. Building on DisciplineGen-1M, we further introduce a discipline-informed reasoning-generation model for both text-to-image generation and image editing. Experiments on discipline-related benchmarks, GenExam and GRADE, show substantial improvements over open-source baselines, while evaluations on general reasoning-informed benchmarks, WISE and RISE, further indicate broader transfer. The results suggest that large-scale structured academic visual data is a key ingredient for moving image generation from aesthetic plausibility toward verifiable knowledge-grounded visual creation. We will publicly release our dataset, model, and source code of the data curation pipeline to ensure reproducibility and benefit future research.

0

cs.CV 2026-07-03

Flow matching replaces textual inversion for zero-shot image retrieval

by Zhenqi He, Ziqi Jiang +4 more

FlowCIR: Semantic Transport via Flow Matching for Zero-Shot Composed Image Retrieval

A small module learns to transport reference-conditioned instructions to target embeddings using ten times less training than earlier method

abstract click to expand

Zero-shot composed image retrieval (ZS-CIR) aims to retrieve a target image by editing a reference image with a natural-language instruction, without relying on domain-specific annotated triplets. Most existing ZS-CIR methods rely on textual inversion to translate the reference image into pseudo-text tokens and then compose them with the instruction via simple concatenation in the text space, which can be lossy and brittle for fine-grained semantics. In this work, we propose a new paradigm, namely FlowCIR, that casts ZS-CIR as conditional semantic transport between reference and target embeddings. Leveraging \emph{conditional flow matching}, our model learns a lightweight transport field that maps the instruction representation toward a target-aligned query embedding conditioned on the reference image. Since FlowCIR operates on pre-extracted VLM embeddings and trains only a small transport module without updating the image or text encoder, it offers a computationally efficient training protocol compared with prior textual-inversion-based approaches. The resulting framework is training-efficient, requiring roughly $10\times$ fewer training resources than prior textual-inversion-based approaches. We further identify negation and removal as a major failure mode of VLM-based composition. To address this, we propose an inference-only Multi-Negative Steering strategy that steers a negation-containing relative instruction away from its negated semantics, mitigating the limited negation handling of VLMs and improving robustness on negation-heavy queries. Extensive experiments on standard CIR benchmarks demonstrate that FlowCIR achieves strong and competitive performance compared with recent ZS-CIR methods.

0

cs.CV 2026-07-03

Mixing augmentations raise vein accuracy but lower attack resistance

by Haiyang Li, Yuming Fu +5 more

AGVBench: A Reliability-Oriented Benchmark of Data Augmentation for Vein Recognition

Benchmark finds that best clean performers are often poorly calibrated and vulnerable to adversarial examples.

abstract click to expand

Vein recognition is a secure biometric technology often constrained by limited annotated data and imaging variations. While data augmentation mitigates this, strategies designed for natural images may disrupt the fine-grained topology and textures essential for identity discrimination. We present AGVBench, which evaluates 30 representative augmentation strategies on five public palm- and finger-vein datasets with seven backbone architectures, covering classic CNNs, vision transformers, and vein-specific recognition models. Our results show that multi-image mixing methods (e.g., MixUp, PuzzleMix, StarMixup) generally provide the strongest recognition performance. However, they are often poorly calibrated and vulnerable to adversarial perturbations, revealing a clear inconsistency between clean accuracy and adversarial security. We also find that severe geometric transformations frequently degrade recognition, which is potentially due to feature misalignment or spatial cropping, and that augmentation effectiveness varies across palm and finger vein datasets. These findings prove that accuracy-centric evaluation is insufficient for biometric augmentation. AGVBench provides standardized protocols to support reproducible research and guide the design of reliable, secure, and robust vein recognition systems. Our codebase is available at https://github.com/Advance-VeinTech-Innovators/AGVBench.

0

cs.CV 2026-07-03

Specialized benchmark reveals VLM failures on video grounding

by Rintaro Otsubo, Ryo Fujii +7 more

AnyGroundBench: A Specialized-Domain Benchmark for Video Grounding in Vision-Language Models

Models struggle with animal behavior, surgery and industry clips in both zero-shot and example-based tests

abstract click to expand

Vision-Language Models (VLMs) have demonstrated immense promise in Spatio-Temporal Video Grounding (STVG). However, current evaluation protocols are largely confined to zero-shot assessments on general, daily-life benchmarks. This creates a critical disconnect from real-world applications in specialized fields, where models inevitably encounter rare visual concepts and complex spatio-temporal dynamics. Since exhaustive pre-training across infinite data distributions is infeasible, the ability to adapt to novel domains is essential. To bridge this gap, we introduce AnyGroundBench, a domain-adaptation benchmark designed to shift the STVG evaluation paradigm from static zero-shot testing to rigorous domain adaptation. Targeting five specialized domains (animal, industry, sports, surgery, and public security), AnyGroundBench pairs newly captured videos such as expert-annotated mouse behaviors with established datasets, unifying them through dense, high-fidelity spatio-temporal annotations. Crucially, the benchmark provides dedicated training subsets to systematically measure domain adaptability. We extensively evaluate 15 state-of-the-art VLMs, assessing their zero-shot generalization and In-Context Learning (ICL) capabilities under practical computational constraints. Ultimately, our findings reveal that current models fail in both zero-shot and ICL-based adaptation when confronted with specialized domains, exposing critical flaws in spatio-temporal reasoning that future research must address.

0

cs.CV 2026-07-03

ArcAD forms compact normal boundaries from limited samples

by Ningning Han, Lei Fan +6 more

ArcAD: Anomaly-Rectified Calibration for Cold-Start Supervised Anomaly Detection

Push-pull calibration on a hypersphere lets reconstruction detectors handle cold-start conditions with scarce normals and few anomalies.

abstract click to expand

The deployment of Industrial Anomaly Detection (IAD) in real-world manufacturing frequently encounters a challenging cold-start bottleneck, in which limited normal samples fail to represent the full normal distribution and only a few anomalies are available. Under such a regime, existing methods struggle to form compact normal boundaries and fail to effectively exploit supervised signals from rare defects. To address this challenge, we propose Anomaly-Rectified Cold-start AD (ArcAD), a plug-and-play calibration framework for reconstruction-based IAD baselines. ArcAD follows a push-pull learning paradigm to construct a compact and discriminative normal boundary under data scarcity. On the one hand, ArcAD projects limited normal samples onto a hypersphere and pulls them into multiple compact clusters to maximize coverage of the normal manifold. On the other hand, it synthesizes pseudo-anomalies on the hypersphere and leverages real anomalies to push the boundary inward and sharpen anomaly discrimination. Extensive experiments on MVTec-AD, VisA, Real-IAD, and MANTA demonstrate that ArcAD significantly outperforms state-of-the-art supervised and unsupervised methods in both single-class and multi-class settings under cold-start conditions. Code is available at: https://github.com/LGC-AD/ArcAD.

0

cs.CV 2026-07-03

Token compression breaks at high rates while pruning holds in ViT segmentation

by Tien-Phat Nguyen, Ngai-Man Cheung

When Token Compression Breaks: Structural Pruning vs. Token Reduction for Robust ViT Segmentation under High Compression

Structural pruning degrades smoothly and a prune-then-merge strategy improves accuracy-robustness at aggressive compression on corrupted inp

abstract click to expand

Vision Transformers (ViTs) are strong backbones for semantic segmentation, but their computational cost limits deployment. Recent token compression methods for efficient transformer-based segmentation reduce this cost by decreasing the number of tokens. However, existing evaluations primarily focus on low-to-moderate compression, leaving their behavior under aggressive compression and corrupted inputs unclear. Meanwhile, structural pruning provides an orthogonal route to efficiency by removing redundant components in the ViT architecture, but is rarely compared to token compression under a unified protocol. To bridge this gap, we benchmark representative token compression and structural pruning methods for ViT-based semantic segmentation under matched FLOPs on ADE20K and Cityscapes, together with their common-corruption variants ADE20K-C and Cityscapes-C. Our results reveal a consistent trend on both clean and corrupted inputs: token compression is highly effective at mild reductions but degrades sharply when compression becomes severe, consistent with substantial information loss from overly aggressive token reduction. In contrast, structural pruning exhibits a smoother degradation curve and is more stable at high compression. Motivated by these findings, we study a prune-then-merge pipeline that applies moderate token compression on top of a moderately pruned backbone. At comparable FLOPs, this combined strategy consistently achieves a better accuracy-robustness trade-off at high compression, offering a practical recipe for deployment-oriented ViT segmentation. Code is available at https://github.com/phatnguyencs/vit-seg-compression.

0

cs.CV 2026-07-03

Thresholds tune waste AI accuracy against human review cost

by Mohammed Fahad Ali, Dominique Briechle +3 more

Efficient Waste Sorting for Circular Economy: A Confidence-guided comparison between One-Vs-All and One-Vs-Rest Classification Strategies with Human-in-the-Loop for Automated Waste Sorting

OvA and OvR compared on Goslar dataset to flag uncertain items while cutting overall sorting errors

abstract click to expand

The complexity of waste disposal regulations across European countries poses significant challenges for the residents and hinders the transition to a Circular Economy. In Germany, the proper sorting and disposal of household waste remains challenging across municipalities. Consequently, substantially reducing incorrectly disposed waste is vital for improving waste management and advancing the Circular Economy. AI-based waste sorting solutions can support residents through user-friendly tools, such as mobile applications, that guide proper waste disposal. To be effective in supporting the Circular Economy, however, these solutions must be configurable to reflect the specific waste sorting scheme of individual municipalities in Germany. In the scope of this work, an evaluation and analysis are performed of two prominent classification strategies: OvA and OvR. The research uses a dataset constructed in alignment with the waste categories and sorting scheme of the city of Goslar in Germany. Moreover, this work aims to extend beyond the overall performance by examining the behavior of OvA and OvR classification strategies in identifying samples likely to be misclassified. These classification strategies are compared by applying varying confidence thresholds to identify uncertain samples for subsequent human review. This evaluation aims to balance the number of misclassifications against the human effort required for data annotation.

0

cs.CV 2026-07-03

DetailAnywhere generates fashion close-ups from focus markers

by Zijun Li, Yimin Zhou +13 more

DetailAnywhere: Fashion Detail Generation via Cross-Modal Feature Alignment Distillation

Cross-modal distillation aligns diffusion transformer branches to produce identity-preserving details without text prompts.

abstract click to expand

Diffusion-based generative AI has achieved remarkable success in e-commerce applications such as virtual try-on, poster generation, and product background synthesis. However, when making online purchasing decisions for apparel, consumers also desire the freedom to examine specific detail regions of interest, such as collars, cuffs, and fabric textures, yet existing methods have not explicitly studied this setting. We therefore formalize a new, non-template task: Fashion Detail Generation with focus conditioning, and release FDBench, the first benchmark comprising 40K+ human-verified reference-detail pairs across 41 different categories. This task poses a unique semantic gap challenge: the model must bridge the correspondence between a focus marker on a product reference image and a photorealistic close-up view of the indicated region, while faithfully preserving the garment's identity, without any precise prompt. To bridge this gap, we propose Cross-modal Feature Alignment Distillation (CFAD), which leverages a fine-tuned DINOv3 teacher to align both branches of a Multimodal Diffusion Transformer in a shared semantic space via dual-branch distillation. To further improve consistency between generated details and reference images, we introduce a consistency reward model that jointly scores image pairs along three quality axes and optimizes generation via reinforcement learning. Experiments show that our model DetailAnywhere significantly outperforms all state-of-the-art opensource methods across all metrics and human evaluations.

0

cs.CV 2026-07-03

Backprop-free method reaches 0.48 Dice on thyroid nodule US

by Mohammad Amanour Rahman

MedSaab-US: A Backpropagation-Free Multi-Scale Wavelet-Saab Framework for Thyroid Nodule Segmentation in Ultrasound Images

MedSaab-US uses multi-scale wavelet and Saab transforms plus XGBoost for CPU inference under 500K parameters on TN3K data.

abstract click to expand

Deep learning (DL) methods dominate thyroid nodule segmentation in ultrasound (US) images, achieving high Dice scores but at the cost of millions of parameters, GPU-dependent training via backpropagation, and limited mathematical tractability. These limitations impede deployment in resource-constrained environments. In this paper, we propose MedSaab-US, a backpropagation-free segmentation framework grounded in the Green Learning paradigm. MedSaab-US extracts multi-scale spatial-frequency features by combining multi-level Discrete Wavelet Transform (DWT) with multi-scale channel-wise Saab (Subspace Approximation with Adjusted Bias) transforms at patch sizes of 5 x 5, 11 x 11, and 21 x 21 pixels. Label-Assisted Greedy (LAG) feature selection retains the most discriminative features, which are fed to an XGBoost classifier for pixel-wise prediction. The Saab transform parameters are determined analytically from data statistics, while XGBoost employs iterative greedy tree construction without requiring backpropagation. Evaluated on the TN3K dataset (2,879 training and 614 test images), MedSaab-US achieves a mean Dice coefficient of 0.4784 +/- 0.2190, precision of 0.5768, and recall of 0.5604, with a model footprint under 500K parameters and CPU-only inference in approximately 0.3 seconds per image. We present this result as an exploratory non-DL baseline for thyroid ultrasound segmentation and analyze the specific challenges posed by isoechoic nodules. An ablation study further quantifies the contribution of each pipeline component, including separate evaluations of LAG feature selection and training-set size.

0

cs.CV 2026-07-03

Radiomics hybrid segments medical images with 9.5x fewer parameters

by Mohammad Amanour Rahman

RadiomicNet: A Hybrid Radiomics-Guided Lightweight Architecture for Interpretable Medical Image Segmentation

GLCM and LBP features routed through an attention gate improve Dice scores over larger baselines on ultrasound and polyp datasets.

abstract click to expand

Deep learning has achieved remarkable performance in medical image segmentation, yet it suffers from critical limitations: mathematical intractability, substantial parameter requirements, and lack of clinical interpretability. We propose RadiomicNet, a novel two-stream hybrid architecture that enhances standard deep learning by integrating handcrafted radiomics features directly into the segmentation learning process. The key contribution is the Radiomics Attention Gate (RAG), which leverages Gray-Level Co-occurrence Matrix (GLCM) and Local Binary Pattern (LBP) features to modulate skip-connection attention in a lightweight MobileNetV2-based encoder-decoder, providing ante-hoc interpretability without post-hoc approximations. A novel Radiomics Consistency Loss further enforces alignment between texture complexity and prediction uncertainty, reducing Expected Calibration Error (ECE) from 0.142 to 0.118. RadiomicNet achieves a Dice Similarity Coefficient (DSC) of 0.763 +/- 0.231 on the Breast Ultrasound Images (BUSI) dataset and 0.854 +/- 0.112 on Kvasir-SEG, outperforming U-KAN by 1.2% and 1.8%, respectively (p < 0.05, Wilcoxon signed-rank test), with only 3.27M parameters, 9.5x fewer than standard U-Net and 4.3x fewer than U-KAN. Gradient-based feature importance analysis reveals that GLCM dissimilarity (15.24%), GLCM energy (14.56%), and LBP entropy (11.49%) are the dominant radiomics cues, providing clinically meaningful explanations for segmentation decisions. The proposed approach demonstrates that compact, interpretable models grounded in domain knowledge can deliver state-of-the-art segmentation performance with substantially reduced computational overhead.

0

cs.CV 2026-07-03

QLoRA cuts energy 20-30% for fine-tuning on 2GB GPUs

by Altay Toktassyn, Jurn-Gyu Park

Efficient PEFT Methods with Adaptive Checkpointing for Vision Models and VLMs on Resource Constrained Consumer-GPUs

Adaptive checkpointing lowers peak memory 43-79% while DINOv2 exceeds fine-tuned accuracy at lower energy on CIFAR-100 and DTD.

abstract click to expand

Modern pretrained vision models achieve strong accuracy but demand substantial GPU memory for fine-tuning, making edge deployment impractical. This paper compares five parameter-efficient fine-tuning (PEFT) methods (Full FT, LoRA, AdaLoRA, QLoRA, BitFit) on Transformers- (ViT-Small, TinyViT) and Mamba-based vision backbones (Vim-Small, MambaVision-T) under an on-device VRAM budget (e.g., 2 GB), together with three gradient-checkpointing strategies (none, static, and a proposed memory-budget-aware adaptive algorithm); and we evaluate three families of foundation-model baselines: zero-shot contrastive vision language models (OpenCLIP, SigLIP), self-supervised vision backbones with lightweight evaluation protocols (DINOv2), and autoregressive VLMs for prompt-based classification (PaliGemma, MobileVLM, SmolVLM). Experiments on CIFAR-100 and DTD report accuracy, training time, energy, and the NetScore family of multi-objective metrics, which we extend with two deployment-aware variants. QLoRA and BitFit cut energy 20-30% at a 1-2% accuracy cost; the adaptive algorithm reduces peak memory 43-79% with 9-30% energy overhead. DINOv2 surpasses fine-tuned models on CIFAR-100 (0.917 vs. 0.897) at a fraction of the energy, while small autoregressive VLMs remain uncompetitive.

0

cs.CV 2026-07-03

Single CT scan yields pose-controllable patient twin

by Han Zhang, Boyang Zhao +1 more

Patient-Specific Articulated Digital Twins from a Single Full-Body CT Scan

Fitting a body model to the scan binds bones and organs so new poses can be simulated while skeletal geometry stays intact.

abstract click to expand

Patient-specific anatomical models provide individualized context for surgical planning, image-guided intervention, and algorithm development. However, most CT-derived models are static: they preserve the body configuration captured at scan time, but cannot represent how the same anatomy would appear after patient repositioning. This limitation is especially important for radiographic imaging, where appearance depends jointly on imaging geometry and patient pose. We present a proof-of-concept for constructing a patient-specific articulated digital twin from a single full-body CT scan. The method fits a parametric human body model (SMPL) to obtain a patient-aligned kinematic scaffold, binds segmented bones and organs to an anatomy-aware rig, and retargets body-pose changes while preserving skeletal geometry. On three full-body CT subjects, the fitted scaffold achieved 15.8 $\pm$ 4.0 mm chamfer distance and 95.9 $\pm$ 1.8% skeletal enclosure. Recomposition at the acquisition pose preserved major radiographic structure, with overall SSIM of 0.872 $\pm$ 0.016 and PSNR of 18.5 $\pm$ 1.4 dB across paired DRRs. Across unseen target poses, the resulting twins enabled articulation while maintaining high skeletal enclosure (94.4 $\pm$ 0.4%). As a feasibility demonstration, we render the articulated twin as pose-dependent DRRs. These results suggest the feasibility of extending static, view-controllable CT simulation toward pose-controllable anatomical twins for future synthetic imaging and positioning studies.

0

cs.CV 2026-07-03

Eight part tokens encode motion for any skeleton

by Yuhao Zhang, Gerard Pons-Moll +1 more

SAMoR: Motion Modelling for Articulated Objects of Any Skeleton and Topology

SAMoR compresses arbitrary rigs into a shared vocabulary, cutting cross-topology reconstruction error by 5.8 times.

abstract click to expand

Modeling motion for articulated objects of arbitrary skeleton topology remains difficult: existing motion generators target a fixed human skeleton, and prior adaptations either fail to share a vocabulary across rigs or discard motion detail through global pooling. Our key observation is that while joint-level motion does not correspond cleanly across species, motion of functional joint groups does: a human arm, a wolf foreleg, and a bird wing share motion structure despite differing joint counts and connectivity, a correspondence that joint names (e.g., "forearm", "wing_L1") partially expose even when topology does not. We introduce SAMoR (Skeleton-Aware Motion Representation for Articulated Objects), a cross-topology motion representation that encodes each motion segment as a small fixed number ($K=8$) of part tokens shared across arbitrary skeletons. A graph-transformer encoder consumes per-joint motion features, kinematic graph structure, and joint-name embeddings, then compresses them into part-level tokens via cross-attention pooling and residual vector quantization, yielding a discrete motion codebook shared across rigs. To keep the part queries from collapsing into redundant global representations, we introduce a topology-agnostic attention supervision loss, with joint-name dropout to reduce over-reliance on text labels. We curate a heterogeneous corpus from HumanML3D, Truebones Zoo, and animated Objaverse-XL assets, and evaluate SAMoR on held-out characters with unseen skeletons. It supports accurate reconstruction and cross-topology transfer, and enables text-conditioned generation and part-wise editing via a MaskGIT token generator. SAMoR reaches $2.75 \times 10^{-2}$ normalized MPJPE on cross-topology reconstruction, $5.8\times$ below the strongest adapted variable-$J$ tokenizer baseline, while remaining competitive with fixed-skeleton specialists on HumanML3D.

0

cs.LG 2026-07-03

Stacking ensemble flags early Alzheimer's from ADNI records

by Debopriya Ghosh

Predicting Early Stages Of Alzheimer's Disease And Identifying Key Biomarkers Using Deep Artificial Neural Network And Ensemble Of Machine Learning Methodologies

After fixing missing values and imbalance, the model ranks biomarkers while comparing classifiers on standard accuracy measures.

abstract click to expand

Alzheimers disease (AD) is a brain disorder that develops slowly and mainly affects memory, thinking, language, and daily activities. It is one of the most common causes of dementia and creates many difficulties for patients as well as their families. In the early stage, the symptoms are often mild and may look like normal ageing. For this reason, many people are diagnosed late, when the disease has already progressed. At present, there is no complete cure for AD. Still, early detection can help doctors manage the condition better and take suitable steps at the right time. In this study, a machine learning model is proposed to detect the early stages of Alzheimers disease using clinical details, neuropsychological test scores, and neuroimaging-related measures. The data used in this work is collected from the Alzheimers Disease Neuroimaging Initiative (ADNI). As the dataset has missing values, iterative imputation is applied to fill them. The dataset also has class imbalance, which is handled using Borderline SVM-SMOTE. After that, feature selection is carried out using wrapper-based and embedded methods so that only important features are used for training. The selected features are divided into training and testing sets, and feature scaling is applied. A stacking ensemble model is developed using Logistic Regression, Extra Trees, Bagging KNN, and LightGBM as base classifiers. Along with this, an artificial neural network is also trained on the same dataset. The performance of these models is compared using precision, recall, F1-score, and AUC-ROC. This study aims to find the best classifier and also identify important biomarkers that may help in the early diagnosis of Alzheimers disease.

0

cs.CV 2026-07-03

Similarity map warps images to lift zero-shot counting

by Muhammad Ibraheem Siddiqui, Muhammad Haris Khan

AdaCount: Training-Free Similarity-Guided Spatial and Feature Adaptation for Zero-Shot Object Counting

AdaCount reallocates resolution and amplifies features so SAM3 counts small objects better in dense scenes from text alone.

abstract click to expand

Zero-shot object counting (ZOC) aims to count instances of arbitrary object categories specified only through textual prompts. Recent training-free approaches leverage foundation models such as SAM to reformulate counting as a prompt-driven segmentation task, eliminating the need for costly counting-specific training data with point-level annotations. More recently, SAM3 introduced promptable concept segmentation, enabling the zero-shot segmentation of all instances corresponding to a text-defined concept. However, SAM3 struggles in densely populated scenes containing numerous small objects, where limited image resolution and insufficient attention to target-relevant regions often lead to missed instances and poor instance separation, hindering accurate object counting. To address this limitation, we propose AdaCount, a training-free framework for ZOC based on similarity-guided spatial and feature adaptation. AdaCount first estimates a prototype-driven similarity map that identifies target-relevant regions. This similarity map subsequently guides two complementary adaptations: (i) similarity-guided spatial warping, which reallocates image resolution toward target instances, and (ii) feature modulation, which amplifies target-relevant encoder representations. Together, these adaptations enable SAM3 to devote greater representational capacity to target-relevant regions while preserving global image context, without requiring any model retraining. Extensive experiments across six diverse counting benchmarks establish AdaCount as a new SOTA among training-free ZOC approaches.

0

cs.CV 2026-07-03

Synthetic pipeline boosts real archival film restoration

by Miko{l}aj Jastrzębski, Dawid Glinkowski +4 more

AbsoluteDegradation: A Physics-Inspired Synthetic Film-Degradation Pipeline and Archival Film Restoration Benchmark

Models trained on the modular physics-based degradations generalize better to actual footage and the 81k-frame benchmark reveals current fai

abstract click to expand

Restoring archival film remains a fundamentally challenging problem due to the absence of paired training data and the lack of standardized evaluation benchmarks. Pristine versions of deteriorated footage are physically unrecoverable, requiring supervised methods to rely on synthetic data that often fail to capture the complex, temporally coherent nature of real film degradation. At the same time, existing real-world datasets are limited in scale, quality, and accessibility, hindering reliable evaluation and fair comparison across methods. We address both limitations with AbsoluteDegradation, a physics-inspired, modular pipeline for synthesizing realistic film degradations, and a new large-scale archival benchmark. The proposed pipeline models the analog-to-digital process as a structured composition of artifact families, incorporating signal-dependent grain, parametric scratches, and temporally coherent camera motion, enabling controlled generation of diverse degradation regimes. In parallel, we introduce a curated dataset of 81,576 high-resolution frames sourced from real archival footage, designed for consistent evaluation under real-world conditions. Together, these contributions provide a unified framework for training and benchmarking restoration models. Extensive experiments across multiple architectures show that models trained with AbsoluteDegradation generalize better to real-world footage, while the proposed benchmark reveals systematic failure modes of current methods. We hope this work establishes a foundation for reproducible and domain-authentic evaluation in archival film restoration.

0

eess.IV 2026-07-03

Deep learning matches experts in penis MRI segmentation for 34k scans

by Jan Ernsting, Gunnar Paul Kordes +6 more

Population-Scale Segmentation of Penile Tissue in DIXON MRI using Deep Learning for Quantitative Phenotyping in Male Reproductive Health

Observer-level accuracy enables automated penile tissue volumetry in 34,412 UK Biobank participants.

abstract click to expand

Penile measurement is clinically relevant across male reproductive and urogenital health, including conditions such as micropenis, congenital and endocrine disorders, and sexual or urinary dysfunction. However, quantitative assessment of penile size has relied mainly on external length or circumference measurements, which are difficult to standardize, sensitive to measurement conditions, and unable to capture the internal portion of the penis. MRI enables volumetric assessment of the whole penis in vivo, but automated segmentation has not previously been established at population scale. Automated whole-organ volumetry would enable high-throughput phenotyping for multi-omics and clinical studies of male reproductive disease. Here, we present a deep learning framework for whole-penis segmentation in multi-channel DIXON MRI. Using a newly curated expert-annotated training dataset ($n = 145$ subjects; $13,050$ annotated slices) and a double-annotated independent test benchmark ($n = 24$ subjects; $2,160$ double-annotated slices), we optimized a 3D nnU-Net architecture. The model achieved a 5-fold cross-validation Dice score of $0.90$ and performed at observer-level accuracy on the independent test set (Dice: $0.92$; Hausdorff distance: $3.58$). We deployed the model in $34,412$ UK Biobank participants, enabling automated quantification of total penile tissue, including both external and internal components. Longitudinal evaluation in 2,282 men demonstrated high inter-session reproducibility ($r = 0.87$). This framework establishes a reproducible and population-scalable method for MRI-based assessment of penile anatomy and provides an open technical resource for future studies in urological imaging and male reproductive health. The trained model weights will be publicly released.

0

cs.CV 2026-07-03

Gaussian splatting turns single panoramic X-ray into 3D dental volume

by Tomasz Szczepański, Szymon P{l}otka +3 more

X-Splat: Gaussian Splatting for 3D CBCT Generation from Single Panoramic Radiograph

Primitives initialized along X-ray paths recover teeth and mandibular canal missed by NeRF and GAN baselines.

abstract click to expand

Generating a 3D dental volume from a single panoramic radiograph (PXR) could provide a low-radiation alternative to Cone-Beam Computed Tomography (CBCT), but the problem is highly underdetermined: panoramic acquisition integrates 3D attenuation along curved X-ray paths into a 2D image, leaving depth-resolved anatomy unobserved. Existing implicit and generative approaches often produce oversmoothed geometry or anatomically inconsistent hallucinations, lacking geometry-driven supervision and relying on smooth representations unable to precisely localize sharp anatomical boundaries. We propose X-Splat, the first Gaussian Splatting framework for generating CBCT-like 3D dental volumes from a single PXR. X-Splat uses the known panoramic acquisition geometry as a generation scaffold: learnable anisotropic Gaussian primitives are initialized along the X-ray paths that formed the input image and adjusted in a single feed-forward pass, constrained by Beer-Lambert reprojection and multi-view radiographic training supervision. A lightweight residual refiner adds dataset-level anatomical priors without overriding the geometry already resolved by the Gaussians. We train on synthetic PXR-CBCT pairs, enabling direct volumetric supervision without paired real scans. We further introduce segmentation-based geometry-aware metrics, providing the first evaluation of PXR-based generation over maxillofacial anatomy. X-Splat outperforms NeRF- and GAN-based baselines, recovering individual teeth, cortical boundaries, and alveolar structure, including the mandibular canal which prior methods fail to reconstruct. Code will be available at https://github.com/tomek1911/X-Splat

0

cs.CV 2026-07-03

Windowed batch multiplication speeds large-kernel convolutions

by Wan Song, Wei Zhou +5 more

WBMM: Windowed Batch Matrix Multiplication for Efficient Large Receptive Field Convolution

By partitioning inputs into windows and using bias tables for weights, it turns irregular memory access into regular batched matrix multipli

abstract click to expand

Large kernel depthwise convolutions achieve strong performance but suffer from significant degradation as kernel size grows due to irregular memory access from gather-based computation; while Large Kernel Acceleration (LKA) helps on small feature maps, it becomes counterproductive on large feature maps, even slower than non-accelerated implementations. We propose Windowed Batch Matrix Multiplication (WBMM), which partitions input into contiguous windows and indexes a compact relative position bias table to construct weight matrices, enabling regular memory access via batched matrix multiplication. This yields a unique property: WBMM's throughput improves with larger windows, opposite to depthwise convolutions that degrade with larger kernels. Operator-level benchmarks show WBMM with 14x14 windows outperforms 5x5 depthwise convolution baselines in speed while providing a 7.8x larger per-layer receptive field. Combined with inter-block cross-window communication and hierarchical window reparameterization, WBMM achieves comparable or higher accuracy on ImageNet-1K, COCO, and ADE20K with 1.31-1.88x training speedup, and demonstrates consistent advantages across GPU, CPU, and edge devices without requiring specialized acceleration kernels. Our code is available at http://github.com/wansong-s/WBMM

0

cs.CV 2026-07-03

Existing models fail on 45-minute egocentric referring tasks

by Shunya Kato, Taiki Miyanishi +4 more

LongEgoRefer: A Benchmark for Long-Form Egocentric Video Referring Expression Comprehension

A benchmark built from Ego4D clips shows extreme sparsity and complex transitions defeat current Video REC approaches.

abstract click to expand

Egocentric videos capture rich and diverse human-object interactions and have emerged as a fundamental resource for understanding human activities related to objects. In this context, Video Referring Expression Comprehension (Video REC), the task of localizing the temporal and spatial extent of a referred object in video frames given a natural language query, plays a key role in linking textual descriptions to observed objects in untrimmed egocentric recordings. However, existing egocentric Video REC benchmarks primarily focus on short video clips, where some target object appears densely within frames. Such settings do not reflect real-world egocentric recordings, which are long-form, untrimmed, and characterized by sparse object occurrences and complex activity transitions. To address this limitation, we introduce LongEgoRefer, a novel and challenging benchmark constructed from long-form videos in the Ego4D dataset. LongEgoRefer contains 1,498 referring expressions with an average video duration of 45 minutes. The benchmark exhibits extreme target sparsity, detailed linguistic descriptions, and complex human-object interactions embedded in long, dynamic egocentric narratives. Consequently, it defines a demanding spatio-temporal grounding problem that requires models to identify both when an event occurs and where the referred object appears within extended video sequences. We evaluate existing Video REC approaches, including training-free baselines based on vision-language models combined with Grounded SAM2. Extensive experiments show that even advanced baselines and current state-of-the-art models struggle significantly on LongEgoRefer. These results highlight the intrinsic difficulty of long-form egocentric spatio-temporal grounding and emphasize the need for more robust video understanding models.

0

cs.CV 2026-07-03

Multimodal fusion lifts breast tumor classification to 77.64% accuracy

by Chuxi Nan, Di Wu +5 more

Multimodal Fusion for Fine-Grained Classification of Breast Fibroadenoma and Phyllodes Tumors

Ultrasound images plus clinical attributes and diagnostic text outperform imaging-only baselines on the fibroadenoma-phyllodes distinction.

abstract click to expand

Breast fibroadenoma (FA) and phyllodes tumor (PT) are fibroepithelial breast lesions with highly overlapping appearances on B-mode ultrasound, making benign and borderline PT prone to being misclassified as FA and complicating preoperative decision-making. Existing computer-aided diagnosis methods commonly rely on single-modal imaging features and insufficiently exploit complementary clinical and textual information. To address this limitation, we construct the FAPT-M Dataset, a pathology-confirmed multimodal dataset comprising 910 patients with strictly reviewed ultrasound images, structured clinical attributes, and ultrasound diagnostic descriptions. Based on this dataset, we propose a clinically guided multimodal framework that integrates DenseNet-based visual encoding, CLIP-inspired text encoding, and lightweight clinical encoding, and further introduces clinical-conditioned adaptive modulation, cross-modal Transformer fusion, and dual-path representation learning to improve feature alignment and multimodal interaction. Under patient-level five-fold cross-validation, the proposed method achieves an accuracy of 77.64%, F1-score of 73.38%, and AUC of 89.74%, outperforming representative CNN-, Transformer-, and vision-language-based baselines. Ablation studies and class-balanced evaluations further confirm the contribution of three-modality fusion and the key architectural components. Overall, this work provides an effective multimodal approach for fine-grained FA-PT classification and establishes a high-quality benchmark for multimodal breast ultrasound analysis.

0

cs.CV 2026-07-03

Real-time AR overlays virtual cards on TCG streams from ordinary cameras

by Anthony Cioppa, Antoine Verdonck +3 more

TCG-AR: Real-Time Multi-View Augmented Reality for Trading Card Game Streaming

Detects, orients, and identifies cards then renders models across views and outputs to OBS without markers or special hardware.

abstract click to expand

Trading card games are increasingly played and broadcast online, yet live streams remain mostly limited to flat top-down footage of the playing area. Augmenting such streams with virtual models of the played cards would improve the viewing experience, but most existing systems rely on instrumented playing surfaces and embedded chips, which are costly and impractical for casual players and large-scale events. In this work, we present TCG-AR, a novel real-time pipeline that augments trading card games using ordinary RGB cameras alone, without any physical markers or specialized hardware. Our pipeline detects, orients, and identifies the cards on the board, renders virtual content onto each card across all views, and can additionally compose a broadcaststyle view that summarizes the game state for spectators, streaming the augmented feeds to standard broadcasting software such as OBS. To train the detection, orientation, and identification models without manual labeling, we introduce an automatic procedure that generates annotated synthetic training data from a reference set of card images. Then, we evaluate several trained models on a new manually annotated dataset with real images, analyzing performance and runtime throughput that determine real-world usability. Overall, by relying only on commodity cameras and hardware, and by open-sourcing all code, models, and datasets, this work aims to serve as a reference for real-time trading card recognition and to make real-time augmented-reality streaming accessible to the broader community of players and streamers.

0

cs.CV 2026-07-03

Vision-language models predict eye scanpaths with 46% higher accuracy

by Susmit Agrawal, Matthias Bethge +1 more

DeepGaze3.5-VL: Modeling Scanpaths via Autoregressive Token Prediction

Discretizing fixations into text tokens lets pretrained models add viewer and task conditioning without custom modules.

abstract click to expand

Understanding human visual attention on a scene over time has applications in domains such as interface design and inferring cognitive states. Modeling visual scanpaths has historically relied on specialized architectures with hand-crafted priors. While these architectures can model fixation sequences, their rigid structural biases restrict easy extendability and flexible conditioning. For instance, integrating task-specific instructions or adapting to distinct viewer identities requires custom, disjoint architectural additions. We frame scanpath prediction purely as a discrete sequence modeling task. By mapping coordinates into a text vocabulary, we leverage the pretrained representations of Vision-Language Models. This framing absorbs diverse factors of variation: simple prompting allows for global conditioning, such as providing viewer identities to capture personalized biases, or task-specific objectives like visual search. The framework can also integrate per-fixation attributes, such as individual fixation durations, alongside spatial locations. The autoregressive alignment enables the scalable, exact computation of per-fixation log-likelihoods, directly equivalent to the commonly used Information Gain (IG) metric. Our model, DeepGaze3.5-VL, establishes a new state-of-the-art across multiple datasets, achieving 2.18 bits of IG on MIT1003, a 46% improvement over DeepGaze III. This advantage persists even when baselines use identical high-capacity vision encoders. Beyond predictive performance, our generative framework serves as a powerful computational tool for direct behavioral interventions, allowing for controlled in-silico simulations that would be experimentally difficult or impossible to conduct in vivo. We demonstrate this ability by performing controlled interventions on the durations of pre-saccadic fixations, recovering known oculomotor phenomena purely from data.

0

cs.CV 2026-07-03

Hand control works on egocentric video from monocular footage

by Yushuo Chen, Xiaoyu Shi +4 more

HandsOnWorld: Unconstrained Egocentric Video Generation with Camera-Disentangled Hand Control

A Plucker Hand Map and filtered monocular reconstructions let generators handle diverse real scenes without lab capture setups.

abstract click to expand

We present HandsOnWorld, a framework for hand-controlled egocentric video generation that forgoes multi-view and marker-based motion capture, learning instead from unconstrained monocular video. Such generality is bottlenecked by the scarcity of scalable 3D hand annotations: large egocentric corpora lack finger-level labels, whereas precise hand datasets are confined to narrow, instrumented settings, limiting prior hand-controlled generators to restricted scene distributions. We instead annotate 3D hands directly on in-the-wild egocentric video through monocular reconstruction, introducing a protagonist-centered annotation pipeline that filters the reconstructions at the action-semantic, image-quality, and 3D-geometric levels to build EgoVid-Pro, a dataset of clean, protagonist-only hand trajectories spanning 103K clips and roughly 12M frames across diverse everyday scenes. To resolve the camera-hand entanglement induced by large ego-motion, we further propose the Pl\"{u}cker Hand Map, a 3D-aware control signal that extends Pl\"{u}cker-ray representations from camera rays to the hand surface, disentangling camera and hand motion at the representation level. Experiments show that \method surpasses prior hand-controlled generators in reconstruction fidelity and control accuracy, and generalizes to out-of-distribution everyday scenes beyond the laboratory datasets on which prior methods rely.

0

cs.CV 2026-07-03

Voxel-based LiDAR detectors more vulnerable to attacks than pillar-based

by Adwait Chandorkar, Kai Krink +3 more

Comprehensive Robustness Analysis of LiDAR-based 3D Object Detection in Autonomous Driving

A new framework shows recent high-capacity models fail robustness tests as badly as older ones, urging benchmarks that check security too.

abstract click to expand

Recent advancements in LiDAR-only 3D object detection have demonstrated improved detection accuracy over benchmark datasets. However, the adversarial robustness of these models remains untested. Very few adversarial robustness studies exist for LiDAR-only 3D object detection and unfortunately, even they are limited to legacy models. Moreover, there is a systemic gap in the existing evaluation frameworks that rely simply on mAP ignoring other structural and predictive factors. To fill this gap, we propose a holistic framework that evaluates adversarial robustness using two structural factors (point cloud density and point cloud localization) and three predictive factors (misclassification, localization error, distance from ego). Using this framework, we perform an empirical study and critical analysis on recent and legacy state-of-the-art models using adversarial attacks specifically designed for LiDAR-based models. Our key finding is that high-capacity, voxel-based detectors are more susceptible to structured coordinate perturbations than pillar-based detectors. Additionally, non-anchor-based detectors demonstrate poor adversarial robustness, which necessitates rethinking model training techniques. Overall, our results demonstrate that recent models are as vulnerable to adversarial attacks as their predecessors. Therefore, we argue that there is a need to improve the evaluation benchmarks for 3D object detection that not only reward architectural modifications for improving detection accuracy, but also evaluate whether the design choices improve adversarial robustness.

0

cs.LG 2026-07-03

Stricter splits expose hidden failures in spatial data

by Prathamesh Patil, Arpit Jain +1 more

Beyond the Performance Illusion: Structure-Aware Stratified Partitioning and Curriculum Distributionally Robust Optimization for Spatially Correlated Domains

Structure-aware partitioning and curriculum robust training cut leakage and improve calibration on correlated domains.

abstract click to expand

Performance evaluation in AI systems commonly assumes that random dataset splits produce independent and identically distributed (i.i.d.) subsets. We show that this assumption often breaks down in spatiotemporally correlated domains such as aerial surveillance, precision agriculture, and medical imaging, leading to two systematic failures: data leakage, where correlated samples span training and validation splits and inflate performance estimates, and hidden stratification, where errors on minority subpopulations are obscured by aggregate metrics. To address these issues, we propose a unified evaluation and training framework for spatially correlated data. We introduce Structure-Aware Stratified Partitioning (SASP), which constructs validation splits that reduce spatiotemporal leakage while preserving meaningful class balance, and Curriculum Distributionally Robust Optimization (CDRO), a curriculum-based relaxation of distributionally robust training that stabilizes optimization under these stricter splits. Across multiple benchmarks, this combination yields consistently improved generalization, more reliable confidence calibration, and exposes failure modes that remain hidden under conventional random-split evaluation.

0

cs.CV 2026-07-03

Multiple prototypes model intensity variation for precise medical segmentation

by Yuqi Liu, Yufei Chen +3 more

Embracing Intra-Class Heterogeneity for Semi-Supervised Medical Image Segmentation: From Diversity to Precision

MPCL captures intra-class heterogeneity via aligned prototypes and knowledge transfer, outperforming prior methods especially with scarce la

abstract click to expand

Due to the scarcity of expert-annotated data, Semi-Supervised Medical Image Segmentation (SSMIS) has emerged as a promising approach. Many anatomical structures in medical images exhibit significant intra-class heterogeneity, with different regions showing heterogeneous intensity patterns within the same structure. However, existing methods inadequately exploit this intensity-manifested intra-class heterogeneity, resulting in uniform structural representations and imprecise segmentation. Furthermore, the scarcity of labeled data makes it more difficult to effectively capture such complex heterogeneity. To address this, we propose Multiple Prototype Contrastive Learning (MPCL), an SSMIS framework that possesses better diversity and better precision. It consists of three novel designs: First, we provide structural representations with better diversity and propose Intensity-aligned Heterogeneous Prototype Generation (IHPG) that effectively models intra-class heterogeneity by generating multiple prototypes aligned with intensity characteristics. Second, we further enhance more diverse structural representations and build a solid foundation for more precise segmentation through Prototypical Space Optimization (PSO) that systematically optimizes a more discriminative and generalizable prototypical space. Finally, we achieve segmentation results with better precision through Dual-branch Knowledge Alignment (DKA) that efficiently promotes intra-class heterogeneity knowledge transfer from prototypical space to the segmentation network. Extensive experiments on three medical image datasets with significant intra-class heterogeneity demonstrate that MPCL significantly outperforms existing methods, especially under extremely limited labeled data.

0

cs.CV 2026-07-03

Coupled diffusions recover kinematics from single images

by Wentao Zheng, Ancong Wu

PWM-ArtGen: Part World Model for Articulated Object Generation

Part World Model learns joint visual dynamics and parameters on unannotated data, outperforming baselines with zero-shot generalization.

abstract click to expand

The key challenge in articulated 3D object generation from a single image is accurately predicting the underlying kinematic structure. Existing methods either infer kinematic parameters directly from a static image that lacks dynamic part-level kinematic relationships, or estimate parameters from visual dynamics generated from a single image, which is prone to accumulated errors of two steps. Moreover, the limited scale and diversity of existing annotated datasets further hinder generalization to complex, real-world objects. To overcome these limitations, we propose to learn the joint distribution of visual dynamics and kinematic parameters. Recognizing that articulated objects can be formulated as dynamic systems, we propose a unified Part World Model called PWM-ArtGen. To leverage unannotated data, this model couples action diffusion and image diffusion with independent diffusion timesteps, which enables visual branch co-training. We further curate a photorealistic dataset of 19.7k part-level image pairs without kinematic annotations, to support co-training. Experiments demonstrate that PWM-ArtGen substantially outperforms existing baselines in the resting state and exhibits strong zero-shot generalization to out-of-distribution objects.

0

cs.CV 2026-07-03

Degrading aesthetics blocks identity leaks in custom diffusion models

by Songping Wang, Yueming Lyu +6 more

Hierarchical Anti-Aesthetics: Protecting Facial Privacy against Customized Diffusion Models

Hierarchical framework applies global and local quality reduction to limit recognizable faces from personalized AI generators.

abstract click to expand

The rise of customized diffusion models has fueled a boom in personalized visual content creation, but it also introduces serious risks of malicious misuse, thereby posing threats to personal privacy. Image aesthetics are strongly correlated with human perception of image quality. Motivated by this observation, we address facial privacy protection from a novel aesthetic perspective by degrading the generation quality of maliciously customized models, thus reducing facial identity leakage. Specifically, we propose a Hierarchical Anti-Aesthetics (HAA) framework that exploits aesthetic cues at multiple perceptual levels. HAA consists of two key branches: (1) Global Anti-Aesthetics, which degrades overall aesthetics and generation quality by constructing a global anti-aesthetic reward mechanism and a corresponding loss; and (2) Local Anti-Aesthetics, which disrupts facial identity by using a local anti-aesthetic reward mechanism and loss to guide adversarial perturbations toward facial regions. By integrating both branches, HAA achieves anti-aesthetic degradation from a global to a local level during customized generation. Extensive experiments show that HAA outperforms existing methods in identity removal, providing an effective tool for protecting facial privacy.

0

cs.CV 2026-07-03

Dual experts plus adaptive distillation enable HSI imitation in complex scenes

by Lu Pan, Hongwei Zhao

ComplexMimic: Human-Scene Interaction Imitation in Complex 3D Environments

The framework recovers usable signal from imperfect MoCap by balancing motion tracking with collision-aware adaptation and weighting hard tr

abstract click to expand

Physics-based Human-Scene Interaction (HSI) imitation learning is crucial for embodied intelligence as it bridges the gap between kinematic 3D motions and real-world dynamics. However, most existing methods focus on simplified scene settings, leaving complex environments largely unexplored, which limits their applicability in real-world scenarios. In this paper, we focus on HSI mimicry in complex environments. Under this complex setting, we observe an inherent trade-off between successfully performing interaction and maintaining natural, physically plausible motions. To address this challenge, we propose ComplexMimic, a framework that reconstructs diverse HSI by interpreting imperfect MoCap data. First, we introduce a Dual Flow Strategy, which learns two complementary experts: an imitation expert for accurate motion tracking and an interaction expert for collision-aware adaptation in complex scenes. Second, naive multi-expert distillation, which treats all experts equally, often under-samples challenging behaviors, limiting effective learning. To mitigate this issue, we propose a difficulty-aware distillation strategy that adaptively weights supervision and prioritizes hard-yet-learnable trajectories guided by failure statistics and learning progress signals. Extensive experiments on three benchmark datasets demonstrate that our approach outperforms current state-of-the-art methods. Our implementation is available at https://github.com/LuPan23/ComplexMimic.

0

cs.CV 2026-07-03

Gemini and Qwen lead zero-shot tests on Nigerian plates

by Ismail Ismail Tijjani, Ahmad Abubakar Mustapaha +2 more

Evaluating Vision-Language Models as a Zero-Shot Learning Alternative to You Only Look Once and Optical Character Recognition for Nigerian License Plate Recognition

They post the lowest character error rates among five VLMs on 88 real-world images, suggesting a simpler path than YOLO plus OCR.

abstract click to expand

License Plate Recognition (LPR) systems are critical tools in traffic monitoring, security enforcement, and urban mobility management. Traditional LPR systems often rely on a multi-stage pipeline involving object detection using You Only Look Once (YOLO) and Optical Character Recognition (OCR), which suffer from limitations such as high resource demands, poor performance in unstructured environments, and the need for large annotated datasets. This study explores the potential of Vision-Language Models (VLMs) as a unified, zeroshot learning solution for Nigerian license plate recognition. Using a curated dataset of 88 challenging real-world images collected in Nigeria, we evaluate five selected VLMs: Gemini 2.0 Flash Exp (Google DeepMind), Qwen2.5-VL-7B-Instruct (Alibaba), GPT-4o (OpenAI), Claude 4 Sonnet (Anthropic), and Llama 3.2 Vision 90b (Meta). Results based on Character Error Rate (CER) reveal that Gemini and Qwen significantly outperform other models in both accuracy and robustness, on the challenging image scenarios. This work highlights the practical advantages of VLMs over YOLO+OCR, questions the claims by model providers, and compares the performances of the VLMs.

0

cs.CV 2026-07-03

Conditioning on anatomy, time and context improves radiology report retrieval

by P. Sloan, E. Simpson +1 more

Spatio-Temporal and Clinical Conditioning for Fine-Grained Radiology Report Retrieval

STAR3 detects regions and uses exam changes plus clinical data to retrieve better sentences, beating prior methods on MIMIC-CXR metrics.

abstract click to expand

Radiology is vital to modern healthcare, but rising imaging demand and persistent workforce shortages strain reporting capacity and clinical workflows. Automated radiology report generation has the potential to support radiologists and help alleviate this burden; however, existing retrieval-based methods remain rigid, lack explicit anatomical grounding, and do not account for longitudinal disease progression or available clinical context. In this work, we introduce STAR3, a multimodal, spatio-temporal, attentive retrieval framework for radiology report generation that aligns region-level anatomical information with clinical indications and longitudinal changes across chest X-ray studies. Our framework employs an object detector to identify anatomically meaningful regions and retrieves semantically relevant report sentences conditioned on both current clinical context and changes observed between prior and current examinations. This design enables anatomically and temporally grounded report generation that better reflects clinical reporting practice. Experiments on the MIMIC-CXR dataset demonstrate that STAR3 outperforms current retrieval-based approaches on retrieval, NLP and clinical metrics, highlighting the value of conditioning retrieval anatomically, temporally and clinically for advancing automated radiology report generation.

0

cs.CV 2026-07-03

Global facade benchmark caps model accuracy at 33 IoU

by Yi Wang, Fan Wang +10 more

UnderOneFacade: Worldwide Facade Semantic Segmentation Benchmark Dataset

2.7 billion points across continents expose poor cross-domain performance on fine architectural details

abstract click to expand

Globally consistent semantic digital twins require centimeter-accurate and geographically transferable 3D facade segmentation. However, progress in facade parsing is limited by the lack of large-scale, standardized benchmarks for evaluating cross-domain generalization. Existing datasets are geographically narrow, semantically inconsistent, or insufficiently precise. We introduce UnderOneFacade, the largest cross-country and cross-continent 3D facade benchmark to date, comprising centimeter-accurate point clouds with hierarchical, harmonized, and architecturally grounded semantic labels totaling 2.7 billion annotated points. Through a systematic evaluation of representative point-, graph- and transformer-based architectures, we show that current methods struggle to recognize fine-grained architectural elements and degrade significantly across geographic domains, with the best models achieving only up to 33 IoU on the fine-grained LoFG3 benchmark. By combining geometric precision with standardized semantics at unprecedented scale, UnderOneFacade establishes a rigorous benchmark for developing robust and transferable 3D segmentation models. The dataset, evaluation scripts, and pretrained models will be released upon publication.

0

cs.CV 2026-07-03

AutoMIA designs 3D mirror illusions automatically from image pairs

by Xiaopei Zhu, Zeyuan Li +2 more

Mirror Illusion Art

Four new mechanisms stabilize joint shape and color optimization to produce smooth printable objects in about 76 seconds.

abstract click to expand

Mirror Illusion Art is a novel reflection-conditioned 3D illusion where one object yields two target appearances (front and mirror). The task is formulated as inverse design from two target 2D images (front and mirror) to a printable 3D object with geometry and texture. Prior topology-driven and shadow-based approaches demand substantial manual effort, optimize shape only, and often yield non-smooth or incomplete geometry. To address these challenges, we propose AutoMIA, an automated Mirror Illusion Art design pipeline that jointly optimizes shape and color. To stabilize optimization and suppress artifacts, four mechanisms are introduced: (1) projection-alignment component (PAC) selection to reduce surface noise, (2) position-weighted adaptive (PWA) suppression for background noise, (3) internal voxel preservation (IVP) to prevent internal fractures, and (4) shape-color decoupled (SCD) optimization that balance shape and color optimization. AutoMIA generate diverse smooth Mirror Illusion artworks successfully both in the digital and physical world, with only around 76s design time and 2.6 GB memory on average using a single RTX 3090, advancing inverse graphics and computational design. Our code is available at https://github.com/zxp555/AutoMIA.

0

cs.CL 2026-07-03

Models ace multiple choice but fail open art history tasks

by Gianmarco Spinaci, Lukas Klic +1 more

EduArt: An educational-level benchmark for evaluating art history knowledge in large language models

Real exam questions reveal format controls whether LLMs can deploy knowledge or only recognize answers.

abstract click to expand

Large language models now score near ceiling on general benchmarks, but these aggregate measures reveal little about how models behave within single disciplines. Existing art-focused evaluations rely on synthetic questions and rarely report item-level properties. This paper introduces EduArt, an educational-level benchmark for art-historical knowledge and visual reasoning in multimodal LLMs. EduArt comprises 871 human-authored questions from Italian secondary-school exercises and US Advanced Placement Art History exams, spanning two languages and seven formats from multiple choice to in-text word placement and error identification. Twelve models from six provider families were evaluated under a default answer-only condition and a motivation condition requiring written justification, and characterized using Classical Test Theory and a logistic regression isolating the effects of format, language, image presence, and model. The benchmark showed strong psychometric properties (mean discrimination 0.514, 82.3 percent good discriminators), while multiple-choice accuracy saturated near ceiling for six models, showing recognition formats alone cannot distinguish frontier models. Format was a strong independent predictor of accuracy: models exceeding 94 percent on multiple choice fell to 23.9 percent on open completion (Claude Opus 4.6) and 6.2 percent on error identification (Claude Sonnet 4.6). The motivation condition changed accuracy in a predominantly negative, family-dependent direction. These dissociations indicate that art-historical knowledge and the ability to deploy it are distinct capabilities, and that single-format benchmarks overestimate what models can reliably do. Mapping this capability profile is a precondition for responsible use of multimodal LLMs in art-historical scholarship, where tasks demand producing and manipulating content rather than selecting from fixed options.

0

cs.RO 2026-07-03

Cross disparity flags moving points for accurate SLAM

by Sujan Kumar Dhali, Bhaskar DasGupta

A Stereo Visual SLAM System Using Object-Level Motion Estimation and Geometric Filtering Based on Cross Disparity

OCD SLAM adds geometric filtering and object tracking to ORB-SLAM2, cutting trajectory error on KITTI sequences with traffic.

abstract click to expand

This paper presents OCD SLAM, a dynamic stereo visual SLAM framework that extends ORB-SLAM2 by jointly addressing dynamic objects and dynamic features in the scene. Usual visual SLAM systems operating in dynamic environments often fail in the presence of moving objects, due to the static-world assumption used in pose estimation and mapping. To address this predicament, we introduce a novel geometric approach based on the discrepancy between disparity and a newly proposed notion called ``cross disparity'', which exploits both temporal and stereo inconsistency to identify dynamic feature points. Complementary to this feature-level motion analysis, OCD SLAM integrates a 3D object detection module (SMOKE) with Kalman filter-based object tracking to perform object-level motion classification, enabling robust separation of static and dynamic scene elements for accurate pose estimation. The proposed approach has been evaluated on various sequences from the KITTI Odometry and KITTI Raw datasets. Results demonstrate that OCD SLAM achieves significant improvement in trajectory accuracy compared to ORB-SLAM2 and several state-of-the-art dynamic SLAM methods. Ablation studies further demonstrate the effectiveness of the cross disparity module in the KITTI Raw dataset and show that this method is able to detect dynamic features that are missed by the 3D object detection scheme alone.

0

cs.CV 2026-07-03

Stochastic control reframes motion diffusion to handle mixed constraints

by Xiaofei Hui, Bo Yan +3 more

Training-free Controllable Human Motion Generation under Heterogeneous Constraints

A single training-free mechanism now supports both smooth objective rules and sparse criterion feedback in human motion generation.

abstract click to expand

Training-free controllable motion generation has attracted growing interest for enabling flexible constraint enforcement without constraint-specific training. However, existing training-free methods require constraints to be continuous objective-based with differentiable losses, while many real-world requirements are criterion-based and provide only discontinuous, sparse, or even black-box feedback. In this paper, we propose Motion-Inference-as-Control (MIC), the first training-free motion generation framework that handles both continuous objective-based and criterion-based motion constraints under a shared mechanism. The key idea is to cast diffusion-based motion generation as a stochastic control problem. This perspective not only provides principled and practically effective step-wise control laws that support criterion-based constraints without requiring differentiability and naturally accommodate objective-based constraints as a special case, but also motivates a control-oriented constraint coordination mechanism that adaptively balances and reconciles motion constraints during generation. Experiments across diverse constraint settings demonstrate the effectiveness of our framework.

0

cs.CV 2026-07-03

SVD on linear probe weights isolates geometric subspaces in ViTs

by Weichen Zhou, Yawen Zou +4 more

Understanding Geometric Representations in Self-Supervised Vision Transformers via Subspace Intervention

DINOv2 aligns spatial features for linear extraction while MAE disperses them, with precision peaking in intermediate layers

abstract click to expand

We introduce a controlled subspace intervention framework to investigate how self-supervised Vision Transformers (ViTs) encode dense geometric information. While linear probing is widely used to assess geometric representations, it treats features as a black box, failing to disentangle the underlying topology. To address this issue, we decompose the weights of converged linear probes to isolate the low-rank subspaces containing explicit geometric signals using Singular Value Decomposition (SVD). Our perspective yields three key insights: (1) Pre-training objectives determine how features are encoded. DINOv2 aligns spatial features for efficient linear extraction, while Masked Autoencoders (MAE) tend to disperse these signals, requiring a broader spatial context. (2) Explicit geometric representations are highly compressible, suggesting dense predictive heads could potentially be constrained to low-rank subspaces with minimal performance loss. (3) The layer-wise task affinity suggests that geometric precision peaks at intermediate layers before yielding to semantic abstraction in the final layers. By connecting internal encoding mechanics with downstream performance, these findings provide a basis for effective feature selection and lightweight decoder design. The source code is available at https://github.com/Zhou-Weichen/Geosubprobe.

0

cs.LG 2026-07-03

Liquid states disentangle engine degradation from conditions

by Weizhi Nie, Weijie Wang +1 more

Liquid Latent State Dynamics for Interpretable Turbofan Degradation Modeling

The model lowers sensor forecast RMSE to 0.2266 and raises degradation-state correlation to 0.596 on C-MAPSS data.

abstract click to expand

Multivariate time-series models for prognostics are often evaluated by point prediction accuracy, yet their internal states rarely expose a coherent degradation process. We study liquid neural networks as latent dynamics models for aircraft engine health monitoring on the C-MAPSS benchmark. The proposed model encodes a history window into a latent state, evolves that state with a liquid transition model, and decodes future sensor observations. To separate health evolution from operating-condition variation, the latent state is factorized into degradation and condition components. Remaining useful life, monotonic risk, and latent-consistency losses supervise the degradation component, while condition prediction and decorrelation losses discourage operating-condition leakage. Across FD001--FD004, the full disentangled model improves overall sensor forecasting RMSE from 0.2438 for a GRU baseline to 0.2266, with the largest gains on the multi-condition subsets FD002 and FD004. The learned degradation state also forms a clearer temporal degradation axis, reaching an average state-speed Spearman correlation of 0.5960. Direct remaining-useful-life regression remains stronger for the GRU baseline, indicating that the proposed representation is currently more effective as an interpretable world model for degradation dynamics than as a calibrated lifetime regressor. These results suggest that liquid latent dynamics can bridge predictive maintenance forecasting and inspectable health-state modeling.

0

cs.LG 2026-07-03

Newer lightweight CNNs deliver selective gains only

by Tasnim Shahriar

Do Newer Lightweight CNNs Perform Better Under Resource Constraints? A Controlled Multigenerational Study of Architecture, Initialization, Training Budget, and Efficiency

Fixed-protocol tests on CIFAR and Tiny ImageNet place EfficientNet-B0 on every Pareto frontier and show MobileNetV3-Small beating its succes

abstract click to expand

Newer lightweight convolutional neural networks are often presented as improving predictive performance and deployment efficiency, but such claims require controlled evaluation. This study compares nine lightweight CNN model packages across CIFAR-10, CIFAR-100, and Tiny ImageNet under a shared downstream protocol. We report top-1 accuracy, macro F1, top-5 accuracy, parameter count, FP32 storage, GMACs, batch-size-1 latency on an NVIDIA L4 and AMD Ryzen 5 5500U CPU, peak PyTorch CUDA allocated tensor memory, and point estimate Pareto frontiers. EfficientNetV2-S achieves the highest observed top-1 accuracy on CIFAR-10 and CIFAR-100 at 97.57% and 86.98%, while RepViT-M1.0 leads Tiny ImageNet at 79.87%. EfficientNet-B0 remains within 0.22, 0.85, and 1.79 percentage points of the best result on the three datasets while using approximately 79% fewer parameters and 86% fewer GMACs than EfficientNetV2-S. It also appears on every evaluated accuracy and resource Pareto frontier, making it the most consistently competitive intermediate-budget option. MobileNetV3-Small has the lowest GMAC count, is the fastest model under both CPU thread settings, and records higher observed accuracy than MobileNetV4-Conv-S on all three datasets. Under random initialization, it leads MobileNetV4-Conv-S by 2.55, 1.76, and 0.99 points, with paired test-set intervals excluding zero for the fixed trained models. EfficientNet-B0 remains 3.29, 10.10, and 17.54 points below its pretrained counterpart after 100 epochs of scratch training, despite requiring about five times the recorded training time. SqueezeNet1.1 has the fewest parameters and lowest peak CUDA allocation, but substantially weaker accuracy. Latency rankings differ sharply between the L4 and CPU environments, showing that GMACs alone do not predict measured inference performance. Overall, newer designs provide selective rather than universal gains

0

cs.CV 2026-07-03

Dual-critic diffusion aligns LiDAR features for any weather

by Shuyao Li, Chuanxing Geng +3 more

Open-Weather Robust 3D Detection via Dual-Critic Diffusion Alignment

Recovers degraded scans to clean distributions using detection accuracy and distributional critics without paired examples or weather labels

abstract click to expand

Robust 3D object detection under adverse weather remains a critical hurdle for autonomous driving. Despite progress with LiDAR-4D radar fusion, most methods are constrained by a closed-world assumption, implicitly requiring training and test weather to align in both type and severity. This premise fails in practice: the open-ended nature of weather, and even variations within a single type like rain, cause dramatically different LiDAR degradation patterns, leading to significant performance drops in unseen conditions. To address this, we present Dual-Critic Guided Diffusion Alignment (DCDA), a weather-agnostic framework that learns to recover degraded LiDAR features toward a clean manifold. Rather than modeling specific weather types, DCDA employs a 4D radar-conditioned diffusion process to progressively refine features, guided by two complementary critics. (i) A detection-guided critic, anchored by a pre-trained clean-weather model, ensures that the refined features retain object-level discriminability and localization accuracy. (ii) A weather adversarial critic enforces holistic distributional consistency with clean-weather representations. By aligning features through semantic and distributional constraints rather than explicit weather modeling, DCDA generalizes effectively to unseen weather types and severities without requiring paired data or weather labels. We further introduce a structured open-weather benchmark with held-out type-severity combinations and extensive experiments verify DCDA's advantages.

0

cs.CV 2026-07-03

MolSight adds bond adjacency to vision tokens for molecular images

by Wenda Wang, Yihan Tong +2 more

MolSight: A Graph-Aware Vision-Language Model for Unified Chemical Image Understanding

Two modules for topology and grounding let the model outperform prior VLMs and chemistry tools on visual tasks.

abstract click to expand

Using molecular large language models (LLMs) as a unified framework for understanding molecular structures and functions is emerging as a new trend in tasks such as molecular design and drug discovery. However, these models struggle to fully capture the visual representation of molecular structures, limiting their potential. While existing molecular vision-language models (VLMs) show promise, they still face challenges in structural alignment and lack the necessary topological modeling for accurate molecular understanding. To address this, we propose MolSight, a graph-aware vision-language model framework designed to enhance the understanding of molecular images by VLMs. MolSight integrates a Molecular Topology Module to inject chemical-bond adjacency information into vision tokens, and a Molecular Grounding Module to align visual features with chemical symbolic semantics. Our experiments demonstrate that MolSight significantly outperforms existing VLMs, molecular LLMs, and specialized tools across multiple chemical visual understanding tasks, achieving a new level of molecular image reasoning.

0