cs.GR — Pith

Top Pith

cs.GR 2026-05-29

3D scene graph plans portraits before shutter click

by Ruixiang Jiang, Chang Wen Chen

Before the Shutter: Aesthetic and Actionable Portrait Photography Planning in 3D Scenes

Generates human pose, camera, and lighting setups that raters prefer over post-capture baselines while staying physically valid.

Portrait photography is largely decided before the shutter opens: the subject's pose, the camera configuration, and the lighting devices must be coordinated within the surrounding 3D scene. In contrast, most existing computational methods focus on post-production in 2D image space, such as retouching, relighting, or editing images that already exist; pre-capture photographic planning remains largely unexplored. We introduce 3D aesthetic portrait planning, the task of generating human pose, camera, lighting, and exposure plans that produce visually compelling portraits while satisfying geometric and photometric feasibility in a 3D scene. Our approach builds a Photographic Scene Graph that represents scene affordances, subject-scene relations, and portrait-relevant lighting structure. Built on this representation, we perform aesthetic-guided comparative planning over previous attempts and current viewfinder observations. Experiments across diverse indoor and outdoor scenes show that our method produces portraits preferred by human raters and MLLM evaluators over competitive baselines, while maintaining high physical plausibility. Together, our results suggest a path from post-capture correction toward pre-capture computational portrait planning. Project repository: https://github.com/songrise/Before-the-Shutter

cs.GR 2026-05-18 2 theorems

One diffusion checkpoint stipples any density at fixed cost

by Ofir Gilad, Aleksander Plocharski +2 more

StippleDiffusion: Capacity-Constrained Stippling using Controlled Diffusion

Late-stage ControlNet conditioning plus gated projection lets a single model match optimized baselines while generalizing to unseen point budgets.

Stipple patterns, point sets whose local density tracks a target image, are traditionally produced by per-density iterative optimizers, which are slow, non-differentiable, and must be re-run from scratch for each new target. Learned alternatives have so far addressed only unconditional point generation; capacity-constrained, image-conditioned stippling has remained out of reach. We present the first diffusion-based sampler that simultaneously satisfies a learned local point-distribution prior and a continuous, image-defined capacity constraint at inference. The method is a ControlNet branch built on top of an optimal-transport-grid point-set diffusion baseline, conditioned on the target density map and a high-resolution image. Two design choices make the combination tractable: training and inference are restricted to the late-stage denoising regime, initialized from a density-weighted rejection sample, and the standard zero-convolution injection is replaced with a sigmoid-gated 1x1 projection that preserves the base model's blue-noise structure under hard density signals. A single trained checkpoint accepts arbitrary target densities at inference, generalizes to point budgets that were not seen during training, and produces stipples in time nearly independent of the output point count. On the Icons-50 benchmark, our learned sampler reaches parity with per-density-optimized baselines on every reported metric while remaining differentiable end-to-end.
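The abstract mentions initializing from a density-weighted rejection sample. As a rough illustration only, here is a minimal NumPy sketch of that generic technique; the function name `rejection_sample_points`, the unit-square coordinate convention, and the batching scheme are assumptions for this example and are not taken from the paper.

```python
import numpy as np

def rejection_sample_points(density, n_points, rng=None, max_batches=1000):
    """Draw n_points in [0, 1)^2 with probability proportional to `density`.

    `density` is a 2D non-negative array (rows = y, columns = x).
    Hypothetical helper for illustration; not the authors' implementation.
    """
    rng = np.random.default_rng() if rng is None else rng
    density = np.asarray(density, dtype=float)
    peak = density.max()
    if peak <= 0:
        raise ValueError("density must have at least one positive entry")
    h, w = density.shape
    accepted, count = [], 0
    for _ in range(max_batches):
        cand = rng.random((4 * n_points, 2))           # uniform (x, y) candidates
        rows = np.minimum((cand[:, 1] * h).astype(int), h - 1)
        cols = np.minimum((cand[:, 0] * w).astype(int), w - 1)
        # Accept each candidate with probability density / peak.
        keep = rng.random(len(cand)) * peak < density[rows, cols]
        accepted.append(cand[keep])
        count += int(keep.sum())
        if count >= n_points:
            break
    else:
        raise RuntimeError("not enough samples accepted; density may be too sparse")
    return np.concatenate(accepted)[:n_points]
```

A point set drawn this way already follows the target density on average but has no blue-noise spacing, which is consistent with the abstract's description of using it only as a starting point for late-stage denoising.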

cs.CV 2026-05-13 2 theorems

WildRelight dataset adapts synthetic relighting models to real scenes on the fly

by Lezhong Wang, Mehmet Onurcan Kaya +2 more

WildRelight: A Real-World Benchmark and Physics-Guided Adaptation for Single-Image Relighting

Temporal sequences of natural light supply self-supervision that aligns models with outdoor statistics without ground-truth relit images.

Recent single-image relighting methods, powered by advanced generative models, have achieved impressive photorealism on synthetic benchmarks. However, their effectiveness in the complex visual landscape of the real world remains largely unverified. A critical gap exists, as current datasets are typically designed for multi-view reconstruction and fail to address the unique challenges of single-image relighting. To bridge this synthetic-to-real gap, we introduce WildRelight, the first in-the-wild dataset specifically created for evaluating single-image relighting models. WildRelight features a diverse collection of high-resolution outdoor scenes, captured under strictly aligned, temporally varying natural illuminations, each paired with a high-dynamic-range environment map. Using this data, we establish a rigorous benchmark revealing that state-of-the-art models trained on synthetic data suffer from severe domain shifts. The strictly aligned temporal structure of WildRelight enables a new paradigm for domain adaptation. We demonstrate this by introducing a physics-guided inference framework that leverages the captured natural light evolution as a self-supervised constraint. By integrating Diffusion Posterior Sampling (DPS) with temporal Sampling-Aware Test-Time Adaptation (TTA), we show that the dataset allows synthetic models to align with real-world statistics on-the-fly, transforming the intractable sim-to-real challenge into a tractable self-supervised task. The dataset and code will be made publicly available to foster robust, physically-grounded relighting research.

cs.CV 2026-07-03

Manifold projections extract novel views from any video model

by Jinxi Li, Tianyi Zhang +5 more

NeoMap: Training-free Novel-View Synthesis from Single Images and Videos

Training-free iterations locate consistent viewpoints already present in the data manifold of pre-trained generators.

We study the challenging problem of novel-view video synthesis from single images or monocular videos. Existing methods assume that pre-trained video models lack native novel-view synthesis capability and enforce view alignment via camera conditioning, task-specific fine-tuning, or stepwise hard denoising guidance; they often suffer from artifacts and compromised global scene consistency. In this paper, we introduce NeoMap, a training-free framework designed to locate high-fidelity, view-consistent novel-view solutions from general pre-trained video models. Our key insight is that promising novel-view solutions are already encoded within the natural video data manifold learned by pre-trained models, so the challenge reduces to locating them. We do so with convergent manifold alternating projection iterations that optimize the initial noise. Extensive experiments demonstrate that NeoMap significantly outperforms all existing methods across 3 standard novel-view synthesis benchmarks (Tanks-and-Temples, LLFF, and DAVIS), achieving state-of-the-art generation fidelity and top-tier view consistency.

cs.CV 2026-07-03

Pixel diffusion creates 3D Gaussian splats directly in one pass

by Duy Cao, Phong Nguyen-Ha

PixGS: Pixel-Space Diffusion for Direct 3D Gaussian Splat Generation

Bypassing latent compression produces higher-quality assets with 1-second inference on a single GPU.

Recent advances in 3D content generation from text or images have achieved impressive results, yet view inconsistency from 2D generators and the scarcity of high-quality 3D data remain significant bottlenecks. Existing solutions typically adapt large-scale pre-trained text-to-image latent diffusion models to generate 3D Gaussian Splats (3DGS). However, these approaches often rely on training complex cascade pipelines that are computationally expensive and hard to scale. Most critically, the quality of generated 3D assets is inherently constrained by each component's capacity and by the compressed latent space, leading to decoding artifacts and accumulated errors. To address these limitations, we propose PixGS, a single-stage pipeline for direct high-quality 3DGS generation, which leverages recent advances in pixel-space diffusion to bypass lossy latent compression while still benefiting from the vast 2D generative priors. By directly denoising 3D Gaussian attributes at each timestep, our method enables precise, splat-level regularization of both appearance and geometry. Furthermore, we introduce a comprehensive supervision strategy that incorporates surface normals, depth, and high-frequency structural information, which is often overlooked in prior work. Experiments demonstrate that PixGS outperforms current state-of-the-art methods while maintaining a fast inference speed (1s on a single A100 GPU), offering a robust and efficient alternative to multi-stage generation pipelines.

cs.CV 2026-07-02

Auxiliary data boosts YOLO detection of howler monkeys

by Gabriel Ferri Schneider, Guido Luis Glufke Mainardi +5 more

Computer Vision for Wildlife Monitoring: Detecting Brown Howler Monkeys using YOLO

Fine-tuning with extra images reduces review time for conservationists monitoring canopy bridges.

Urban expansion threatens global biodiversity, especially affecting arboreal species due to the fragmentation of forest habitats. The movement of arboreal species across disjointed forest patches increases mortality risk and thus compromises their conservation. Installing canopy bridges can be a viable strategy, yet continuous monitoring of their use by arboreal species, typically carried out with camera traps, is essential for ensuring their effectiveness. However, this method often produces false-positive images that demand review time from conservationists. Computer vision algorithms can streamline the task of detecting target species using the canopy bridges. In this study, we explored the automatic detection of brown howler monkeys (Alouatta guariba) in videos obtained by camera traps. Given the need for a large number of annotated images of the target animals to train the algorithms, we tested the incorporation of auxiliary data to improve detection models, fine-tuning the YOLOv10 framework using varying proportions of this data. Improving these automatic detection techniques contributes to conservation efforts by providing automatic tools to monitor solutions that minimize the impact of human interference on animals' habitats.

cs.CV 2026-07-02

Video model turns monocular video into consistent 4D Gaussians

by Liyuan Zhu, Shengyu Huang +7 more

World from Motion: Generative Dynamic Gaussian Reconstruction from Monocular Video

Conditioning on 3D renderings corrects artifacts and fills gaps before distilling to one dynamic scene model.

We present World from Motion, a method for generating freely renderable dynamic 3D Gaussian representations from monocular videos. Our approach conditions a video model on dense, pixel-aligned renderings that encode appearance, geometry, and 3D scene motion along both input and target camera trajectories to correct rendering artifacts and fill in missing regions from an initial reconstruction. To train this model, we construct a dataset of aligned multiview video pairs and dynamic 3DGS representations, with simulated artifacts characteristic of monocular reconstruction. At test time, we distill the model's generations, including newly observed regions and motions, back into a single consistent, high-quality dynamic 3DGS, improving both novel-view synthesis and the underlying 3D motion. Our method sets a new state of the art in 4D reconstruction and seamlessly generalizes to in-the-wild videos with large viewpoint changes and dynamic motions.

physics.flu-dyn 2026-07-02

Particles map hidden heat transport paths in unsteady flows

by Besm Osman, Andrei Jalba +2 more

Visualizing Lagrangian Heat Transport Paths and Density Structures in Unsteady Heat Transfer

Reparameterized spacetime advection lets massless particles trace coherent routes and attracting structures missed by temperature plots.

Convective heat transfer is traditionally visualized from a Eulerian perspective using scalar temperature fields, offering limited insight into the underlying transport mechanisms. A Lagrangian view, analogous to mass transport along fluid paths, can reveal coherent structures and transport routes invisible from a Eulerian view of temperature. However, heat transport is aperiodic and non-conservative, hampering the application of fluid mixing and transport visualization techniques, developed primarily for time-periodic, conservative transport. We present a particle-based visualization technique that addresses these challenges by advecting massless particles along a time-reparameterized spacetime formulation of thermal transport, accumulating path contributions to reveal coherent transport routes and finite-time attracting and repelling structures that conventional methods cannot show.

cs.CV 2026-07-02

Per-object heatmaps halve trajectory errors in multi-object video

by Omer Sela, Inbar Huberman-Spiegelglas +3 more

TrajLoc: Trajectory-Attention Localization for Multi-Object Motion Control

Substituting cross-attention weights with Gaussians improves PSNR by 4.3 dB and scales to 20 objects on two large models.

Controlling the motion of multiple objects in image-to-video (I2V) generation requires preserving object identities while enforcing adherence to distinct target trajectories. This becomes particularly challenging as the number of objects increases and their paths intersect or occlude one another. Existing approaches entangle multiple trajectories within a shared, dense conditioning signal, making object-level correspondence difficult to preserve in crowded scenes. We depart from this paradigm and enforce a strict, per-object spatial constraint that isolates instances independently. Our method, TrajLoc, achieves this directly within the attention layers by substituting the cross-attention weights of each object token with a Gaussian heatmap centered on its target location at every frame. The same per-object token interface carries trajectory and depth through a learned embedding and preserves identity by encoding first-frame appearance in place of an object token. Evaluations across six datasets, featuring up to 20 simultaneously controlled objects and out-of-distribution real-world scenes, demonstrate that our method consistently improves both visual fidelity and trajectory adherence. Applied to two architecturally distinct backbones (CogVideoX 5B and Wan 2.1 14B), our approach achieves average gains of +4.3 dB PSNR and a 51% reduction in trajectory end-point error compared to the strongest baselines. Project page: https://sela-omer.github.io/traj-loc/
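To make the heatmap substitution concrete, the sketch below builds a normalized Gaussian over a spatial grid and writes it into one object token's column of a cross-attention matrix. This is only an illustration under stated assumptions: the names `gaussian_attention_map` and `substitute_object_attention`, the `(num_spatial, num_tokens)` layout, the `sigma` value, and the sum-to-one normalization are all hypothetical, and the abstract does not specify how the real method normalizes or blends the weights.

```python
import numpy as np

def gaussian_attention_map(center_xy, grid_hw, sigma=1.5):
    """Normalized 2D Gaussian over an (h, w) grid, centered at (x, y) in grid units."""
    h, w = grid_hw
    ys, xs = np.mgrid[0:h, 0:w]
    cx, cy = center_xy
    d2 = (xs - cx) ** 2 + (ys - cy) ** 2
    heat = np.exp(-d2 / (2.0 * sigma ** 2))
    return heat / heat.sum()

def substitute_object_attention(attn, object_token, center_xy, grid_hw, sigma=1.5):
    """Overwrite one token's column in a (num_spatial, num_tokens) attention matrix.

    Assumed layout: rows are flattened spatial positions of a single frame,
    columns are conditioning tokens. All other columns are left untouched.
    """
    out = np.array(attn, dtype=float, copy=True)
    out[:, object_token] = gaussian_attention_map(center_xy, grid_hw, sigma).reshape(-1)
    return out
```

Calling `substitute_object_attention` once per frame with that frame's target location would give each object its own spatially localized column, which is the per-object isolation the abstract describes.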

cs.GR 2026-07-02

Polynomial optimization produces realistic snake motion

by Utpal Khanal, Avirup Mandal

Geometric Shape Optimization for Limbless Locomotion

Fourier-Chebyshev curves with bending and torsional energies yield more physically plausible limbless locomotion than prior methods.

The simulation of locomotion in limbless, deformable organisms remains a challenging problem across computer graphics, soft robotics, and computational modeling. In this work, we present a novel differential-geometric framework for modeling the motion of slender soft bodies, such as snakes. The body is represented as a three-dimensional parametric curve using a Fourier-Chebyshev polynomial basis. Motion is computed by solving an optimization problem that determines the interaction between the curve and its environment by estimating polynomial coefficients. To ensure physically plausible and non-self-intersecting behavior, bending and torsional energy terms are incorporated into the formulation. The resulting curve is then used to drive a surface representation via interpolation, enabling realistic visualization analogous to skinning techniques. We evaluate the proposed approach across a range of complex scenarios and parameter settings to demonstrate its robustness and versatility. Comparative analysis with state-of-the-art methods indicates that our approach achieves improved simulation quality and generates more physically realistic motion.

cs.HC 2026-07-01

Survey: AI visualization edits rated less acceptable than human ones

by Kalina Borkiewicz, Jixian Li +2 more

May (A)I Beautify Your Visualization? Expert Judgments of Acceptable Aesthetic Alterations

Acceptability hinges on transformation meaning across levels, with stable ordering but consistently lower ratings for AI authorship.

In 3D visualizations of natural phenomena, improving aesthetics can provide measurable benefits, but often involves transformations that affect how the data is perceived. As a growing range of tools, including AI-based methods, make visual design and modification more accessible, it is increasingly important to understand the trade-offs and concerns involved in making these changes. We conducted an expert survey (N=95) with visualization researchers, practitioners, and domain scientists, investigating reactions to fifteen alterations spanning presentation-level adjustments (e.g., lighting, camera position) and data-level modifications (e.g., removing errors, filling gaps), applied by both humans and AI systems. Results show that differences in perceived acceptability are driven by the transformation's meaning, regardless of whether it operates at the presentation or data level. Additionally, certain modifications were consistently judged as more permissible than others regardless of human or AI authorship. While this relative ordering remains largely stable, AI-generated transformations are consistently rated as less acceptable than identical human-produced changes. These results reveal a distinction between more permissible and more sensitive alterations, and suggest the need for both designers and AI-assisted visualization tools to incorporate constraints and guardrails that reflect these differences.

cs.GR 2026-07-01

Proxy labels let distillation ignore collapsed federated teachers

by Aizierjiang Aiersilan

Benchmarking Federated Learning and Knowledge Distillation for Point Cloud Classification

An 8.5 percent teacher still produces 92.9 percent student accuracy when hard-label terms reuse the original private labels.

Deploying 3D point cloud analysis in privacy-sensitive, resource-constrained settings faces two barriers: data cannot be centralized, and models must run on limited edge hardware. We present a multi-seed benchmark jointly evaluating federated learning (FL) and knowledge distillation (KD) for 3D point cloud classification. It spans 13 FL algorithms and 10 KD objectives (a 130-pair cross-product) across 504 training runs, evaluated on ModelNet40 and a clinical craniosynostosis dataset. We report three findings. First, under extreme non-IID label skew, standalone FL degrades sharply: on ModelNet40, the strongest method reaches 76.32% against a 92.26% centralized reference; on clinical data, the best reaches 75.83% against 100%. Second, distillation successfully compresses the teacher into a student 74.51% smaller and roughly twice as fast at inference, often matching or surpassing the teacher. Third, the combined pipeline exposes an evaluation pitfall: when distillation keeps a hard-label cross-entropy term on a labeled proxy split, a collapsed federated teacher (8.50%) paired with Logit-MSE still yields a 92.94% student. This 84.4-point gap reflects the proxy labels rather than the federated model, reusing the very labels whose privacy motivated federation. Objectives without hard labels instead track teacher quality ($r \approx 0.99$) and collapse when the teacher does. We therefore recommend evaluating FL-KD pipelines with label-free distillation so reported accuracy reflects the federated teacher, not the proxy.
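The pitfall described here (92.94% student accuracy from an 8.50% teacher, a gap of 92.94 − 8.50 = 84.44 points) comes from the hard-label term in the distillation objective. The sketch below is a minimal NumPy illustration of a Logit-MSE loss with an optional hard-label cross-entropy term; the function name `kd_loss`, the `hard_weight` parameter, and the simple additive weighting are assumptions for this example, not the benchmark's actual code.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable row-wise log-softmax."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def kd_loss(student_logits, teacher_logits, labels=None, hard_weight=0.0):
    """Logit-MSE distillation loss, optionally plus hard-label cross-entropy.

    With labels=None or hard_weight=0 the loss is label-free and depends only
    on the teacher. Otherwise the proxy labels supervise the student directly,
    which is the channel through which a collapsed teacher can be bypassed.
    """
    student_logits = np.asarray(student_logits, dtype=float)
    teacher_logits = np.asarray(teacher_logits, dtype=float)
    mse = np.mean((student_logits - teacher_logits) ** 2)
    if labels is None or hard_weight == 0.0:
        return mse
    labels = np.asarray(labels)
    ce = -np.mean(log_softmax(student_logits)[np.arange(len(labels)), labels])
    return mse + hard_weight * ce
```

In the label-free setting a student can only minimize the loss by matching the teacher, so its accuracy tracks teacher quality; once `hard_weight` is positive, the labels alone can carry the student.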

cs.GR 2026-07-01

Gaussian sampling makes NURBS rendering differentiable and stable

by Jingye Qiu, Shizhe Zhou

NURBS Splatting: A Unified Differentiable Rendering Framework for Vector Graphics

The method supports rational weights, non-uniform knots and long splines by turning curves into continuous Gaussian fields.

Differentiable rendering of planar rational splines remains largely underexplored, despite their widespread use in vector graphics and design. Existing differentiable vector renderers primarily focus on Bézier curves and rely on analytic rasterization, which can suffer from gradient instability and limited flexibility. We propose NURBS Splatting, a unified framework that represents planar rational curves as continuous Gaussian fields. By sampling Gaussians along the curve parameter domain and inside closed regions, rendering is reformulated as a smooth accumulation process with stable gradients. Our method naturally supports long splines, rational weights, non-uniform knots, and closed-region filling. We demonstrate its effectiveness in calligraphy reconstruction, vectorization frameworks, and long-spline image abstraction, showing improved stability and reconstruction quality over existing approaches.

cs.GR 2026-07-01

Pipeline produces all-angle 3D models of mounted butterflies

by Kristof Overdulve, Lode Jorissen +1 more

Practical High-Fidelity Novel-View Synthesis of Mounted Lepidoptera

Handheld stacking plus non-contact mirror and mirror-aware splatting overcome depth-of-field and access limits for fragile specimens.

Mounted butterflies are among the most striking objects in natural history collections. However, their beauty is notoriously hard to digitize in 3D: they are small and fragile, with microscopic hairs and vein structures. Capturing them in sufficient detail, therefore, requires a macro lens, which has a very limited Depth of Field (DoF). Moreover, a camera body cannot be maneuvered beneath a pinned specimen to photograph its ventral surface (the underside of the wings). We introduce an end-to-end pipeline that resolves these challenges to turn such specimens into photo-realistic 3D models viewable from every direction. It combines three ingredients: handheld focus stacking for all-in-focus macro capture without a tripod, a non-contact first-surface mirror system that exposes the ventral surface without touching the specimen, and a segmentation-free, mirror-aware 3D Gaussian Splatting extension. We validate the reconstructions on four diverse specimens.

cs.GR 2026-07-01

Separate Gaussian sets split albedo from shading for 3D edits

by Alexandre Lanvin, Jeffrey Hu +3 more

Intrinsic decomposition and editing of 3D Gaussian splats

Decomposing splats into independent primitives lets one-image texture changes render with preserved lighting from any view.

Intrinsic decomposition, which expresses image colors as the product of diffuse albedo and shading, possibly augmented with view-dependent residuals, has a long history in image editing because it enables the modification of object colors and textures without altering lighting. We extend intrinsic decomposition to radiance fields represented with Gaussian splatting by proposing solutions to three key aspects of such decomposition. First, we describe how to model the intrinsic decomposition as independent sets of Gaussian primitives, which allows each set to adapt to the characteristics of the layer it represents. Second, we present an optimization procedure guided by data-driven predictions to disentangle multi-view photographs of a scene into the aforementioned intrinsic sets. Finally, we provide an editing workflow where users modify the texture of planar surfaces simply by modifying the albedo of that surface in one image. Capturing this edit within the intrinsic radiance field allows re-rendering of the edited scene with plausible lighting under arbitrary viewpoints.
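The image-space model stated in the first sentence (color as albedo times shading, plus an optional view-dependent residual) can be sketched in a few lines. This is a 2D illustration of that formula only; the function name `compose_intrinsic` and the array shapes are assumptions, and it does not represent the paper's Gaussian-splat formulation.

```python
import numpy as np

def compose_intrinsic(albedo, shading, residual=None):
    """Recombine intrinsic layers: image = albedo * shading (+ residual).

    albedo:   (H, W, 3) diffuse reflectance
    shading:  (H, W, 1) or (H, W, 3) illumination layer
    residual: optional (H, W, 3) view-dependent term (e.g. specular highlights)
    """
    image = np.asarray(albedo, dtype=float) * np.asarray(shading, dtype=float)
    if residual is not None:
        image = image + np.asarray(residual, dtype=float)
    return image

def edit_albedo(albedo, mask, new_color):
    """Return a copy of `albedo` with the masked pixels set to `new_color`."""
    edited = np.array(albedo, dtype=float, copy=True)
    edited[mask] = new_color
    return edited
```

Because only the albedo layer changes, recomposing with the original shading keeps shadows and light falloff in place, which is why such edits do not alter the lighting.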

cs.GR 2026-07-01

Dual fields encode B-rep for joint geometry and topology sampling

by Yilin Liu, Pradeep Jayaraman +3 more

DualBrep: A Dual-Field Continuous Representation for B-rep Modelling

Compressing SDF and UDF into one latent allows flow matching to generate models then rebuild explicit CAD from the continuous fields.

Boundary Representation (B-rep) is the most commonly used data format in Computer-Aided Design (CAD) due to its analytical precision and direct support for parametric editing. However, its heterogeneous structure--continuous parametric geometry combined with discrete topological graphs--poses fundamental challenges for deep learning. Existing methods often predict the heterogeneous B-rep graph directly, using fixed-size padding or sequential tokenization to handle varying primitive counts. These approaches struggle with the combinatorial complexity of CAD models. Furthermore, the discrete, non-differentiable nature of graph data prevents end-to-end optimization of geometry and watertightness. In this work, we introduce DualBrep, a novel continuous representation that unifies B-rep geometry and topology within a fully structured Euclidean domain. DualBrep encodes a CAD model using dual scalar fields: a Signed Distance Function (SDF) representing global shape geometry, and an Unsigned Distance Field (UDF) implicitly encoding topological structure via a Voronoi partitioning of surface elements. Rather than processing these fields independently, we compress them into a single latent space. While the dual-field formulation alone provides a flexible, primitive-free segmentation signal that adapts to arbitrary face counts and surface types, the shared latent makes generation tractable. A Flow Matching model can sample geometry and topology jointly from a single code, avoiding the error accumulation that plagues sequential B-rep predictors. Finally, a neural rebuilder extracts explicit B-rep models--comprising both prismatic and free-form primitives--directly from our continuous dual fields. We demonstrate that DualBrep is a robust backbone for CAD learning, achieving strong performance in point cloud reverse engineering and generative modeling via latent flow matching.

cs.GR 2026-07-01

Personality prompts create distinct behaviors in LLM evacuation agents

by Stefano Calzolari, Rubens Montanha +6 more

LLM-Driven Personalities for Decision Making in Emergency Simulations

OCEAN traits in language prompts lead to varied agent choices, pointing to a flexible method for realistic virtual crowds.

For virtual humans to appear believable, they must exhibit agency and spatial awareness while interacting with their environment in ways that reflect competence and intelligence. At the core of these capabilities lies effective decision-making, which strongly shapes agent behavior. With the rapid advancement of artificial intelligence, Large Language Models (LLMs) have increasingly been explored as a mechanism to support such decision-making processes. In this work, we investigate the use of LLMs to drive decision-making in virtual humans within a simulated evacuation scenario, incorporating OCEAN personality traits into agent representations. Our goal is to evaluate how personality, expressed through language-based prompts, influences both individual behaviors and collective simulation outcomes. Our results demonstrate that LLM-driven personality profiles significantly impact agents' decisions, leading to distinct behavioral patterns across different traits. These findings suggest that heterogeneous crowds composed of LLM-guided agents can enhance the realism and variability of simulated environments, offering a flexible alternative to traditional rule-based approaches.

cs.MM 2026-06-30

AI reconstructs Vertigo from 2.78% of its frames

by Adam Cole, Mick Grierson

Vertigo Vertigo: Reconstructing a Cinematic Ideal through its Predictive AI Double

73.1% of generated frames read as plausible, implying cinematic conventions sit inside diffusion priors.

Vertigo Vertigo is a scene-for-scene AI reconstruction of Hitchcock's Vertigo (1958), generated from only 2.78% of the original film's frames. Using this sparse set of keyframe anchors, we perform first-last frame interpolation via a large video diffusion model to predict the intervening sequences. Vertigo is itself a film about the obsessive reconstruction of an artificial ideal; Vertigo Vertigo extends this logic to the material of the film, treating the canonical text as a probe for the normative conventions of classical cinema encoded within generative systems. Evaluated through computational analysis and critical feedback from media theorists (Lev Manovich, Shane Denson, Kevin L. Ferguson), the artifact demonstrates remarkable structural fidelity: 73.1% of frames are recognizable as plausible renditions of Vertigo and only 3.6% fail catastrophically. This fidelity suggests that cinematic norms are deeply compressed within the model's latent priors. Aesthetically, the reconstruction is rendered as an unstable overlay between the original film and its predictive shadow, fueling a persistent doubt in the viewer's perception of authenticity -- a 21st-century vertigo. The work argues that generative media is not a paradigm shift from cinema but an acceleration of its logic of desire and false authenticity, extending from classical Hollywood through to the predictive media environments now reshaping contemporary perception.

cs.GR 2026-06-30

Ray tracing 3D Gaussians nears splatting speed

by Yohan Poirier-Ginter, Jean-François Lalonde +1 more

GRay: Ray Tracing 3D Gaussians Near the Speed of Splats

Dense tiny primitives slow rasterization but speed up ray tracing, yielding nearly 4x faster renders than prior ray tracers.

3D Gaussian Splatting (3DGS) is a popular representation for radiance field reconstruction, distinguished by the rendering speed of its rasterization-based renderer. While 3D Gaussians can also be ray traced, this approach has so far been slower, with 3D Gaussian Ray Tracing (3DGRT) taking nearly one order of magnitude longer to optimize. To address this, we present GRay, a fast ray tracer for 3D Gaussians designed to close this performance gap and match 3DGS's speed. Our method leverages the algorithmic difference between both approaches: unlike rasterization, ray tracing evaluates only Gaussians that are actually intersected by a ray, leading to potentially logarithmic--rather than linear--scaling in the number of primitives. This property allows ray tracing to better exploit dense scenes composed of numerous tiny Gaussians, a configuration which has largely been overlooked. Notably, we show that dense initialization--which creates many small Gaussians--slows down rasterization, but instead speeds up ray tracing. Designed to leverage this effect, GRay renders nearly 4x faster and optimizes nearly 10x faster than 3DGRT while maintaining similar quality, and has competitive speed with 3DGS albeit at somewhat lower quality. Code is available at https://repo-sam.inria.fr/nerphys/gray.

cs.GR 2026-06-30

Diffuse Gaussians plus path tracing yield editable reflections

by Yohan Poirier-Ginter, Jeffrey Hu +2 more

Editable Physically-based Reflections in Raytraced Gaussian Radiance Fields

Optimizing a specular-free scene model lets path-traced multi-bounce reflections be edited consistently in real time.

Radiance fields such as 3D Gaussian Splatting allow real-time rendering of scenes captured from photos. They also reconstruct most specular reflections with high visual quality, but typically model them with "fake" reflected geometry, using primitives behind the reflector. Our goal is to correctly reconstruct the reflector and the reflected objects so as to make specular reflections editable. We present a proof of concept which exploits promising learning-based methods to extract diffuse and specular buffers from photos, as well as geometry and BRDF buffers. Our method builds on three key components. First, by using diffuse and specular buffers of input training views, we optimize a diffuse version of the scene and use path tracing to efficiently generate physically based specular reflections. Second, we present a specialized training method that allows this process to converge. Finally, we present a fast ray tracing algorithm for 3D Gaussian primitives that enables efficient multi-bounce reflections. Our method reconstructs reflectors and reflected objects, including those not seen in the input images, in a single scene representation. Our solution allows real-time, consistent editing of captured scenes with specular reflections, including multi-bounce effects, changing roughness, and more. We mainly show results using ground truth buffers from synthetic scenes, and also preliminary results in real scenes with currently imperfect learning-based buffers. Code and data are available at: https://repo-sam.inria.fr/nerphys/editable-gaussian-reflections/

cs.RO 2026-06-30

Synthetic scene data trains humanoid loco-manipulation policies

by Yen-Jen Wang, Jiaman Li +10 more

VLK: Learning Humanoid Loco-Manipulation from Synthetic Interactions in Reconstructed Scenes

48,000 generated trajectories from reconstructed rooms enable navigation and object transport on a physical Unitree G1.

Perception-based humanoid loco-manipulation requires connecting egocentric observations and task instructions to whole-body motion. Learning this mapping requires synchronized egocentric images, language commands, and robot-compatible kinematic trajectories, yet no existing data source provides this complete tuple at scale. We address this bottleneck by generating vision-language-kinematics (VLK) supervision synthetically in reconstructed scenes. Our pipeline leverages 3D Gaussian Splatting to reconstruct metric-scale indoor environments, synthesizes navigation and object-interaction trajectories using privileged scene information, and renders paired egocentric observations after the fact. We produce 48,000 paired trajectories with no human intervention and train a VLK policy that predicts short-horizon whole-body kinematic trajectories. A whole-body tracker converts these predictions into actions on the physical humanoid. We evaluate on the physical Unitree G1 performing navigation and single-object transport, demonstrating that synthesized interactions in reconstructed scenes provide effective supervision for sim-to-real perception-based humanoid loco-manipulation. Project Website: https://vision-language-kinematics.github.io/

cs.GR 2026-06-30

Physics biases plus object tokens scale neural global illumination

by Huangsheng Du, Haoran Zhu +3 more

RenderFormer++: Scalable and Physically Grounded Feed-Forward Neural Rendering

By guiding attention with transport rules and collapsing triangles to objects, feed-forward rendering handles large scenes with better consi

We present RenderFormer++, a scalable and physically grounded feed-forward neural rendering framework for global illumination in mesh scenes. Existing Transformer-based neural rendering methods such as RenderFormer achieve promising cross-scene generalization, but suffer from limited physical consistency and poor scalability due to the quadratic attention complexity of triangle-level tokenization. To address these issues, we introduce Physics-Informed Transport Guidance (PITG), which embeds rendering-equation inductive biases into the attention mechanism and enforces a transport consistency loss, enabling physically consistent light transport modeling. We further propose Hierarchical Object-Centric Tokenization (HOCT), which aggregates triangle-level features into compact object-level tokens via cross-attention with learnable queries, substantially reducing computational and memory costs while preserving geometric and radiometric information. Extensive experiments demonstrate that RenderFormer++ achieves scalable, stable, and generalizable feed-forward global illumination rendering across complex large-scale scenes with improved physical accuracy and efficiency over prior neural rendering methods.
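
The aggregation of triangle-level features into object-level tokens can be illustrated with a minimal single-head cross-attention pooling sketch. This is not the authors' HOCT implementation: the function name `pool_triangles_to_object_tokens`, the omission of learned projection matrices, and the token sizes are assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pool_triangles_to_object_tokens(tri_feats, queries):
    """Hypothetical helper: K query vectors attend over N triangle features.

    tri_feats: (N, d) per-triangle features of one object.
    queries:   (K, d) query vectors (learnable in a real model, fixed here).
    Returns (K, d) object-level tokens and the (K, N) attention weights.
    """
    d = tri_feats.shape[1]
    scores = queries @ tri_feats.T / np.sqrt(d)   # scaled dot-product scores
    weights = softmax(scores, axis=1)             # each query sums to 1 over triangles
    return weights @ tri_feats, weights
```

With K much smaller than N, later attention layers operate on K tokens per object instead of N triangles, which is where the reduction in the quadratic attention cost would come from.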

cs.CV 2026-06-30

Cultural embeddings raise gesture quality without speaker identity

by Ariel Gjaci, Antonio Sgorbissa +1 more

SICAGE: Speaker-Independent Culture-Aware Gesture Generation using TED4C-L Dataset

Domain-generalization losses isolate culture from individual style, improving realism and consistency on a new four-group TED dataset.

Recent co-speech gesture generation methods often overlook cultural differences, limiting their effectiveness in human-agent interaction. Moreover, culture-conditioned models are rarely evaluated under speaker-disjoint splits, so apparent "cultural" behavior may be confounded with speaker-specific gesturing style. We introduce SICAGE, a modular framework for culture-aware co-speech gesture generation that conditions motion synthesis models on speaker-independent cultural representations. SICAGE learns these representations from audio and text by treating each speaker as a separate domain while imposing invariance across speakers. This encourages representations to remain culture-discriminative while reducing dependence on speaker identity. The resulting cultural embeddings condition a multimodal generator to produce culturally appropriate gestures. We instantiate this idea with two domain generalization approaches: adversarial learning and Fishr regularization. We further introduce ALaDiT, a real-time diffusion-based gesture generator designed to efficiently incorporate the learned cultural embeddings. To validate our method, we built TED4C-L, a 106-hour multimodal dataset of 764 TED speakers from four cultural groups. Experiments show that SICAGE improves motion realism, diversity, beat synchronization, semantic relevance, and cultural consistency.

cs.GR 2026-06-30

Quantum collisions model evolving material scattering

by João S. Ferreira, Spencer S. Topel +2 more

Rendering Coherent Scattering via Quantum Collision Models

Symmetry-constrained unitary operators pre-computed on quantum hardware yield BSDFs that include chaotic interference inside classical ray t

Traditional light rendering techniques treat the optical properties of materials as static, yet this assumption breaks down in cases where these properties dynamically evolve in response to incident illumination. We present a novel shading framework that combines classical ray-tracing with a quantum collision model to explore the effect of coherent light-matter interactions in rendering. By treating incident light and material excitations as quantized modes, we model sub-surface scattering as a sequence of symmetry-constrained unitary collisions. This formulation allows for the incorporation of non-integrable dynamics and chaotic optical responses due to multi-layer interference effects. We demonstrate how these collision operators can be pre-computed using near-term quantum computers to generate standard BSDFs, enabling the rendering of new physics-inspired materials with distinct optical signatures.

cs.RO 2026-06-30

Gradient projection keeps robotic 3D prints collision-free at 10 μm accuracy

by Zhikai Shen, Jiasheng Qu +5 more

Trajectory Optimization for Collision-Aware Redundant Robotic Multi-Axis Additive Manufacturing by Constrained Gradient Projection

Method for 8-DOF systems cuts peak jerk 77 percent and runs 10 times faster than SQP on long support-free paths.

Redundant robotic multi-axis additive manufacturing (MAAM) enables support-free and conformal fabrication, but trajectory optimization for long-horizon paths remains challenging under strict deposition-position constraints and time-varying collision constraints. This work proposes a computational framework for collision-aware trajectory optimization in redundant robotic MAAM. We first formulate nozzle-workpiece relative kinematics using a relative Jacobian, and develop a differentiable SDF-based collision model that captures fabrication-induced geometry evolution and provides optimization gradients. The deposition position is then enforced as a hard waypoint-wise equality constraint through iterative projection onto the self-motion manifold, with the loss gradient restricted to the corresponding tangent space. Experiments on an 8-DOF robotic MAAM platform with diverse long-horizon support-free and conformal toolpaths show that our method maintains a mean nozzle-position error below 10 μm, reduces maximum joint jerk by up to 77.6%, and eliminates all sampled collision and orientation violations. Compared with the SQP-based baseline, it achieves up to a 10.2x speedup and improved convergence. Physical fabrication experiments further verify that the resulting smooth, collision-free trajectories enable successful printing of complex geometries with fewer visible deposition artifacts.
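
Restricting a loss gradient to the tangent space of the self-motion manifold can be sketched with the standard null-space projector of a task Jacobian. This is a first-order illustration under stated assumptions, not the paper's algorithm: `tangent_projector`, `projected_step`, the step size `lr`, and the random 3x8 Jacobian are hypothetical, and the iterative projection back onto the manifold that the abstract mentions is not shown.

```python
import numpy as np

def tangent_projector(J):
    # N = I - pinv(J) @ J projects joint-space vectors onto the null space of J,
    # i.e. the directions that leave the task position unchanged to first order.
    return np.eye(J.shape[1]) - np.linalg.pinv(J) @ J

def projected_step(q, grad, J, lr=0.1):
    # Gradient step on the loss, restricted to the tangent (null) space of J.
    return q - lr * tangent_projector(J) @ grad

# Usage: an 8-DOF configuration with a 3D position constraint (random stand-ins).
rng = np.random.default_rng(0)
J = rng.normal(size=(3, 8))      # hypothetical relative Jacobian at q
q = rng.normal(size=8)
grad = rng.normal(size=8)        # gradient of a smoothness or collision loss
q_new = projected_step(q, grad, J)
```

Because the update lies in the null space of `J`, the linearized nozzle position does not move; for finite steps a correction back onto the constraint would still be needed.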

cs.CV 2026-06-29

Feed-forward model turns unposed images into editable 3D object groups

by Mijin Yoo, In Cho +4 more

Scenes as Objects, Not Primitives: Instance-Structured 3D Tokenization from Unposed Views

Token groups learned from 2D views alone support segmentation, synthesis, and direct object manipulation without 3D labels.

A 3D scene is understood through its objects, not the primitives that compose them. Yet feed-forward reconstruction methods output dense, unstructured sets of points or Gaussians, leaving object-level structure to be recovered after the fact. We propose a feed-forward framework that decomposes a scene into instance-structured 3D token groups directly from unposed multi-view images -- compact object-centric units from which reconstruction, segmentation, and manipulation all follow. Each token group pairs an instance token capturing entity-level identity with anchor tokens that encode local geometry and appearance, which are decoded into a set of 3D Gaussians. This two-level factorization decouples object identity from local appearance, making object instances a native interface of the representation rather than a derived product. The token groups are learned through differentiable rendering with joint reconstruction and segmentation supervision, requiring no 3D annotations. Our feed-forward model surpasses per-scene optimization baselines in class-agnostic instance segmentation while remaining competitive in novel view synthesis. Beyond these metrics, the same token groups directly unlock instance-level scene editing -- removing, translating, or inserting objects by operating on their groups -- as well as efficient open-vocabulary 3D instance retrieval, where retrieval complexity scales with the number of instances rather than primitives.

cs.CV 2026-06-29

Branes with Hermite modes cut overlaps needed for continuous image zoom

by Giulio Federico, Giuseppe Amato +3 more

Resonant Brane Splatting for Arbitrary-Scale Super-Resolution

Each primitive now carries internal frequency modes so far fewer splats suffice to model edges and textures at any magnification factor.

Arbitrary-Scale Super-Resolution (ASR) reconstructs images at continuous magnification factors. Recent methods accelerate inference by replacing computationally heavy implicit neural decoders with explicit 2D Gaussian Splatting (GS). However, since standard Gaussians are smooth low-pass primitives, modeling edges and fine textures requires multiple overlapping, well-aligned splats, which creates severe bottlenecks during rasterization. To address this, we introduce Resonant Brane Splatting (RBS), a feed-forward ASR framework. RBS replaces flat Gaussians with Branes: expressive primitives that emit spatially varying colors to natively model local contrast and complex textures within a single footprint. We achieve this by augmenting the standard Gaussian envelope with internal Gaussian-Hermite modes, assigning a distinct color coefficient to each. The zero-order mode recovers standard GS, while higher-order modes capture high frequencies. We predict Brane parameters directly from low-resolution features. Because Branes provide a mathematically richer formulation than simple Gaussians, far fewer primitives need to overlap to reconstruct a given target pixel. To exploit this, we introduce an efficient fully differentiable rasterizer with a precise culling strategy based on the classical quantum turning point. This allows us to safely skip negligible regions, drastically reducing the rendering overhead. Experiments on standard ASR benchmarks show that RBS improves reconstruction quality over implicit and GS baselines, while achieving a better speed-quality trade-off than prior GS methods.
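
A one-dimensional sketch may help make the Gaussian-Hermite idea concrete. It assumes physicists' Hermite polynomials under a unit-width envelope, so that mode n behaves like a harmonic-oscillator eigenfunction whose classical turning point sits at sqrt(2n + 1). The names `hermite_mode`, `brane_1d`, and `turning_point` are hypothetical, and the paper's actual 2D parameterization and culling rule may differ.

```python
import numpy as np
from numpy.polynomial.hermite import hermval

def hermite_mode(n, x):
    # n-th mode: physicists' Hermite polynomial H_n(x) times a Gaussian envelope.
    coeffs = np.zeros(n + 1)
    coeffs[n] = 1.0
    return hermval(x, coeffs) * np.exp(-0.5 * np.asarray(x, dtype=float) ** 2)

def brane_1d(x, mode_colors):
    # Single-channel response: one color coefficient per internal mode.
    x = np.asarray(x, dtype=float)
    return sum(c * hermite_mode(n, x) for n, c in enumerate(mode_colors))

def turning_point(n):
    # Classical turning point of oscillator level n (assumed culling scale);
    # beyond it the mode decays monotonically.
    return np.sqrt(2 * n + 1)

x = np.linspace(-4.0, 4.0, 9)
plain = brane_1d(x, [0.7])               # zero-order mode only: a plain Gaussian
edgy = brane_1d(x, [0.7, 0.2, -0.05])    # higher modes add in-footprint variation
```

With only the zero-order coefficient set, the response reduces to a scaled Gaussian, which matches the abstract's statement that the zero-order mode recovers standard GS.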

cs.CV 2026-06-29

Feed-forward model learns to allocate Gaussians for any-scale upsampling

by Giulio Federico, Giuseppe Amato +3 more

Learning to Adaptively Allocate Gaussians for Arbitrary-Scale Image Super-Resolution

QuADA-GS routes computation by local complexity to reach state-of-the-art arbitrary-scale results with low latency and memory use.

In computer graphics, visual content is continuously warped, zoomed and resampled. This occurs when engines upscale frames, users zoom into 3D scenes, or foveated VR applies varying scaling. Handling these transformations requires Arbitrary-Scale Super-Resolution (ASR). Traditional models, designed for fixed scales, typically predict at a lower integer scale (e.g., x4) and rely on sub-optimal interpolation for continuous resolutions, compromising quality. Furthermore, most methods process pixels uniformly. Since fine details are sparse, this creates overhead; efficiency dictates concentrating resources only where structural complexity demands it. While implicit models and Gaussian Splatting (GS) enable continuous representation, GS is advantageous due to adaptive densification. However, transitioning GS into a feed-forward model for ASR is non-trivial. Standard GS optimization needs high-resolution gradients to drive primitive growth, which are unavailable during inference. Thus, the network must autonomously predict GS densification from low-resolution inputs. To solve this, we propose QuADA-GS. After encoding inputs into a latent space, a Neural Routing Architecture evaluates local complexity to distribute a global budget, assigning specific upsampling factors to features to avoid redundant processing. Features are dynamically densified based on these factors, forming an irregular topology decoded into 2D Gaussian primitives. To coordinate features before decoding, we introduce Hierarchical Pointer Convolution. This non-grid operator achieves O(1) neighbor lookup complexity, facilitating efficient spatial communication and bypassing dense bottlenecks. Experiments show QuADA-GS achieves state-of-the-art ASR performance, maintaining low latency and a lean memory footprint.

cs.GR 2026-06-29

Dipole overestimates thin-slab albedo by C e^{-2τ}

by Faruk Alpay, Baris Basaran

Dipole Diffusion Error in Thin Geometry: Optical Thickness Laws for Grid-Free Subsurface Scattering

Rate is material-independent; transmittance decays as e^{-τ}; path tracer confirms exponents 1.99 and 0.99

The dipole and its descendants model subsurface scattering with a radial reflectance profile fitted to a flat, semi-infinite slab. This assumption introduces a systematic geometry error on thin and curved objects. We isolate the effect by comparing the dipole with the finite-slab multipole under the same diffusion model and boundary condition. In slab geometry the diffuse-albedo error has a material-independent leading rate, $C e^{-2\tau}$ with $\tau=T/\ell_d$, while the prefactor remains material dependent; the same image series gives the transmitted flux, whose leading decay is $e^{-\tau}$. We give the closed-form albedo and transmittance, relate the exponents to killed random walks, and extend the interpretation to spatially varying media through optical distance. A brute-force volumetric path tracer fits a reflectance-deficit rate of 1.99 and a transmittance rate of 0.99, matching the round-trip and single-pass predictions. The resulting thickness predictor is a useful thin-feature heuristic, but stress tests show that curvature and illumination can dominate away from the slab setting. For the remaining geometry-dependent term we solve the screened-Poisson diffusion problem directly inside the signed-distance domain with Walk on Spheres, without an interior mesh or a tangent half-space approximation; the estimator matches closed-form tests to 0.75%. Against a four-case path-traced benchmark it improves the back-lit, thickness-governed case but not every front-lit or curved case, showing that the method reduces geometry error within diffusion and does not replace radiative transport.
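
The two thickness laws can be checked numerically with a log-linear fit, in the spirit of the path-traced rates quoted in the abstract. The sketch below uses synthetic data generated from the stated leading-order forms; the prefactors 0.3 and 0.8, the thickness range, and the helper names are assumptions, not values from the paper.

```python
import numpy as np

def optical_thickness(T, ell_d):
    # tau = T / ell_d: slab thickness measured in diffusion lengths.
    return np.asarray(T, dtype=float) / ell_d

def fit_decay_rate(tau, values):
    # Fit values ~ C * exp(-rate * tau) by linear regression on log(values).
    slope, intercept = np.polyfit(tau, np.log(values), 1)
    return -slope, np.exp(intercept)

tau = optical_thickness(np.linspace(1.0, 5.0, 9), ell_d=1.0)
albedo_error = 0.3 * np.exp(-2.0 * tau)    # round-trip law, C e^{-2 tau}
transmittance = 0.8 * np.exp(-1.0 * tau)   # single-pass law, e^{-tau}

rate_err, C_err = fit_decay_rate(tau, albedo_error)
rate_trn, _ = fit_decay_rate(tau, transmittance)
```

On noise-free synthetic data the fit returns rates of 2 and 1; the abstract's measured 1.99 and 0.99 come from path-traced data instead.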

cs.CV 2026-06-29

Disentangled Gaussians support deformation and relighting

by Jiaxin Li, Tong Wu +3 more

DR-GS: Physically-Based Deformable and Relightable 2D Gaussians

Separating geometry, lighting and material removes baked errors and allows post-capture edits of 3D assets.

Gaussian splatting (GS) has garnered significant attention in VR/AR and digital content creation due to its explicit parameterization and efficient rendering capabilities. However, existing GS-based methods for deformable objects face two key limitations: (i) illumination is erroneously baked into textures, causing physically inconsistent responses under dynamic deformations and lighting changes; (ii) snapshot-based reconstruction restricts post-reconstruction material editing. To address these challenges, we propose Deformable and Relightable GS (DR-GS), a unified Gaussian framework that integrates physically-based inverse rendering, relighting, and deformation-aware manipulation. Through explicitly disentangling geometry, illumination, and material representations, DR-GS overcomes the limitations of static snapshots, resolving unrealistic appearance under varying conditions while enabling post-reconstruction parameter editing. Extensive experiments show that DR-GS achieves leading visual quality across static reconstruction, dynamic deformation, and relighting, reliably preserving reflections and specular highlights on glossy surfaces. It further establishes a fully decoupled geometry-illumination-material pipeline, enabling high-quality 3D asset creation and comprehensive post-editing.

cs.CV 2026-06-29

Iterative densification lifts feedforward Gaussian reconstruction

by Zetian Song, Chenming Wu +7 more

L2D2-GS: Learning to Densify for Feedforward Dynamic Gaussian Scene Reconstruction

Self-supervised rewards from global gains guide where to add primitives, yielding higher fidelity with fewer elements on PandaSet and Waymo.

High-fidelity reconstruction of dynamic urban environments is a cornerstone of autonomous driving simulation and large-scale world modeling. While 3D Gaussian Splatting (3DGS) has established a new standard for real-time rendering, its reliance on expensive per-scene optimization limits scalability. Conversely, recent feedforward methods that infer Gaussian parameters offer faster speed but face fundamental bottlenecks: they are memory-prohibitive at high resolutions and struggle to fuse dense multi-view observations consistently. This paper presents L2D2-GS, a unified framework that reformulates generalizable reconstruction not as a one-shot regression, but as a robust iterative process of optimization and densification. To resolve the ambiguity of supervision in primitive generation, we propose a self-supervised densification policy that derives explicit reward signals from global reconstruction gains to guide local densification. Furthermore, we mitigate irreversible early-stage artifacts through a geometric regularization mechanism, utilizing reparameterization to constrain the optimization manifold and prevent convergence to poor local optima. Extensive experiments on the PandaSet and Waymo datasets demonstrate that our method achieves state-of-the-art reconstruction fidelity and strong zero-shot generalization, while using fewer primitives than competing baselines.

cs.CV 2026-06-29

Pretrained transformer turns motion tokens into reusable controllers

by Yi Shi, Yifeng Jiang +2 more

GPC: Large-Scale Generative Pretraining for Transferable Motor Control

Next-token prediction on a learned vocabulary produces general-purpose physics controllers that adapt to new tasks.

Developing controllers capable of completing a wide range of tasks in a natural and life-like manner is a key challenge in enabling practical applications of physics-based character animation. In this work, we introduce Generative Pretrained Controllers (GPC), which leverage tokenization and next-token modeling to create general-purpose, reusable generative controllers from large-scale motion datasets. Our framework utilizes end-to-end reinforcement learning to jointly optimize a "motion vocabulary", modeled via Finite Scalar Quantization (FSQ), along with a corresponding control policy that can map the discrete codes to physics-based controls. After the "codebook" has been learned, the underlying structure of this large vocabulary is modeled by training a GPT-style autoregressive transformer, leading to a powerful generative controller that generates controls for a physically simulated character by performing next-token prediction. Once the generative controller has been trained, we propose a suite of adaptation techniques for finetuning the controller for new downstream tasks. Our proposed framework greatly simplifies the training process compared to previous tokenized methods, and achieves a 99.98% success rate in reproducing a vast corpus of motion clips. The generative controller exhibits a variety of natural emergent behaviors, such as responsive behaviors to perturbations and recovery behaviors after falling. This results in highly robust general purpose controllers for a variety of downstream applications.
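
Finite Scalar Quantization, the mechanism behind the "motion vocabulary", can be sketched in a few lines: each latent dimension is bounded and rounded to a small number of levels, and the tuple of rounded values acts as an implicit codebook entry. The sketch below assumes odd level counts only and omits the straight-through gradient used in training; the level configuration `[3, 5]` and the function names are illustrative, not the paper's settings.

```python
import numpy as np

def fsq_quantize(z, levels):
    # Bound each dimension with tanh, then round to one of L_i integer levels.
    levels = np.asarray(levels)
    assert np.all(levels % 2 == 1), "this sketch handles odd level counts only"
    half = (levels - 1) / 2
    return np.round(np.tanh(z) * half).astype(int)   # values in [-half, half]

def fsq_index(codes, levels):
    # Flatten per-dimension codes into one token id (mixed-radix encoding).
    levels = np.asarray(levels)
    digits = np.asarray(codes) + (levels - 1) // 2    # shift to 0 .. L_i - 1
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))
    return int(np.sum(digits * basis))

levels = [3, 5]                                      # 3 * 5 = 15 implicit codes
codes = fsq_quantize(np.array([0.4, -1.2]), levels)
token = fsq_index(codes, levels)
```

The resulting token ids are what a GPT-style model would predict autoregressively; the vocabulary size is the product of the level counts.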

cs.CV 2026-06-29

Latent propagation removes chunk seams in long video relighting

by Jing Yang, Mayoore Jaiswal +5 more

HorizonRelight: Relighting Long-horizon Videos Consistently via Diffusion Transformers

Masked self-conditioning trains the model to continue consistently from propagated target latents across arbitrary video lengths.

Diffusion-based video relighting enables controllable relighting from a single input video, but modern video diffusion backbones are trained on short clips and applied to long-horizon videos through chunked sliding-window inference, often causing temporal discontinuities at chunk boundaries. We address this by reframing long-horizon relighting as temporally conditioned latent domain translation. Our framework enforces cross-chunk continuity by propagating target-domain latents across boundaries and makes this behavior learnable using masked target-domain self-conditioning, training the model to continue from temporally masked propagated context. We further introduce warm-start prompting with a relit prompt anchor from a controllable generative model, which establishes the initial target-domain state and creates a general interface for prompt-based relighting. Experiments on in-the-wild long-horizon videos show markedly improved temporal consistency, with chunk-boundary artifacts largely reduced and unwanted appearance changes across chunks greatly suppressed.

cs.CV 2026-06-29

Area emitters correct lighting errors in relightable Gaussian scenes

by Mohamed Shawky Sabae, Philipp Langsteiner +2 more

AEGIR: Modeling Area Emitters for Indoor Inverse Rendering using Gaussian Splatting

Modeling the physical extent of local lights produces accurate shadows and attenuation for better decomposition and editing.

Inverse rendering requires separating illumination from surface materials, which is highly ambiguous due to their tight coupling in observed images. While Gaussian Splatting is efficient for novel view synthesis, existing relightable methods approximate scene lighting using discrete point lights, global environment maps, or implicit representations. By ignoring the physical spatial extent of real-world emitters, these approaches produce incorrect light attenuation and unrealistic shadows. We present AEGIR (Area Emitters for Gaussian Inverse Rendering), a framework that explicitly models local area emitters within a relightable Gaussian Splatting representation. Joint optimization of emitters, materials, and geometry is challenging due to flexible emitter parameterization, which increases both the number of parameters and the ambiguity between illumination and materials. We address this by introducing a differentiable deferred rendering pipeline that integrates multiple importance sampling with targeted regularization. As a result, AEGIR accurately simulates local light transport and achieves more consistent decomposition. Experiments show that explicit area emitters improve illumination reconstruction and enhance downstream tasks, including novel view synthesis, controlled relighting, and virtual object insertion, particularly in scenes with complex local lighting.
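
Multiple importance sampling combines estimates from two sampling strategies, for instance sampling the emitter area and sampling the BRDF, by weighting each sample. A minimal sketch of the widely used balance heuristic follows; whether AEGIR uses this heuristic or another one is not stated in the abstract, so treat the choice and the function name as assumptions.

```python
def balance_heuristic(pdf_a, pdf_b, n_a=1, n_b=1):
    """Weight for a sample drawn from strategy A when strategy B could also
    have produced it. pdf_a and pdf_b are the densities of that same sample
    under each strategy; n_a and n_b are the sample counts per strategy."""
    wa = n_a * pdf_a
    wb = n_b * pdf_b
    total = wa + wb
    return wa / total if total > 0.0 else 0.0

# A sample that emitter sampling would generate with density 0.2
# and BRDF sampling with density 0.6:
w_light = balance_heuristic(0.2, 0.6)
w_brdf = balance_heuristic(0.6, 0.2)
```

For any sample the two weights sum to one, which keeps the combined estimator unbiased while down-weighting the strategy that samples that direction poorly.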

cs.CV 2026-06-29

Nested mesh shells enable differentiable rendering with standard rasterizers

by David Charatan, Daniel Xu +3 more

Meshtryoshka: Differentiable Rendering of Real-World Scenes via Mesh Rasterization

The representation scales to real-world unbounded scenes and approaches non-mesh novel view synthesis quality.

Differentiable rendering has emerged as a powerful approach for 3D reconstruction and novel view synthesis. State-of-the-art differentiable rendering methods combine a variety of custom representations of 3D geometry and appearance with specialized renderers. However, most downstream tasks in computer graphics rely on 3D meshes. While prior work has attempted differentiable rendering with mesh representations, these approaches are limited to object-centric scenes and fail to reconstruct large-scale, unbounded scenes. In this work, we introduce Meshtryoshka, a novel mesh differentiable rendering framework that combines an off-the-shelf triangle rasterizer with a 3D representation that consists of nested mesh shells which resemble a matryoshka doll. In every forward pass, the mesh shells are extracted anew from a 3D signed distance function via iso-surface extraction, and the opacities for each vertex are computed as a function of signed distance. Each mesh shell is then rasterized independently, and the final image is created via alpha compositing. Crucially, mesh vertex positions are updated only indirectly via gradients that flow through the opacity values into the signed distance function, and hence, our method is compatible with off-the-shelf mesh renderers that need not be differentiable with respect to vertex positions. On object-centric scenes, our method performs competitively with surface-based differentiable rendering techniques. Our differentiable mesh rendering method scales to unbounded, real-world 3D scenes, where it yields high-quality novel view synthesis results approaching those of state-of-the-art, non-mesh methods. Our method suggests that it may be possible to solve the differentiable rendering problem without relying on specialized renderers, only using conventional tools from the computer graphics toolbox.
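
The final compositing step over independently rasterized shells can be sketched per pixel with standard front-to-back alpha compositing. The sketch assumes the shell samples for a pixel are already sorted front to back and that each carries an opacity derived elsewhere from the signed distance; the function name and the example values are hypothetical.

```python
import numpy as np

def composite_shells(colors, alphas):
    """Front-to-back alpha compositing for one pixel.

    colors: (S, 3) RGB of each shell sample, nearest shell first.
    alphas: (S,) opacity of each shell sample in [0, 1].
    """
    colors = np.asarray(colors, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    # Transmittance reaching shell i: product of (1 - alpha_j) over nearer shells.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    weights = transmittance * alphas
    return weights @ colors

# A half-transparent red shell in front of an opaque blue shell.
pixel = composite_shells([[1, 0, 0], [0, 0, 1]], [0.5, 1.0])
```

Since the color depends on geometry only through the opacities, gradients can reach the signed distance function without a rasterizer that differentiates with respect to vertex positions, consistent with the abstract's description.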

cs.CV 2026-06-29

Keypoint control adds style to speech-driven 3D face animation

by Arthur Josi, Emeline Got +3 more

KM-Speaker: Keypoint-Based Style Control for High-Quality Speech-Driven 3D Facial Animation and Dialogue Localization

The method separates lip motion from upper-face dynamics to deliver precise reference-based control without large datasets.

Speech-driven 3D facial animation methods face significant challenges in simultaneously achieving high-fidelity motion and precise artistic control at production quality. Existing controllable models typically learn global style control by relying on large-scale, low-quality in-the-wild datasets that compromise overall animation realism. Furthermore, these frameworks often lack the fine-grained temporal precision required for demanding tasks such as dialogue localization (e.g., dubbing), where matching specific facial expressions is as critical as lip synchronization. We present KM-Speaker (Keypoint-Matching Speaker), a novel keypoint-conditioned flow-based generative framework that provides both global style guidance and frame-level temporal control from reference performances. We propose a disentanglement strategy that separates audio-driven lip motion from keypoint-driven upper-face dynamics, together with a global style context preservation mechanism to ensure coherent full-face expressiveness. KM-Speaker advances example-based 3D facial animation by achieving high-fidelity motion and flexible controllability in a data-constrained setting, consistently outperforming state-of-the-art methods in lip-sync accuracy, style adherence, and expressive temporal control.

cs.CV 2026-06-29

Human feedback lifts monocular videos to 4D object interactions

by Jiaxin Li, Yuxiang WU +12 more

HAT-4D: Lifting Monocular Video for 4D Multi-Object Interactions via Human-Agent Collaboration

Framework pairs vision models with limited corrections to handle occlusions and produce training assets for embodied AI.

Extracting dynamic 4D object interactions from massive, in-the-wild monocular videos offers a highly efficient data collection pathway for scaling Embodied AI and training VLAs. However, existing monocular 4D reconstruction methods primarily focus on isolated objects, often failing under the severe occlusions and complex dynamics inherent in multi-object interactions. To bridge this gap, we propose HAT-4D, the first agentic framework designed to reconstruct the 3D geometry, temporal dynamics, and physical interactions of multiple objects from a single video. By integrating VLMs with a multi-level human-in-the-loop feedback mechanism, HAT-4D efficiently resolves depth ambiguities and interaction-induced occlusions during 3D generation and 4D propagation, yielding physically plausible assets without relying on expensive multicamera rigs. As a scalable data engine, HAT-4D facilitates the creation of MVOIK-4D, an open-world benchmark for monocular 4D interaction reconstruction, accompanied by a novel multi-dimensional evaluation protocol focused on physical plausibility and temporal consistency. Extensive experiments demonstrate that HAT-4D achieves SOTA performance on most evaluation metrics, while maintaining competitive semantic alignment. Ablation studies show that introducing a small amount of human feedback improves interaction reconstruction. Moreover, the data produced by HAT-4D effectively improves baseline performance when used for fine-tuning. Our data and code are available at https://lijiaxin0111.github.io/HAT4D/

cs.CV 2026-06-29

AU-guided dynamic graphs lift micro-expression recognition across datasets

by Nandani Sharma, Varun Sharma +1 more

STAG: Spatio-temporal Evolving Structural Representation of Action Units for Micro-expression Recognition

STAG selects motion frames, adapts facial connectivity by muscle activation, and fuses graph and transformer features via cross-attention fo

Micro-expression recognition is challenging due to subtle and short-lived facial muscle movements. Existing methods rely heavily on apex-onset frames, overlook fine-grained inter-frame dynamics, and separately model spatial and temporal information, limiting generalization across datasets. To address these challenges, we propose STAG, a dynamic ROI-AU-coupled spatial-temporal network that jointly models motion flow and adaptive facial connectivity. The framework extracts optical flow from discriminative frames using magnitude-based selection and temporal attention. A dual-branch architecture combines an enhanced graph attention network for structured spatial reasoning with a transformer encoder for temporal modeling. A bidirectional cross-attention module enables mutual refinement of spatial and temporal features, while AU-guided dynamic connectivity adapts facial region interactions according to muscle activation patterns. The transformer captures subtle temporal dynamics beyond apex-based approaches, improving semantic consistency and interpretability for explainable micro-expression recognition. The fused representation is optimized using focal loss and evaluated on CASME II, 4DME, DFME, NaME, SAMM, and SMIC-HS. Extensive experiments demonstrate improved robustness, generalization, interpretability, and computational efficiency, confirming the effectiveness of adaptive relational reasoning, AU-guided dynamic connectivity, and deep spatial-temporal feature fusion for accurate cross-dataset micro-expression recognition.

cs.GR 2026-06-29

Neural method extracts clean albedo textures from wild photo sets

by Guangyu Wang, Tianheng Lu +2 more

DANTE-W: Diffuse Albedo Neural Texturing in the Wild

DANTE-W fuses view priors onto a mesh to separate diffuse color from baked lighting for relighting.

Classical mesh texturing techniques blend captured multi-view images directly and therefore inevitably suffer from baked-in shading and cast shadows that compromise visual fidelity during relighting. To circumvent this issue, we present DANTE-W, a neural texturing framework that recovers high-fidelity diffuse albedo textures from unstructured image collections of large-scale, in-the-wild scenes and integrates seamlessly with traditional 3D reconstruction pipelines. Given a reconstructed mesh and its surface parameterization, our method fuses view-space generative albedo priors into a coherent texture space via an expressive neural representation, while substantially enhancing fine-grained textural details through physically principled neural rendering. To comprehensively evaluate our method, we curate a benchmark dataset featuring diverse, fine-grained textures, comprising both real-world in-the-wild scenes and synthetic objects. Extensive experiments verify the effectiveness of our approach in reconstructing accurate albedo textures and boosting relighting fidelity. Project page: dante-wild.github.io.

cs.GR 2026-06-26

Neural decoder turns RGB albedo into spectral skin scattering

by Carlos Aliaga, Adrian Jarabo

Spectral Subsurface Scattering from RGB via Biophysical Skin Inversion

Three-media mixture enables path-traced subsurface scattering from single RGB input without hand tuning.

In this paper we present a spectral optical inversion for skin for path tracing-based rendering of subsurface scattering. Skin is a complex multilayered medium, with appearance determined by the mixture of biophysical chromophores. However, current methods rely on medium homogenization, with optical parameters obtained via albedo inversion from a reflectance texture and hand-tuned scattering distance and anisotropy. This results in significant skilled manual labor for authoring and an inaccurate scattering profile for skin. To solve these problems, we generalize existing albedo inversion techniques and propose a framework that predicts full-spectral skin scattering parameters from a single RGB diffuse albedo. Our method builds upon a new mixture-of-media representation that approximates the aggregated multilayered appearance of skin by mixing the aggregated scattering of three uncorrelated media. We train a chained neural decoder that maps RGB diffuse albedo to the optical properties of the mixture of media, including anisotropy, scattering radius, and scattering albedo. Then, we show this mixture can be used in a random-walk-based path tracer with minimal modifications, by simply randomly selecting the medium to traverse.

cs.GR 2026-06-26

Continuous embedding enables parallel flow-matching mesh generation

by Chunshi Wang, Haohan Weng +10 more

PolyFlow: Continuous Topology Embedding Flow Matching for Artist-style Mesh Generation

Discrete adjacency recovers from per-vertex states via distance thresholding, letting the model beat autoregressive baselines on geometric a

Autoregressive Transformers dominate high-quality mesh generation by producing artist-worthy topologies, yet their inherent sequential decoding induces substantial computational overhead, falling orders of magnitude slower than parallel generative models. On the other hand, while continuous diffusion and flow-matching methods support efficient parallel synthesis across a variety of domains, they cannot be directly applied to meshes: mesh connectivity is inherently discrete and incompatible with standard continuous noise injection and denoising operations. To resolve this fundamental incompatibility, we introduce a compact topology embedder that projects discrete mesh vertex positions and normals into continuous per-vertex embeddings, where the original discrete adjacency information can be faithfully recovered via spacetime distance thresholding. After pretraining and freezing this embedder, any raw mesh can be fully converted into a continuous per-vertex state space unifying position, normal, and implicit topological attributes. Built upon this novel continuous mesh representation, we present PolyFlow, a Transformer-based flow-matching framework that achieves fully parallel vertex state denoising conditioned on extracted point-cloud features. During inference, our model completes generation rapidly via an ODE solver, and supports explicit, precise control over output mesh resolution by directly specifying the target vertex count. Extensive evaluations on the Toys4K benchmark demonstrate that PolyFlow surpasses state-of-the-art autoregressive baselines in both Chamfer Distance and Hausdorff Distance.
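The idea of recovering discrete adjacency from continuous per-vertex embeddings by thresholding a pairwise distance can be illustrated with a toy sketch. Note the assumptions: the abstract refers to a "spacetime distance", whereas this example substitutes plain Euclidean distance, and the function name `recover_edges`, the embeddings, and the threshold are all hypothetical.

```python
import math
from itertools import combinations


def recover_edges(embeddings, threshold):
    """Connect every pair of vertices whose embeddings are closer than threshold.

    Euclidean distance is a stand-in here; the paper's actual distance
    function is not described in the abstract.
    """
    edges = set()
    for i, j in combinations(range(len(embeddings)), 2):
        if math.dist(embeddings[i], embeddings[j]) < threshold:
            edges.add((i, j))
    return edges


# Toy embeddings: vertices 0 and 1 are close, vertex 2 is far from both.
toy_embeddings = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
edges = recover_edges(toy_embeddings, threshold=1.5)
```

The all-pairs loop is quadratic in the vertex count, which is acceptable for a sketch but would likely need a spatial index for large meshes.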

cs.CV 2026-06-26

Decoupling predicates by transformation behavior under yaw shifts produces…

by Jingjun Sun, Chaowei Wang +5 more

Not All Relations Rotate Alike: Transformation-Aware Decoupling for Viewpoint-Robust 3D Scene Graph Generation

Separate stable and directional branches improve relation predictions under viewpoint change without rotation training data.

3D Scene Graph Generation (3DSGG) represents 3D scenes as structured object-relation-object graphs, providing a compact relational abstraction for spatial understanding. In embodied intelligence settings, the same 3D scene may be observed by agents from viewpoints that differ by yaw rotations. However, current 3DSGG models often fail to produce relation predictions that follow the expected transformation behavior under such viewpoint shifts. This behavior reveals an empirical mismatch related to predicate-level transformation heterogeneity: directional predicates such as left, front, right, and behind should transform with the observation frame, whereas most contact, support, and semantic predicates such as standing on and attached to should remain stable. To reduce this mismatch, we propose Transformation-Aware Decoupling (TAD), a viewpoint-robust 3DSGG framework that decouples relation reasoning according to predicate transformation behavior and is supported by viewpoint-stable object representations. TAD decomposes relation reasoning into two parts: one learns cues that should stay stable across viewpoints, while the other learns directional cues that should change with the observation frame. The two parts are merged for standard multi-label predicate prediction. Transformation-specific descriptors and group-aware auxiliary supervision encourage the two branches to capture complementary relation cues. Extensive experiments on 3DSSG show that TAD achieves state-of-the-art robustness under yaw viewpoint changes without training-time rotation augmentation, while maintaining competitive performance under the standard benchmark. The project page is available at https://tad-predicate.github.io/.
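The distinction between directional and stable predicates can be made concrete with a small sketch of the expected label behavior under quarter-turn yaw rotations. This is only an illustration of the stated expectation, not TAD itself: the cyclic ordering, the turn direction, and the two predicate groups below are assumed conventions built from the examples named in the abstract.

```python
# Assumed cyclic order: one quarter turn maps each label to the next one.
DIRECTIONAL = ["front", "left", "behind", "right"]
# Example stable predicates taken from the abstract; the full split is not given.
STABLE = {"standing on", "attached to"}


def rotate_predicate(predicate, quarter_turns):
    """Return the label expected after yaw-rotating the observation frame.

    Directional predicates shift cyclically; every other predicate is
    treated as viewpoint-stable and returned unchanged.
    """
    if predicate not in DIRECTIONAL:
        return predicate
    idx = DIRECTIONAL.index(predicate)
    return DIRECTIONAL[(idx + quarter_turns) % len(DIRECTIONAL)]


after_one_turn = rotate_predicate("left", 1)
after_full_turn = rotate_predicate("front", 4)
unchanged = rotate_predicate("standing on", 3)
```

A consistency check of this kind could be applied to model predictions from two viewpoints of the same scene without any rotation-augmented training data.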

cs.GR 2026-06-26

Vis4GS traces 3DGS artifacts to individual Gaussian events

by Kai-Yuan Lin, Aryabima Mandala Putra +2 more

Vis4GS: A Visual Analytic Tool for 3D Gaussian Splatting Reconstruction

Four linked views connect visible failures to primitive properties, timelines, and densification history instead of relying on final images

3D Gaussian Splatting (3DGS) supports fast training and real-time rendering, but its optimization process remains difficult to interpret. Existing viewers mainly expose the final reconstructed scene and offer limited support for explaining how Gaussian properties contribute to visible artifacts or evolve during training. We present Vis4GS, a multi-view visual analytics tool for primitive-level diagnosis of 3DGS reconstruction artifacts. Built on the original 3DGS viewer and training framework, Vis4GS links rendered artifacts to Gaussian properties, View Coverage, training progress, and Gaussian genealogy through four linked views: an interactive Gaussian analysis view, a property timeline view, a Gaussian densification tree view, and a log and control panel. The system supports Gaussian selection, blur and needle-like artifact scoring, View Coverage analysis, and multiscale genealogy exploration of clone, split, prune, and clone-split events. By connecting scene-level artifacts with primitive-level evidence and optimization history, Vis4GS enables a structured workflow for diagnosing reconstruction failures beyond final-image inspection and global metrics. A user study also shows that Vis4GS provides stronger support for usability and artifact understanding than the original 3DGS viewer.

cs.GR 2026-06-26

Hypernetwork produces both latents and decoders for texture compression

by Belcour Laurent

Neural Texture Compression using Hypernetworks

Removes per-material optimization step while matching quality of existing neural compressors

Recent work on neural texture compression has demonstrated that it is possible to learn small, per-material texture representations (composed of latent textures and a small Multi-Layer Perceptron decoder) that can be decoded in real-time during shading to reproduce the input to a physically based shading model. However, existing methods require performing gradient-descent optimization per material for a given MLP and latent configuration. In this work, we train a single hypernetwork that outputs both the latent features and the MLP's weights and biases. Though the solution space is high-dimensional, this approach produces results comparable in quality to the current reference neural texture compressors. We further extend this approach to infer multiple decoders at once or even produce decoders that learn super-resolution.

cs.GR 2026-06-26

Refined 3D meshes keep visual details after texture is lost in single-color prints

by Chentao Shen, Chen Jia +4 more

Appearance-Preserving Refinement of Generated 3D Assets for Monochromatic Fabrication

GenMF turns texture cues into shading geometry while controlling stress so printed objects stay recognizable and buildable.

Recent advances in 3D mesh generation have enabled the creation of visually realistic assets. However, much of their visual fidelity is encoded in textures rather than geometry. When such assets are fabricated using monochromatic materials, texture information is largely lost, causing visually important details to disappear even when the original geometry is faithfully preserved. A key challenge is that the geometric perturbations required to recover texture-dependent appearance cues often introduce sharp local features and high-frequency surface structures, which may increase stress concentration and fabrication risk. In this paper, we present GenMF, an appearance-oriented geometry refinement framework for monochromatic fabrication. GenMF transforms texture-dependent visual cues into geometry-induced shading effects and formulates geometry refinement as a balance between appearance preservation and fabrication-oriented robustness. To discourage structurally and narrow the gap between simulation and physical manufacturing, we further introduce a differentiable stress-aware regularization based on a learned thermal-stress predictor. Experimental results demonstrate that GenMF significantly improves appearance preservation under monochromatic rendering while reducing stress concentration under a consistent thermo-mechanical simulation setting. Physical 3D printing examples further show that the refined geometries preserve more recognizable visual details while remaining suitable for fabrication. These results suggest that appearance-aware geometry refinement provides an effective bridge between generated 3D assets and fabrication-ready monochromatic objects.

cs.CV 2026-06-26

Neural model initializes material extraction from images

by Kim Youwang, Jon Hasselgren +4 more

Extracting Neural Materials from Multi-view Images

LMRM supplies base color, latents and uncertainty to guide inverse path tracing, improving decomposition over PBR baselines on synthetic and

Neural materials can represent complex specular reflections and scattering effects in a compact, universal basis. However, acquiring and authoring such materials remains challenging. We present NeuMatEx, a differentiable inverse rendering method for extracting spatially varying neural materials from images. The nonlinear structure of neural material latent spaces makes optimization with naive inverse rendering infeasible. To address this, we train a Large Material Reconstruction Model (LMRM) that directly predicts initial base color, neural material latents, and aleatoric uncertainty guides from images. This material prior provides a good initialization and better constrains our subsequent optimization using inverse path tracing. The predicted uncertainty further helps by anchoring high-confidence regions more tightly to the LMRM prediction, preventing lighting and complex specular effects from being baked into materials. Experiments on synthetic and real assets show that NeuMatEx extracts complex materials with better visual quality and material decomposition than PBR-based methods.

cs.HC 2026-06-26

Mixed-initiative system improves control in scientific visualization

by Kuangshi Ai, Patrick Phuoc Do +1 more

HiLSVA: Design and Evaluation of a Human-in-the-Loop Agentic System for Scientific Visualization

HiLSVA pairs LLM agents with human oversight to raise task completion and transparency at the cost of speed.

Large language model (LLM) agents enable natural language interaction for scientific visualization (SciVis). Still, prior systems have essentially prioritized autonomy over human analytical control, thereby limiting transparency and human oversight. We present HiLSVA, a human-in-the-loop agentic system that supports mixed-initiative SciVis workflows. HiLSVA integrates a plan-first multi-agent architecture with explicit human oversight, stepwise provenance tracking, and learn-at-test-time adaptation from user feedback. The system supports fluid handoff between humans and agents through both natural language and direct manipulation of visualizations, while sandboxed execution ensures safe, reproducible workflows. In doing so, HiLSVA reframes agentic SciVis as a collaborative process that augments, rather than replaces, human analytical reasoning. We evaluate HiLSVA through representative case studies and a controlled user study with twelve participants of varying expertise across multiple autonomy settings. Results show that mixed-initiative interaction improves task completion, user control, and workflow transparency across different levels of user expertise, while revealing a tradeoff between execution efficiency and human oversight. These findings highlight the importance of human-centered design in agentic SciVis and guide the development of future collaborative visualization systems. We encourage readers to explore our demo video, case studies, and source code at https://hilsva.github.io/.

cs.GR 2026-06-26

Light-path hierarchy updates only affected pixels first in path tracing

by Rafael Padilla, Andrew Tate +2 more

HiPR: Hierarchical Progressive Rendering for Immediate Feedback

By tracing dependencies outward from scene changes, the scheduler delivers instant visual feedback while still reaching an unbiased result.

Hierarchical Progressive Rendering (HiPR) is a dynamic render-scheduling algorithm that makes interactive path tracing finally feel real-time. While most renderers recompute the entire frame after any change to the scene, our method updates some of the pixels based on a priority order while keeping the others unchanged. Rather than relying on error-driven or temporal reuse heuristics, it amortizes rendering costs by organizing pixels into a hierarchy of light-path dependencies from changed elements outward, prioritizing by perceptual impact and delivering instant visual feedback, while eventually converging to an unbiased result.

cs.CV 2026-06-26

Semantic pipeline replaces raw image downlinks in LEO networks

by Ziyi Yang, Hao Yuan +3 more

SpaceRipple: Lightweight Semantic Delivery for Mission-Oriented LEO Earth Observation Satellite Networks

SpaceRipple compresses on sensing satellites then restores and extracts task info on edge satellites to cut bandwidth while preserving detec

Earth observation satellite networks generate massive volumes of high-resolution imagery, whereas inter-satellite and downlink resources remain limited. In many time-sensitive missions, ground users require mission-relevant semantic information rather than a full raw-image downlink. This paper proposes SpaceRipple, a lightweight framework for mission-oriented semantic delivery and on-board processing in Earth observation satellite networks. A sensing satellite performs adaptive compression and metadata generation to reduce inter-satellite traffic, while an edge computing satellite restores the received representation and extracts task-relevant semantic information. Unlike fidelity-driven image transmission, SpaceRipple coordinates compression, forwarding, restoration, and semantic inference within a collaborative pipeline, enabling semantic-oriented delivery instead of pixel-level image delivery. A compression-aware MoE enhancement module is further introduced to improve robustness under degraded visual inputs. Experimental results show that SpaceRipple achieves favorable reconstruction quality, improved semantic detection performance, and substantial bandwidth savings, demonstrating its potential for efficient and reliable Earth observation under constrained satellite-network resources.

cs.CV 2026-06-25

Residual distillation turns sparse 2D anchors into coherent 3D street scenes

by Long Cao, Zhongquan Wang +5 more

From Sparse and Imperfect 2D Anchors to Consistent 3D Gaussian Street Scenes: Support-Aware Appearance

Baking supported appearance from teacher residuals into fixed Gaussian coefficients improves alignment and suppresses noise without extra mo

Image priors can synthesize target conditions for 3D Gaussian street scenes, but independently edited views do not define a coherent 3D target. Direct fitting can propagate view-specific noise, while existing pipelines do not jointly handle imperfect sparse anchors and standard-rasterizer deployment. To address this gap, teacher-relative appearance residual distillation is introduced for appearance baking. A structured space for frequency decomposition, confidence estimation, and primitive-level lifting is formed by residuals between teacher anchors and original renders. The direct optimization signal is supplied by renderer-space matching, while primitive assignment is regularized by support-aware Gaussian-space aggregation. Supported detail is admitted and unsupported noise is suppressed through confidence-gated coarse-to-fine optimization, after which all residuals are baked into fixed-geometry spherical-harmonic coefficients. The teacher and auxiliary training modules are discarded at inference. Evaluation across Waymo street assets, Tanks and Temples scenes, and multiple target conditions shows a favorable overall balance of target alignment, content preservation, artifact suppression, and cross-view consistency over editing-based baselines. Ablations confirm the effectiveness of the main components. Code will be released at https://github.com/Cagares/Baking-for-3D-Gaussian.

cs.LG 2026-06-25

Domain-specific AI outperforms general models on scientific figures

by Davie Chen

Can AI Draw Science? A Benchmark for Evaluating Scientific Figure Generation by Text-to-Image and Multimodal Models

New benchmark reveals largest gaps in semantic correctness and convention adherence across eight figure types

Text-to-image and multimodal generative models are increasingly used to produce scientific figures such as mechanism diagrams, experimental-design schematics, conceptual frameworks, and graphical abstracts. Yet existing image-generation benchmarks (e.g., GenEval, T2I-CompBench, DPG-Bench) evaluate natural images and measure compositionality, object counting, or photorealism. None of them measure what makes a generated scientific figure usable: correct and legible text labels, faithful depiction of entities and their relations, coherent diagrammatic structure, and adherence to disciplinary drawing conventions. We introduce SciDraw-Bench, a benchmark of 32 structured scientific-figure generation tasks spanning eight figure types and ten disciplines, where each task pairs a natural-language prompt with a machine-checkable specification of required labels, relations, components, conventions, and negative constraints. We propose a four-dimensional evaluation protocol: Text Fidelity (OCR-based label recall and character error rate), Semantic Correctness (vision-language-model judging against the specification), Structural Quality, and Convention Adherence, together with a meta-evaluation protocol and a preliminary inter-judge reliability analysis (human-rating validation is ongoing). We evaluate a domain-specific system, SciDraw AI, against representative general-purpose text-to-image models, and outline a code-to-figure baseline as a planned extension. In a pilot over all eight figure types, the domain-specific system substantially outperforms the general-purpose baselines on every dimension and figure type, with the largest gaps on semantic correctness and convention adherence; text fidelity remains the hardest dimension for all systems.
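The Text Fidelity dimension is described as OCR-based label recall plus character error rate. A minimal sketch of those two quantities is given below, assuming the OCR output is already available as a list of strings; the function names and the case-insensitive exact-match rule for recall are illustrative assumptions, not the benchmark's published scoring code.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def label_recall(required, detected):
    """Fraction of required labels found in the OCR output (case-insensitive)."""
    if not required:
        return 1.0
    found = {d.strip().lower() for d in detected}
    hits = sum(1 for r in required if r.strip().lower() in found)
    return hits / len(required)


recall = label_recall(["Nucleus", "Ribosome"], ["nucleus", "Golgi"])
cer = character_error_rate("ATP", "ATB")
```

Exact matching is strict; a real scorer might instead accept a detected label when its character error rate against the required label falls under some tolerance.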

cs.CV 2026-06-25

Path-traced stereo pairs hide variance correlation aligned to disparity

by Po-Ting Lin

Cross-View Variance Correlation in Path-Traced Stereo: A Hidden Shortcut in Synthetic Training Data

The alignment persists across sample counts and supplies an unintended matching cue for networks trained on rendered data.

Path-traced synthetic stereo data underlie a large fraction of modern disparity-estimation training pipelines. We report a previously unrecognised property of such data: while the Monte Carlo (MC) noise streams of the two cameras are statistically independent, the underlying variance fields (deterministic per-pixel functions of the rendering integrand) are highly correlated once aligned by the ground-truth disparity warp. Across 20 scenes rendered with Mitsuba 3 at SPP = 512, the warped Pearson correlation reaches ρ = 0.754 ± 0.016, and on a representative scene it remains essentially invariant (ρ = 0.778 ± 0.001) over a 16× range of samples per pixel. The effect is strongest in Lambertian regions (ρ ≈ 0.78) and substantially weaker in glass (ρ ≈ 0.30), as predicted by an integrand decomposition into view-independent and view-dependent components. A residual-shuffle intervention that breaks the cross-view alignment while preserving the clean image degrades the GT cost margin by 33% on non-glass and the variance-based winner-take-all accuracy on glass by 4.3×, confirming that the structure functions as a matching cue. This signal is unique to MC-rendered data and constitutes a candidate sim-to-real shortcut whose impact on trained networks remains to be quantified.
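To make the measured quantity concrete, here is a simplified one-scanline sketch of a disparity-warped Pearson correlation between two per-pixel variance fields. It assumes rectified stereo with integer disparities and the convention that left pixel x matches right pixel x - d; the function names and toy values are hypothetical, and the paper's actual estimator may differ.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant field")
    return cov / math.sqrt(var_x * var_y)


def warped_correlation(var_left, var_right, disparity):
    """Correlate left variances with right variances fetched at x - d.

    Pixels whose match falls outside the right scanline are skipped.
    """
    left_vals, right_vals = [], []
    for x, d in enumerate(disparity):
        x_right = x - d
        if 0 <= x_right < len(var_right):
            left_vals.append(var_left[x])
            right_vals.append(var_right[x_right])
    return pearson(left_vals, right_vals)


# Toy scanline: the left field is the right field shifted by 2 pixels.
var_right = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
var_left = [9.0, 9.0, 1.0, 2.0, 3.0, 4.0]
disparity = [2] * 6
rho = warped_correlation(var_left, var_right, disparity)
```

In this toy case the warped fields match exactly, so the correlation is perfect; real renders would also need sub-pixel interpolation and occlusion masking.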

cs.NE 2026-06-25

Spacing objective alone evolves flock alignment

by Craig Reynolds

EvoFlock: evolved inverse design of multi-agent motion

Genetic search finds parameters where neighbor distance maintenance produces coordinated group motion without dedicated alignment rules.

This paper describes an automatic method for adjusting or tuning models of multi-agent motion. Simulating the motion of bird flocks, human crowds, vehicle traffic, and other multi-agent systems is a widely used technique. These simulations model the behavior of a single group member (bird, human, or vehicle). The group behaviors (flock, crowd, traffic) emerge from interactions between group members. These models typically have many numerical control parameters. Even if each parameter is intuitive in isolation, their interaction can be complex and nonlinear. It is challenging to determine which parameters to adjust for the desired change in group behavior. Changing one aspect of group behavior often causes other aspects to change, leading to a tedious process of incremental changes. This work takes an inverse design approach. The desired group behavior is measured with a user-defined objective (fitness or loss) function and optimized with a genetic algorithm. The objective function used here for basic flocking rewards proper spacing with neighbors, flying near a desired speed, and avoiding obstacles. Interestingly, the vivid alignment seen in bird flocks appears to emerge from maintaining proper spacing between flockmates.
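The inverse design loop (score a parameter set with an objective, then improve it with a genetic algorithm) can be sketched generically. Everything specific below is an assumption for illustration: the truncation selection, uniform crossover, Gaussian mutation, population sizes, and especially `toy_objective`, which stands in for running a flock simulation and measuring spacing, speed, and obstacle avoidance.

```python
import random


def evolve(objective, num_params, population_size=30, generations=60, seed=0):
    """Minimize objective over parameter vectors in [0, 1] with a simple GA."""
    rng = random.Random(seed)
    population = [
        [rng.uniform(0.0, 1.0) for _ in range(num_params)]
        for _ in range(population_size)
    ]
    for _ in range(generations):
        population.sort(key=objective)
        elite = population[: population_size // 2]  # keep the better half
        children = []
        while len(elite) + len(children) < population_size:
            a, b = rng.sample(elite, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            k = rng.randrange(num_params)  # mutate one gene, clamped to [0, 1]
            child[k] = min(1.0, max(0.0, child[k] + rng.gauss(0.0, 0.1)))
            children.append(child)
        population = elite + children
    return min(population, key=objective)


def toy_objective(params):
    """Placeholder loss: squared distance to a made-up ideal parameter set."""
    target = [0.3, 0.7, 0.5]
    return sum((p - t) ** 2 for p, t in zip(params, target))


best = evolve(toy_objective, num_params=3)
```

Because the better half survives each generation unchanged, the best score never gets worse from one generation to the next.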

cs.CV 2026-06-24

Cage filtering spots zones to cut texture artifacts in 70ms

by Rose Mei Zhou, Lynnette Hui Xian Ng +3 more

Cage-based Texture Transfer with Geometric Filtering

Approach runs on mobile hardware without training or large models for real-time transfer

Real-time texture transfer expands the creative horizon for interactive applications, enabling seamless detail projection in scenarios that range from digital character cosmetics to procedural automotive texturing. Yet, its practical application is governed by inherent trade-offs between processing speed and suppression of artifacts. Low-latency transfer methods frequently fail to suppress artifacts, and robust alternatives rely on large-scale models that are costly in training and memory. Our proposed method bridges the gap between efficiency and robustness by using a cage-based geometric filtering method to identify Non-Cosmetic Zones (NCZs) for artifact suppression. While other models are resource-intensive and require multiple days of training on manually annotated datasets, we are able to successfully suppress artifacts and achieve immediate deployment on consumer-grade hardware. Our framework achieved highly efficient runtimes of ~70ms on mobile devices for a ~4.8k triangle mesh.

cs.GR 2026-06-24

Self-supervised model adds lasting wrinkles to cloth

by Xiaoyuan Yang, Deshan Gong +2 more

Self-supervised Garment Dynamics with Persistent Wrinkles

A changing loss function plus curriculum training from elastic to plastic behavior lets the simulator keep realistic folds after deformation

Self-supervised neural garment simulation has become popular due to its computational efficiency, good visual realism, and no reliance on training data. However, existing methods greatly simplify the mechanical properties of fabrics, ignoring persistent wrinkles caused by plasticity. Although this simplification allows for modeling of purely elastic material and simple training via energy minimization, the lack of believable wrinkles adversely affects the visual realism. Therefore, we introduce the first self-supervised neural garment simulator that explicitly models persistent wrinkles. This is accomplished through a novel physics-inspired loss function, which turns learning into a moving energy minimization problem to mimic plasticity. However, this requires learning to use a changing loss function, which causes difficulties in training because the loss function changes during optimization. To this end, we propose a new physics-inspired curriculum learning scheme where the target material for learning gradually changes from pure elasticity to elasto-plasticity, allowing the loss function and the learnable parameters to jointly converge. Through a comprehensive evaluation, we show that for the first time, self-supervised learning models can generate natural persistent wrinkles, outperforming existing methods on a variety of garments, body shapes, and body motions, according to a range of metrics.

cs.CV 2026-06-24

One Transformer streams full-duplex audio and video at sub-second latency

by Lianghua Huang, Zhi-Fan Wu +23 more

Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models

Interleaved multimodal tokens and block-causal attention let a single model handle perception, generation and timing together, reaching roug

We present Wan-Streamer, a native-streaming, end-to-end interactive foundation model designed from the ground up for real-time, low-latency, full-duplex audio-visual interaction. Wan-Streamer seamlessly models language, audio, and video as both input and output within a single Transformer, where the sequence is represented as interleaved visual, audio, and text input tokens together with visual, audio, and text output tokens, coordinated by block-causal attention for incremental streaming. Unlike cascaded interactive systems that rely on separate VAD, ASR, language, TTS, audio-driven animation, or video-generation modules, Wan-Streamer does not rely on external language, speech, avatar, or video-generation modules: perception, reasoning, generation, response timing, turn management, and cross-modal synchronization are learned jointly within one unified model, reducing pipeline latency and error accumulation. To support natural audio-visual responsiveness, we redesign the entire stack around streamability, including causal encoders, causal decoders, block-causal attention, and low-latency multimodal token scheduling, enabling streaming units as short as 160 ms at 25 fps. Wan-Streamer achieves approximately 200 ms model-side response latency and approximately 550 ms total interaction latency when combined with 350 ms bidirectional network latency, supporting sub-second duplex audio-visual communication. These results position Wan-Streamer as a unified, end-to-end, multimodal interactive foundation model for low-latency streaming interaction.
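Two parts of this abstract lend themselves to a quick worked sketch: the block-causal attention pattern and the latency arithmetic. The mask below uses the common definition in which a token attends to its own block and all earlier blocks; the block size and token count are arbitrary, and how Wan-Streamer actually lays out its interleaved tokens is not specified here. The numeric constants are the figures quoted in the abstract.

```python
def block_causal_mask(num_tokens, block_size):
    """mask[i][j] is True when token i may attend to token j.

    Tokens see every token in their own block and in all earlier blocks.
    """
    return [
        [(j // block_size) <= (i // block_size) for j in range(num_tokens)]
        for i in range(num_tokens)
    ]


# Figures quoted in the abstract.
UNIT_MS = 160             # shortest streaming unit
FPS = 25                  # video frame rate
MODEL_LATENCY_MS = 200    # approximate model-side response latency
NETWORK_LATENCY_MS = 350  # bidirectional network latency

frames_per_unit = UNIT_MS * FPS // 1000                    # frames in one unit
total_latency_ms = MODEL_LATENCY_MS + NETWORK_LATENCY_MS   # total interaction

mask = block_causal_mask(num_tokens=4, block_size=2)
```

A 160 ms unit at 25 fps therefore covers four video frames, and the two latency terms sum to the roughly 550 ms total stated in the abstract.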

cs.HC 2026-06-24

Opinion maps reveal broad consensus hidden in U.S

by Lisa Schirch, Beth Goldberg

Visualizing "We the People": Bridging the Perception Gap through Pluralistic Data Storytelling

AI tools turn 2,400 participants' views into interactive landscapes that show shared values instead of simple divides.

Traditional visual data storytelling relies on binary graphics that depict two simplified groups in conflict. This can increase political polarization by oversimplifying intra-group disagreements and erasing ambiguity and shared ideas or values. This can inadvertently foster "us versus them" thinking. Intentional, pluralistic design choices for AI-enabled digital platforms can produce visualizations that emphasize nuance, opinion distribution, and intergroup commonalities. To demonstrate this potential, we examine deliberative technologies that map high-dimensional opinion spaces and highlight areas of both consensus and dissensus. The paper highlights the We the People deliberation conducted by Jigsaw and the Napolitan Institute in September 2025, which engaged over 2,400 Americans across all 435 congressional districts in an AI-supported, asynchronous dialogue regarding freedom and equality. By utilizing AI to synthesize long-form, text-based participant inputs into interactive "opinion landscapes," the initiative provided an alternative format for pluralistic data storytelling that humanized diverse viewpoints and revealed hidden areas of substantial broad consensus. The paper concludes that shifting from divisive, contrast-heavy visual frameworks to distribution-focused, interactive models represents a highly scalable, low-cost intervention capable of bridging perceptual gaps and cultivating a more resilient, collaborative democratic culture.

cs.CV 2026-06-24

Single portrait yields instant drivable 3D Gaussian avatar

by Kim Youwang, Zhengyu Yang +9 more

FiCA: Feed-forward instant Gaussian Codec Avatars from a Single Portrait Image

Diffusion mapping plus feed-forward refinement produces photorealistic heads that animate in real time without per-person tuning.

We introduce FiCA, a Feed-forward, instant Gaussian Codec Avatar generation pipeline that creates lifelike avatars from a single portrait image. Generating a photorealistic and drivable avatar from just a single image is significantly challenging due to the limited visual information available to accurately infer the 3D appearance and geometry of human heads. To address this, we develop a novel system that combines human-centric vision foundation models with a diffusion model. This system is designed to fully exploit partial visual observations to generate lifelike human avatars. Our proposed diffusion model learns a generative mapping from these partial observations to complete and authentic 3D mesh reconstruction. Additionally, we introduce a feed-forward mesh refinement network that enhances the fidelity and identity preservation of the generated avatars, eliminating the need for person-specific test-time optimization. By leveraging a universal prior model that decodes a generated mesh into a set of 3D Gaussians, we generate a photorealistic 3D Gaussian avatar, capable of being driven with novel expressions in real-time. Our experiments demonstrate that the avatars generated by our feed-forward approach faithfully represent diverse identities and surpass the visual quality of avatars produced by recent competing methods.

cs.CV 2026-06-23

Token alignment turns prompt embeddings into interpolable space

by Saar Huberman, Ron Mokady +2 more

Token-to-Token Alignment of Text Embeddings for Semantic Blending

Matching corresponding concepts across prompts lets simple averaging produce coherent image transitions without model changes.

In modern generative models, images are specified and controlled through text prompts. In practice, images are generated from sequences of tokens derived from these prompts. However, the space of token sequences lacks a consistent accessible structure: semantically similar images may correspond to sequences that differ in wording, ordering, and placement of concepts, while similar token sequences may encode very different semantics. This apparent lack of structure makes it difficult to perform smooth transitions in this space, hindering applications such as image blending and continuous control of edits. We argue that this limitation stems not from the absence of semantic structure, but from misalignment between representations. To address this misalignment, we introduce Token-to-Token alignment, a framework that establishes explicit semantic correspondence between tokens across prompts. Our approach transforms prompts into a structured representation in which semantically corresponding concepts are mapped to consistent positions across prompts, and then aligns their token embeddings based on semantic similarity. Concretely, the method consists of two stages: a structural alignment that rephrases prompts into a shared structured form, followed by an embedding-level alignment that matches token representations across prompts. With this alignment in place, simple linear interpolation becomes a meaningful operation, producing smooth and coherent semantic transitions and enabling applications such as blending and continuous editing. Our results show that text embedding spaces in text-to-image models implicitly encode a continuous semantic structure that becomes accessible once representations are properly aligned, suggesting that semantic control can be achieved by organizing existing representations rather than modifying the generative model.
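
The embedding-level stage described above can be illustrated with a toy example: once the tokens of one prompt are reordered to match their most similar counterparts in the other, plain linear interpolation blends corresponding concepts. This is a minimal sketch under assumptions, not the authors' method: the greedy nearest-neighbor matching and the names `align_tokens` and `blend` are hypothetical, and the structural rephrasing stage is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def align_tokens(emb_a, emb_b):
    """For each token of prompt A, pick the most similar token of prompt B.

    Assumption: greedy matching, which does not enforce a one-to-one
    correspondence; a real system may use a stricter assignment.
    """
    aligned = []
    for u in emb_a:
        best = max(range(len(emb_b)), key=lambda j: cosine(u, emb_b[j]))
        aligned.append(emb_b[best])
    return aligned

def blend(emb_a, emb_b, t):
    """Linear interpolation between A and the aligned version of B."""
    aligned_b = align_tokens(emb_a, emb_b)
    return [[(1 - t) * x + t * y for x, y in zip(u, v)]
            for u, v in zip(emb_a, aligned_b)]

# Toy embeddings: B holds the same three concepts as A in a different order.
a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b = [[0.0, 0.9, 0.1], [0.1, 0.0, 0.9], [0.9, 0.1, 0.0]]
halfway = blend(a, b, 0.5)
```

Without the alignment step, averaging `a` and `b` position by position would mix unrelated concepts, which is the misalignment the abstract points to.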

cs.CV 2026-06-23

Text-level changes create navigable semantic image galleries

by Sara Dorfman, Maya Vishnevsky +3 more

Semantic Browsing: Controllable Diversity for Image Generation

Vision-language model workflow generates meaningful prompt variants instead of incidental image noise.

Modern text-to-image models excel in visual fidelity and prompt adherence. However, this strict adherence comes at the cost of diversity: generated samples tend to collapse into a single visual interpretation. Existing methods to improve diversity produce outputs driven by incidental variations rather than meaningful design choices. This motivates a new variant of the diversity task where structure is enforced on the generated samples. We introduce a method for controlled diversity that enables Semantic Browsing, where users can navigate structured image galleries and experience creative exploration through a systematic traversal of meaningful, interpretable axes of variation. Achieving this level of semantic control requires a deep understanding of the scene. We exploit the fact that recent text-to-image models are trained on elaborated captions, effectively decoupling semantic decision-making from pixel generation. This enables a paradigm shift: instead of relying on stochastic variation within the text-to-image model, we induce diversity directly at the text level. By leveraging rich textual representations, we allow a Vision Language Model (VLM) to operate on the full scene context. To overcome the generic outputs typical of standard VLMs, we employ an agentic workflow that explicitly enforces structured variation attuned to the original prompt. We demonstrate that our method produces diverse and navigable design spaces where every variation corresponds to a specific, user-understandable semantic decision.

cs.CV 2026-06-23

Constraint meshes guide 3D generation via token routing

by Jan-Niklas Dihlmann, Andreas Engelhardt +3 more

Arbor: Explicit Geometric Conditioning for Controllable 3D Asset Generation

A routed attachment inside a frozen denoiser enforces hull, avoidance and touch rules without extra losses or quality drop.

Text- and image-conditioned 3D models now generate convincing assets, but they still offer little direct control over the space an object should occupy or avoid. In authoring, this spatial intent is often known before generation starts. A chair should fit a seating envelope, a prop should leave clearance for motion, or a part should expose a contact surface. Prompts and image views are poor carriers for such constraints, creating the need for an explicit control interface. We present Arbor, a trainable attachment for text-conditioned latent 3D generation. Arbor introduces constraint meshes as a native 3D control interface. The interface uses hull regions where geometry should exist, avoidance regions that should remain empty, and touch regions the object should contact. Unlike completion or whole-object scaffold control, these meshes are not target evidence. They are local typed requirements and can include regions where no surface should appear. Arbor keeps this signal as geometry by converting constraint meshes into tokens and learning a routed attachment inside a frozen denoiser. Each latent region can therefore receive the part of the constraint that matters for its spatial location. We evaluate Arbor on automatic and artist-curated control benchmarks with hull, avoidance, and touch constraints, and compare the metric trends to a user preference study. Even without dedicated compliance losses, Arbor improves constraint obedience while preserving object quality and variation under fixed constraints.

cs.GR 2026-06-23

Equivariant flow matching produces meshes at 18x autoregressive speed

by Qi Sun, Kiyohiro Nakayama +7 more

MeshFlow: Mesh Generation with Equivariant Flow Matching

Direct triangle-soup generation respects face and vertex permutations and reaches comparable quality without sequential tokenization.

Meshes are among the most common 3D scene representations, but directly generating meshes is challenging because the representation contains important symmetries, including permutation invariance of faces and vertices. MeshFlow learns to generate triangle meshes directly as triangle soups, avoiding the need to serialize meshes into long autoregressive sequences. We adopt equivariant optimal-transport flow matching models that respect the key symmetries of triangle soups: arbitrary permutations of faces and permutations of the vertices within each face. Toward this goal, we propose a simple yet effective modification to the Diffusion Transformer architecture, resulting in a scalable network capable of modeling a velocity field while maintaining the desired equivariance. We further introduce an optimal-transport-based training objective that improves convergence by eliminating supervision signals that violate these symmetries. MeshFlow achieves mesh quality comparable to state-of-the-art autoregressive mesh generators while providing about an 18× speedup during inference. Project page: https://qiisun.github.io/MeshFlow/.

cs.GR 2026-06-23

Close-range cameras enable high-fidelity 4D human mesh dataset

by Giulia Martinelli, Niccolò Bisagno +3 more

VolHuMe: a High-Resolution Large Scale Dataset of Volumetric Human Meshes

VolHuMe captures 104 subjects with 64 RGB and 32 depth cameras to improve reconstruction benchmarks

We introduce VolHuMe, a dataset of high-quality 4D human scans captured with a state-of-the-art volumetric studio using 64 RGB and 32 depth cameras. VolHuMe contains individual captures of 104 subjects and provides extensive ground truth, including SMPL-X, high-resolution meshes, multi-view RGB/depth images, rigged meshes, point clouds, garment segmentation, and detailed hand and facial geometry. Unlike prior datasets that primarily rely on full-body imagery, VolHuMe uses a close-range, high-resolution capture setup that preserves fine-grained body-part details, improving geometric fidelity and texture resolution. We benchmark VolHuMe on state-of-the-art methods across 3D and 4D human reconstruction tasks, showcasing the dataset's quality and exposing the limitations of current evaluation testbeds.

cs.GR 2026-06-23

Transformed RoPE controls texture tiling precisely in diffusion models

by Junrong Huang, Zhiyuan Zhang +3 more

Controllable Texture Tiling with Transformed RoPE-Enhanced Diffusion Models

Affine transforms applied to positional embeddings set repetition frequency and angle without warping pixels or losing reference structure.

Realistic integration of user-specified textures into scene images is a fundamental task in computer graphics and image editing. While existing material transfer and reference-guided inpainting methods can edit surface appearances, they often fail to address the specific requirements of texture tiling. This task necessitates precisely repeating a reference pattern according to user-defined parameters such as frequency, orientation, and scale. Furthermore, current generative approaches often struggle to maintain the structural fidelity of the reference texture, limited by either destructive pixel-level resampling or the lack of fine-grained spatial information in semantic image encoders, and they frequently fail to preserve the coherent lighting and geometry of the original scene. In this paper, we propose a novel framework for controllable and high-fidelity texture tiling based on Diffusion Transformers. Our approach introduces two key technical innovations to decouple spatial manipulation from content generation. First, we propose a Coordinate-Transformed Rotary Embedding mechanism. By applying 2D affine transformations directly to the relative positional embeddings between the target latent and the image condition, we achieve precise control over tiling patterns without explicit pixel warping, thereby utilizing the full information of the reference condition without degradation. Second, a Disjoint Attention Mask is employed to shield reference features from semantic leakage. This preserves structural integrity while seamlessly blending the synthesized texture with the scene's original lighting and geometry. Extensive experiments demonstrate that our method outperforms state-of-the-art baselines in both control accuracy and texture fidelity.
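
The idea of steering tiling through positions rather than pixels can be sketched as follows: token coordinates are passed through a 2D affine map before the rotary angles are computed, so repetition frequency and orientation change without resampling the reference image. This is a rough illustration only. The helper names, the rotation-plus-uniform-scale parameterization, and the per-axis angle layout are assumptions; the paper's exact formulation of the Coordinate-Transformed Rotary Embedding is not given in the abstract.

```python
import math

def tiling_matrix(repeats: float, angle_rad: float):
    """2x2 affine matrix: uniform scale by `repeats`, rotation by `angle_rad`."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return ((repeats * c, -repeats * s),
            (repeats * s, repeats * c))

def affine(point, matrix, offset=(0.0, 0.0)):
    """Apply the affine map to one (x, y) token coordinate."""
    x, y = point
    (a, b), (c, d) = matrix
    return (a * x + b * y + offset[0], c * x + d * y + offset[1])

def rope_angles(point, freqs):
    """Rotary angles for a 2D position: x-axis block followed by y-axis block."""
    x, y = point
    return [x * f for f in freqs] + [y * f for f in freqs]

# Hypothetical frequencies; real models use a geometric frequency schedule.
freqs = [1.0, 0.5]
plain = rope_angles((1.0, 2.0), freqs)
tiled = rope_angles(affine((1.0, 2.0), tiling_matrix(2.0, 0.0)), freqs)
```

With `repeats = 2.0` every rotary angle doubles, so in this toy setup the positional phase advances twice as fast across the target, which is one way to read "repetition frequency" as a property of the embedding rather than of the pixels.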

cs.GR 2026-06-23

Jacobian bound during simplification yields smaller accurate base meshes

by Congyi Zhang, Nicholas Vining +6 more

DJM: Compact Base Meshes for Displacement Mapping using Triangle Jacobians

By keeping a minimum Jacobian on the displacement map at every QEM collapse, the method improves accuracy-to-size ratio over prior base-mesh

Representing complex geometry as a displacement function defined over a coarse base mesh enables compact storage and accelerated rendering. The core challenge in converting detailed triangle meshes into this representation is computing base meshes that have as few triangles as possible, while also supporting displacement functions that accurately approximate the input. Accurate approximation requires the supported displacement functions to bijectively map the input surface onto the base with low parametric distortion. We observe that this distortion can be measured by evaluating the pointwise Jacobian of the displacement functions. Our new DJM (Displacement Jacobian Metric)-based base-mesh construction method uses the Jacobian of the displacement functions to guide base mesh computation, enabling us to outperform prior approaches in terms of accuracy to size trade-off. We achieve this goal by proposing a variant of the QEM-based simplification scheme that constrains the displacement mapping between the input and the base to be bijective and low distortion (defined as satisfying a lower bound on the mapping Jacobian). When evaluating and encoding the displacement maps, we avoid unreliable ray-mesh intersections by explicitly storing the mapping between the input mesh and the base throughout the construction process, and use this mapping within a robust inverse barycentric displacement solver to obtain dense base-to-mesh correspondences to assist all computations. We demonstrate DJM to outperform alternative schemes in terms of reconstruction accuracy to size trade-off, and demonstrate its robustness and usability for micromesh-based rendering and neural encoding.

cs.GR 2026-06-22

Flying speck illuminates letters with 42-56 mm error

by Hamed Alimohammadzadeh, Shahram Ghandeharizadeh

Illuminating English Letters Using a Flying Light Speck

Human study finds presentation order significantly affects how long detection takes

This paper presents the design and implementation of a Flying Light Speck (FLS) to illuminate English letters. The FLS uses its onboard camera and computing to localize and follow a trajectory to illuminate a letter. We evaluate the illuminations quantitatively and qualitatively. The latter is based on an IRB approved human subject study with 20 participants. The obtained results show a 42 to 56 millimeter error that impacts the detection of letters. A key finding is that the order in which the illumination of letters is presented to subjects has a significant effect on detection duration.

cs.LG 2026-06-22

Concept prototypes anchor prompts to lift CLIP base-to-new scores

by Na Sang, Ding Ma +2 more

Concept-Constrained Prompt Learning for Few-Shot CLIP Adaptation

Text-space consistency with frozen prototypes yields +0.6 and +2.9 harmonic-mean gains on DTD and EuroSAT under matched splits.

Few-shot prompt learning is an effective strategy for adapting CLIP to downstream tasks, but class-only prompt optimization can overfit base-class supervision and weaken transfer to unseen classes. We propose Concept-Constrained Prompt Learning (CCPL), a lightweight regularization framework that anchors learnable class prompts to frozen concept-level text prototypes without updating CLIP encoders. CCPL learns a set of shared context tokens, instantiates class prompts by appending class names, and constructs frozen concept prototypes from a class-level concept bank. During training, a text-space cosine consistency objective aligns learnable class-prompt embeddings with frozen concept prototypes; concept dropout provides additional regularization against over-reliance on fixed concept lists. At inference, CCPL optionally fuses class-prompt logits with concept-prototype logits using a controllable ensemble weight alpha. Our default configuration uses text-space concept regularization lambda = 0.5, concept dropout p = 0.3 and weak concept-guided fusion (alpha = 0.1), with no KL-based prediction consistency term. Experiments under identical automatically-generated fallback splits show that CCPL improves the base-to-new harmonic mean on DTD (+0.6) and EuroSAT (+2.9) compared with CoOp, while remaining near-neutral on OxfordPets (-0.1). Ablations indicate that text-space concept regularization is consistently beneficial, while the best concept-guided inference strength is dataset- and protocol-sensitive. These results suggest concept constraints are most effective when concept prototypes align naturally with dataset semantics, and identify fine-grained categories as a current boundary condition. The code is released at: https://github.com/richael-sang/concept-constrained-prompt-learning.
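
The quantities named in the abstract can be written down compactly. The sketch below assumes a `1 - cosine` form for the text-space consistency term and a convex combination for the logit fusion; neither form is confirmed by the abstract, and all function names are hypothetical. The harmonic mean is the standard base-to-new metric.

```python
import math

def cosine(u, v):
    """Cosine similarity between two text embeddings."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) *
                  math.sqrt(sum(y * y for y in v)))

def consistency_loss(prompt_embs, concept_protos):
    """Assumed form: mean of (1 - cosine) between each learnable class-prompt
    embedding and its frozen concept prototype."""
    pairs = list(zip(prompt_embs, concept_protos))
    return sum(1.0 - cosine(p, c) for p, c in pairs) / len(pairs)

def fuse_logits(class_logits, concept_logits, alpha=0.1):
    """Assumed form: convex combination weighted by the ensemble weight alpha."""
    return [(1.0 - alpha) * a + alpha * b
            for a, b in zip(class_logits, concept_logits)]

def harmonic_mean(base_acc, new_acc):
    """Base-to-new harmonic mean used to summarize generalization."""
    return 2.0 * base_acc * new_acc / (base_acc + new_acc)

# Defaults quoted in the abstract: lambda = 0.5, alpha = 0.1.
LAMBDA = 0.5
reg = LAMBDA * consistency_loss([[1.0, 0.0]], [[0.0, 1.0]])
fused = fuse_logits([1.0, 0.0], [0.0, 1.0], alpha=0.1)
```

The harmonic mean penalizes imbalance: a model with 80% base accuracy and 60% new-class accuracy scores about 68.6, below the arithmetic mean of 70.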

cs.GR 2026-06-22

Drone swarms illuminate line drawings mid-air

by Hamed Alimohammadzadeh, Shahram Ghandeharizadeh

Line Drawings using LightBenders: Authoring and Illuminating

Hardware and software achieve 10.1 mm misalignment that users rate 8 out of 10 for visual quality.

This study presents the hardware and software architecture of a transformative system for illuminating line drawings and letterforms. These mid-air illuminations are indoors and might be animated. The hardware contribution is a drone equipped with servo-actuated rod joints and a dense, addressable LED strip that enables arbitrary orientation, a LightBender. The software contributions are threefold. First, the system implements algorithms and heuristics to estimate the minimum number of LightBenders required to render a line drawing or letterform, stagger swarm formations to mitigate LightBender downwash, generate Swarm Flight and Lighting (SFL) files, and execute these files using a swarm of LightBenders to illuminate line drawings and letterforms. Second, a Blender add-on enables users to register LightBenders, author graphics and animations represented by swarms of LightBenders, and deploy the swarm for illumination through one-click functions. Third, users may import SVG files into either the Blender add-on or a standalone LB-Author tool to illuminate line drawings directly from vector graphics. We present results from an IRB-approved human subject study (n=21) to evaluate the impact of LightBender misalignment on the perceived illuminations. Obtained results demonstrate that the system's 10.1 mm maximum misalignment is perceptually acceptable across tested illuminations, with a median quality rating of 8 on a 0-10 scale.

cs.GR 2026-06-22

Diffusion model fixes lighting for 3D object transfers

by Nicolás Violante, George Kopanas +3 more

Lighting-Consistent Object Transfer Across Radiance Fields

Harmonized views of pasted objects are consolidated into a coherent new 3D Gaussian Splatting scene.

3D Gaussian Splatting (3DGS) is widely used to capture and render real scenes. Compositing objects from one capture into another has applications in many domains, such as VFX, architecture and interior design, or marketing. However, extracting an object from a source scene and naively pasting it into a target scene will fail to produce realistic results due to the different lighting conditions between the two scenes. To address this problem, we introduce a diffusion model that harmonizes naively composited images with inconsistent lighting. The model is trained with a heterogeneous dataset of image pairs (inconsistent composite input, consistent output), combining synthetic, generated, and real data. Our complete 3D solution allows a user to extract an object from the source scene and composite it into the target scene. From this, the (inconsistent) views of the target scene with the composite object are rendered. Our diffusion model harmonizes each one of these views, which are finally consolidated in a 3DGS representation with a post-optimization step. Our method provides visually compelling results, making object transfer between 3DGS easy to use and significantly improving quality compared to previous methods.

cs.GR 2026-06-22

Mesh2GS turns meshes into 3DGS at Nyquist sampling rates

by Haoran Zhu, Youcheng Cai +3 more

Mesh2GS: White-Box 3DGS Construction via Plenoptic Sampling

Plenoptic theory sets the exact view count and Gaussian layout needed for real-time global illumination from mesh input.

3D Gaussian Splatting (3DGS) has emerged as a promising method for high-quality, real-time 3D reconstruction. To associate 3DGS with mesh representations, existing methods primarily focus on 3DGS-to-mesh reconstruction from multi-view images. In contrast, the problem of converting a mesh into 3DGS has received comparatively less attention. Instead of relying on heuristic strategies that bind 3D Gaussians to the mesh, we propose a novel white-box 3DGS construction framework, termed Mesh2GS, which generates 3DGS directly from mesh geometry based on plenoptic sampling theory, achieving Nyquist-level performance for high-quality global illumination rendering. First, we propose a plenoptic-sampling-guided 3DGS construction strategy that theoretically derives the minimum sampling rate of the sampled views and the distribution of 3D Gaussians. Second, we propose a novel 3DGS update procedure with albedo-shading decomposition for efficient global-illumination capture. Finally, we introduce a neural illumination enhancement module to handle non-Lambertian effects. Experimental results demonstrate that our method surpasses state-of-the-art baselines and is practically effective both for real-time shared rendering and for capturing non-Lambertian effects such as specular highlights. The project code will be released upon acceptance.

cs.GR 2026-06-22

Guard restricts bit-flip damage in 3D Gaussian splatting to 11.68% frame

by Faruk Alpay, Baris Basaran

Single-Event Upsets in 3D Gaussian Splatting Rendering: Bit-Level Criticality, Spatial Extent, and a Parallel Support Guard

Clamping parameters to training ranges stops single primitives from dominating the image even after 20,000 simultaneous upsets.

Three-dimensional Gaussian splatting is a standard real-time scene representation increasingly deployed on hardware exposed to transient faults, such as spaceborne processors and robotic edge devices where silent data corruption occurs. A trained model is a large array of floating-point parameters in GPU memory, where a single-event upset corresponds to a single flipped bit. This paper measures these effects and constructs a defense. A GPU-resident parallel fault-injection engine applies over 3.8 million controlled single-bit upsets across four scenes, six fields, all bit positions, and three numeric formats (fp32, fp16, bf16), using 5.3 GPU-hours. The effect is highly concentrated: most upsets leave the image perceptually unchanged due to high redundancy, but flips in a small set of high-order bits, principally the sign bit of the logarithmic scale, enlarge a single primitive to cover up to 75.7% of the frame. A closed-form perturbation bound derived from the IEEE-754 layout and pipeline activations predicts this per-bit ordering. This concentration motivates a support guard: a per-primitive clamp of each parameter to the coordinate box observed during training, costing 76 µs per frame. Over 768,000 guarded upsets, the worst corruption footprint is restricted to 11.68% of the frame. We prove the guard leaves clean models unchanged and prevents frame-covering corruption. Under an accumulated dose of 20,000 simultaneous upsets, the unguarded renderer degrades to 10.6 dB, whereas the guarded renderer remains at 21.8 dB. The corruption footprint also dictates the number of tile/compositing nodes contaminated in distributed renderers, where the per-node guard contains it.
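
A small host-side illustration shows why the sign bit of a log-scale parameter is so damaging and how a range clamp contains it. This is a sketch under assumptions, not the paper's GPU engine: `flip_bit` and `support_guard` are hypothetical names, and the training range used below is made up for the example.

```python
import math
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = LSB, 31 = sign) of the IEEE-754 float32 encoding."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    return struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))[0]

def support_guard(value: float, lo: float, hi: float) -> float:
    """Clamp a parameter to the [lo, hi] range observed during training."""
    return min(max(value, lo), hi)

# A Gaussian with log-scale -3.0 has extent exp(-3), roughly 0.05 units.
log_scale = -3.0
corrupted = flip_bit(log_scale, 31)          # sign-bit upset
blowup = math.exp(corrupted) / math.exp(log_scale)

# Hypothetical training range for this field.
LO, HI = -6.0, -1.0
guarded = support_guard(corrupted, LO, HI)
```

The single flipped bit turns `-3.0` into `3.0`, scaling the primitive by `exp(6)`, about 403 times. The clamp pulls the value back to the upper bound of the training range, while any value already inside the range passes through unchanged, which matches the abstract's claim that clean models are unaffected.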

cs.GR 2026-06-22

Framework unifies 3D splats with meshes and fluids for full-scene physics

by Xiaoyang Liu, Shangzhe Wu +1 more

Scene-Level Heterogeneous Physics Simulation with 3D Gaussian Splats

A single particle set lets captured environments, deformable splats, and other assets interact realistically for the first time.

3D Gaussian Splatting (3DGS) has achieved state-of-the-art photorealistic rendering, but the representation gap prevents these assets from being physically interactive. Production-grade physics engines do not understand the 3DGS representation, while prior physics-for-3DGS methods are monolithic silos. These prior works are fundamentally limited, demonstrating only object-centric physics in isolated environments, such as on an ideal plane. They are incapable of interacting with complex static collision geometry or heterogeneous assets. We propose a novel framework that, for the first time, bridges this gap by enabling 3DGS assets to participate in scene-level, heterogeneous, multi-solver physical simulations. Our core contribution is a Representation Abstraction Framework that translates all diverse assets, including 3DGS, virtual meshes, and fluids, into a unified physical particle set. This abstraction is key to enabling complex behaviors, such as the non-rigid deformation of 3DGS assets, within a unified physics pipeline. This particle set, along with the static scene collision boundaries derived from scene capture, is processed within a solver-agnostic physics kernel. The physical results are then mapped back to drive each asset's specific visual reconstruction. This architecture unlocks capabilities impossible with prior art. We demonstrate complex, two-way interactions between deformable 3DGS assets, standard CG assets such as fluids and meshes, and large-scale captured static environments, showcasing realistic coupled phenomena that were previously unattainable.

cs.CV 2026-06-22

Physics engine creates microscope data that detectors use on real images

by Caio Silva

OSOG: A Differentiable, Physics-Informed Synthetic Data Engine for Micro-Optical Environments

YOLO models trained only on OSOG synthetic output transfer zero-shot to occluded Lysozyme micrographs without fine-tuning or adaptation.

Deep learning in computational microscopy is severely constrained by the scarcity of densely annotated datasets. While synthetic data generation has bridged this gap in macroscopic computer vision, traditional graphics engines rely on geometric ray-tracing, failing to capture the micro-optical phenomena required for microscopy. Conversely, while wave-optics formulations exist, rendering them computationally tractable at the scale required for deep learning remains a massive systems challenge. To address this, we introduce the Optical Synthetic Object Generator (OSOG), a high-performance, fully differentiable forward-modeling engine. Drawing on established physical models of diffraction and phase retardation, OSOG maps continuous Optical Path Difference (OPD) calculations into a highly optimized, PyTorch-native Structure-of-Arrays (SoA) architecture. We validate this computational framework across three axes. First, object detection models (YOLOv11-OBB) trained purely on OSOG-generated data achieve robust zero-shot transfer to real-world, highly occluded Lysozyme micrographs. Second, we introduce DiffOSOG, demonstrating that the engine's end-to-end differentiability allows for the exact recovery of continuous optical parameters via curriculum-guided inverse rendering. Finally, OSOG bypasses the $\mathcal{O}(N)$ bottlenecks of sequential ray-tracing, demonstrating sub-linear scaling by synthesizing 40,000 complex wave-optic particles in under 50 milliseconds (>20 FPS). By providing a fast, scalable, and physically grounded tensor pipeline, OSOG enables true real-time, on-the-fly dataset generation.

cs.GR 2026-06-22

Decoupling velocity from deformation lets avatars interact physically

by Sang-Hun Han, Min-Gyu Park +4 more

PIAvatar: Physically Interactive Avatars via Deformation Gradient Decoupling

Removes stress that blocks target poses and keeps closed-form skeletal tracking during contact.

3D human avatars have shown impressive visual fidelity driven by pose-conditioned models, yet they still lack the physical ability required for interactions with each other and with their environments. Although recent studies have made various attempts to incorporate physical characteristics into 3D avatars, they exhibit only limited physical deformations, often leading to constrained interaction behaviors. To resolve this issue, we present PIAvatar, a framework that simultaneously enables physically aware avatar-avatar and avatar-environment interactions and non-rigid deformable human body simulation. Our key insight is to decouple kinematic velocity from the deformation gradient. When external forces act on avatars, the kinematic velocity induces stress that hinders the avatar's ability to achieve a desired pose. In addition, we integrate a skeletal framework within the avatar. It allows pose estimation and real-time tracking in closed form, even during non-rigid physical interactions. Our approach is implemented within a conventional Material Point Method framework to ensure physically consistent dynamics. Finally, we evaluate the method on both human-object and human-human interaction scenarios to assess its behavior under diverse interaction settings.

cs.CG 2026-06-22

Door and non-adjacency rules yield floor plans via graph method

by Rohit Lohani, Krishnendra Shekhawat

DPLAN: Minimal Connectivity to Floorplan Generation

Minimal edge additions and separating-triangle removal turn connectivity constraints into non-overlapping rectangular or orthogonal layouts.

Automated floor plan generation is an important problem in computational architectural design. The goal is to construct a floor plan from user-defined room numbers and door requirements. The user specifies which rooms must share a door and which rooms must not be adjacent. However, these requirements do not determine the exact placement or shape of the rooms. The task is therefore to arrange the rooms in a single floor plan so that all required door connections are satisfied and no rooms overlap. To address this problem, we propose DPLAN (Door Connectivity to Floor Plan Generation), a graph-based prototype that generates floor plans from door and non-adjacency constraints. The framework operates in three stages. First, the user-defined graph is examined and, if disconnected, additional edges are added to connect its components. Second, a bi-connected plane triangulation is constructed to ensure the existence of a floor plan without overlapping rooms or empty spaces. Third, the triangulated graph is transformed into floor plans. For rectangular floor plans (RFPs), separating triangles are removed by modifying edges without adding new vertices, thereby avoiding the creation of extra rooms. For orthogonal floor plans (OFPs), separating triangles are removed by introducing additional vertices, allowing rectilinear room shapes. By enforcing both door and non-adjacency requirements, the framework generates floor plans that satisfy the given constraints. The method is implemented in Python and includes a prototype for interactive constraint specification and floor plan visualization. Currently, the framework supports rectangular plot boundaries. Future work includes support for non-rectangular plots, dimension-based scaling, and circulation modeling.
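
The first stage described above, connecting a disconnected door graph, can be sketched as follows. This is an illustrative reading of that stage and not DPLAN's implementation: the function names are hypothetical, and the greedy chaining of components is an assumption that may fail on inputs where a different linking order would succeed.

```python
from itertools import product

def components(rooms, doors):
    """Connected components of the door graph via depth-first search."""
    adj = {r: set() for r in rooms}
    for u, v in doors:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for start in rooms:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            node = stack.pop()
            comp.append(node)
            for nxt in adj[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        comps.append(sorted(comp))
    return comps

def connect(rooms, doors, non_adjacent):
    """Add one edge per extra component (k - 1 in total, the minimum),
    skipping any pair the user marked as non-adjacent."""
    banned = {frozenset(pair) for pair in non_adjacent}
    comps = components(rooms, doors)
    merged, added = list(comps[0]), []
    for comp in comps[1:]:
        link = next(((u, v) for u, v in product(merged, comp)
                     if frozenset((u, v)) not in banned), None)
        if link is None:
            raise ValueError("no admissible edge joins this component")
        added.append(link)
        merged.extend(comp)
    return added

rooms = ["living", "kitchen", "bed", "bath", "study"]
doors = [("living", "kitchen"), ("bed", "bath")]
non_adjacent = [("kitchen", "bath")]
extra = connect(rooms, doors, non_adjacent)
```

Here the input has three components, so two edges are added, and the forbidden kitchen-bath pair is skipped in favor of another admissible pair. The later triangulation and separating-triangle stages are outside the scope of this sketch.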

cs.CV 2026-06-22

Diffusion model conditions motion on body shape and gender

by Dongseok Shim, Julian Tanke +6 more

Odoriko: A Shape-Aware Multimodal Diffusion Framework for Human Motion

Odoriko produces text, music or video driven motions that match the specific mover's morphology in one unified model.

Human motion generation has been widely studied across diverse input modalities (text, music, and video), and recent efforts have unified these into single multimodal frameworks. However, while morphological factors such as gender and body shape are known to produce distinct kinematic signatures, no existing unified framework incorporates them into generation; all subjects are treated as morphologically equivalent. We present Odoriko, the first unified multimodal motion generation framework that reflects subject bio-morphological information directly in synthesized motion output. Rather than averaging over subject variation, Odoriko generates motion that is consistent with who is moving, not just what they are asked to do, across text, music, and video conditions within a single model. When explicit morphological information is unavailable, Odoriko additionally recovers subject morphology alongside motion, unifying estimation and generation in one framework. Extensive experiments across text-to-motion, music-to-dance, and video-to-motion benchmarks demonstrate that Odoriko matches or exceeds prior specialized models on standard metrics, while enabling morphology-consistent generation that no existing unified framework supports.

cs.CV 2026-06-19

Stochastic ray SDF models improve surface reconstruction

by Hiroki Sakuma, Masatoshi Okutomi

Stochastic Signed Distance Processes

First-passage probabilities derived via Bayesian filtering outperform deterministic volume rendering on DTU and MobileBrick benchmarks.

Multi-view surface reconstruction is a core problem in computer vision. One prominent line of work represents the surface implicitly as a signed distance field (SDF), optimizing it based on the photometric loss between rendered and observed pixel colors. These approaches typically employ SDF-based volume rendering to obtain a differentiable relaxation of discontinuous visibility along rays, thereby reducing reliance on silhouette supervision. In this paper, we reformulate SDF-based volume rendering as probabilistic surface rendering, where each pixel color is modeled as a mixture distribution induced by the random first ray-surface intersection. To this end, we introduce Stochastic Signed Distance Processes (SSDP), which model the SDF along each ray as a stochastic process, inducing a first-passage-time distribution for each ray. We then derive the first-passage probability for each sampling interval based on Bayesian filtering, together with its practical approximation for parallel rendering. We further show that NeuS, an existing SDF-based volume rendering method, arises as a special case of our formulation. Experiments on the DTU and MobileBrick datasets demonstrate that our method outperforms baselines in both surface reconstruction and uncertainty quantification, supporting the effectiveness of our first-passage formulation. Our code is available at https://github.com/skmhrk1209/SSDP.
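
For context on the special case mentioned above, the sketch below computes the discrete rendering weights of NeuS, the deterministic baseline, from SDF samples along one ray. It is not the SSDP method itself. Each weight is the transmittance times a per-interval opacity built from a logistic function of the SDF, and can be read as the probability mass assigned to the ray first meeting the surface in that interval. The function names and the sharpness value `s=10.0` are assumptions for illustration.

```python
import math


def logistic(x, s):
    """Logistic CDF with sharpness s, as used for the NeuS opacity."""
    return 1.0 / (1.0 + math.exp(-s * x))


def neus_weights(sdf_samples, s=10.0):
    """Per-interval rendering weights from SDF values at consecutive samples."""
    weights = []
    transmittance = 1.0  # probability that the ray is still unoccluded
    for sdf_near, sdf_far in zip(sdf_samples, sdf_samples[1:]):
        phi_near = logistic(sdf_near, s)
        phi_far = logistic(sdf_far, s)
        # Discrete opacity: positive only where the SDF decreases along the ray.
        alpha = max((phi_near - phi_far) / phi_near, 0.0)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights


# A ray that starts outside the surface and crosses it between samples 2 and 3.
weights = neus_weights([0.5, 0.3, 0.1, -0.1, -0.3])
print(weights, sum(weights))
```

In this example the largest weight falls on the interval where the SDF changes sign, and the weights sum to less than one because some mass remains for samples beyond the last one.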

cs.LG 2026-06-19

Group elements as tokens enable closed-form attention on affine groups

by Przemyslaw Musialski

The Token Is a Group Element: On Lie-Algebra Attention over Matrix Lie Groups

The relative pose logarithm supplies an intrinsic score that matches learned kernels with 50-80x fewer parameters and preserves exact invari

We place the attention token on the group: a token is an element $g_i$ of a matrix Lie group $G$ -- a bare transformation, with no feature payload and no external action $\rho(g)$ carrying it. To our knowledge this is the first attention construction whose tokens are bare matrix Lie group elements: their score is the closed-form algebra norm of the relative pose rather than a learned kernel, and it reaches the affine full-frame groups that every irrep- or surjective-exp-based method must exclude. We call it Lie-Algebra Attention. Once tokens are group elements, the rest follows with none of the usual representation-theoretic machinery. The relative geometry of a pair is canonical, $g_i^{-1} g_j$, so the pairwise invariant $w_{ij} = \log(g_i^{-1} g_j)$ is intrinsic rather than designed; equivariance under the diagonal $G$-action is tautological, and the cocycle condition holds automatically. The attention score is the negative squared algebra norm, $s_{ij} = -\|\log(g_i^{-1} g_j)\|_\lambda^2/\tau$: the canonical proximity kernel under a block-weighted Frobenius inner product, with no irreducible representations, spherical harmonics, Clebsch-Gordan products, or learned kernel. The construction applies to any matrix Lie group on a chosen logarithm chart containing the relative poses, including the non-compact non-abelian affine groups with scale and shear that no vector-token attention method reaches: neither the irrep tradition nor surjective-exp methods. Three sequence-completion experiments, on SE(2), SO(3), and Aff(2), bear this out: the closed-form score matches a learned MLP kernel on the same invariant and outperforms it on SE(2), using 50 to 80x fewer score parameters, while a vector-token baseline breaks invariance by five to twelve orders of magnitude.
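
The score above can be sketched for SE(2), where the matrix logarithm has a closed form. The code below is a minimal illustration, not the authors' implementation: tokens are stored as `(theta, x, y)` tuples, the block weights `lam_rot` and `lam_trans` stand in for the block-weighted Frobenius inner product, and all function names are assumptions. It computes $s_{ij} = -\|\log(g_i^{-1} g_j)\|_\lambda^2/\tau$ and applies a softmax over $j$.

```python
import math


def se2_inverse(g):
    """Inverse of an SE(2) element stored as (theta, x, y)."""
    theta, x, y = g
    c, s = math.cos(theta), math.sin(theta)
    # Inverse rotation is R(-theta); inverse translation is -R(-theta) t.
    return (-theta, -(c * x + s * y), s * x - c * y)


def se2_compose(a, b):
    """Group product a * b."""
    ta, xa, ya = a
    tb, xb, yb = b
    c, s = math.cos(ta), math.sin(ta)
    return (ta + tb, xa + c * xb - s * yb, ya + s * xb + c * yb)


def se2_log(g):
    """Closed-form logarithm: returns (omega, u, v) in se(2)."""
    theta, x, y = g
    theta = math.atan2(math.sin(theta), math.cos(theta))  # wrap the angle
    if abs(theta) < 1e-9:
        return (theta, x, y)  # near the identity rotation, V is the identity
    a = math.sin(theta) / theta
    b = (1.0 - math.cos(theta)) / theta
    det = a * a + b * b
    # Solve V (u, v) = (x, y) with V = [[a, -b], [b, a]].
    return (theta, (a * x + b * y) / det, (-b * x + a * y) / det)


def score(gi, gj, lam_rot=1.0, lam_trans=1.0, tau=1.0):
    """Negative squared algebra norm of the relative pose, divided by tau."""
    omega, u, v = se2_log(se2_compose(se2_inverse(gi), gj))
    # Frobenius norm of [[0, -omega, u], [omega, 0, v], [0, 0, 0]], per block.
    norm_sq = lam_rot * 2.0 * omega * omega + lam_trans * (u * u + v * v)
    return -norm_sq / tau


def attention_weights(tokens, **kwargs):
    """Row-wise softmax of the pairwise scores."""
    rows = []
    for gi in tokens:
        scores = [score(gi, gj, **kwargs) for gj in tokens]
        top = max(scores)
        exps = [math.exp(sc - top) for sc in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


tokens = [(0.1, 0.0, 0.0), (0.5, 1.0, 2.0), (-1.0, 0.3, -0.7)]
h = (0.7, 2.0, -1.0)  # an arbitrary global transformation
moved = [se2_compose(h, g) for g in tokens]
print(attention_weights(tokens)[0])
print(attention_weights(moved)[0])
```

Because the score depends only on $g_i^{-1} g_j$, applying the same transformation `h` to every token on the left leaves the weights unchanged up to floating-point error, which is what the two printed rows illustrate.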

cs.CV 2026-06-19

One image yields synthetic data matching real detector performance

by Keqin Zeng, Shuting Su +3 more

One Image is All You Need: Agentic One-Shot Image Generation via Text-Based World Models for Long-Tail Spatial Perception

Text-based world model expands a single reference into physically constrained long-tail scenes for training spatial perception models.

Reliable spatial decision automation, such as autonomous driving and maritime surveillance, critically depends on robust visual perception. However, real-world spatiotemporal data exhibits severe heterogeneity, often manifesting as extreme long-tail distributions for safety-critical scenarios. This data scarcity induces dataset shift that degrades detection performance and poses safety risks. While synthetic data generation offers a potential solution, existing generative approaches, such as diffusion models and Generative Adversarial Networks (GANs), often lack explicit spatial grounding and structural constraints, resulting in spatial and physical inconsistencies in generated scenes. To address these challenges, we introduce WMGen-v1, an agentic text-based world model framework for long-tail spatial data generation. WMGen-v1 employs a Large Vision-Language Model (LVLM) to construct a structured scene representation from a single reference image, while a Large Language Model (LLM) performs guidance-based scene expansion under physical plausibility and commonsense constraints. Subsequently, conditioned on the structured semantic representations produced by this reasoning process, a diffusion model generates diverse and physically grounded long-tail training data. Experiments on internal industrial datasets, ROADWork, and LaRS benchmarks demonstrate that WMGen-v1 outperforms baseline approaches. Notably, detectors trained solely on WMGen-v1 synthetic data approach real-only performance on aggregate dataset-level metrics, highlighting its potential to alleviate long-tail data scarcity for downstream spatial perception.

cs.CV 2026-06-19

TriFlow turns geometry inputs into artist-like 3D meshes via vector fields

by Haoxuan Li, Ziya Erkoç +5 more

TriFlow: Generating Artist-Like 3D Mesh Topology via Nearest-Vertex Vector Fields

A flow-matching model creates nearest-vertex fields that cluster into clean topologies, cutting error by 90 percent and running eight times

We present TriFlow, a new generative approach for producing compact 3D meshes with artist-like triangle topology directly from input geometry conditions such as signed distance fields. Our key insight is to represent mesh topology as a nearest-vertex vector field (NVF) defined over the surface, where each point encodes its association to the nearest triangle vertex in the local barycentric frame. We train a latent flow-matching model to synthesize this field, enabling topology generation conditioned on the input geometry. To extract a coherent mesh, we cluster surface regions using the generated NVF and guide a constrained quadric error metric (QEM) mesh simplification with topology-aware optimization. This yields output meshes that closely match the input geometry while exhibiting structured, artist-like connectivity. Experiments demonstrate that TriFlow achieves stronger generalization and significantly improved topology quality compared to state-of-the-art learning-based approaches, alongside 90% lower Chamfer Distance and an 8x speedup.
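
The Chamfer Distance reported above compares two point sets sampled from the output and reference surfaces. The sketch below shows one common variant, the sum of the two mean nearest-neighbor distances. Conventions differ (squared versus unsquared distances, sums versus means), and the abstract does not say which one TriFlow uses, so this is an assumption for illustration. The brute-force search is quadratic and only suitable for small point sets.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer Distance between two point sets (unsquared, mean form)."""

    def mean_nearest(source, target):
        # Average distance from each source point to its closest target point.
        return sum(min(math.dist(p, q) for q in target) for p in source) / len(source)

    return mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a)


reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
shifted = [(x, y, z + 0.1) for x, y, z in reference]
print(chamfer_distance(reference, reference))
print(chamfer_distance(reference, shifted))
```

Identical point sets give zero, and the value grows as the sampled surfaces drift apart.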
