arxiv: 2511.16624 · v1 · submitted 2025-11-20 · 💻 cs.CV · cs.AI

SAM 3D: 3Dfy Anything in Images

Pith reviewed 2026-05-11 11:41 UTC · model grok-4.3

keywords: 3D reconstruction, single image, generative model, object shape, annotation pipeline, natural images, computer vision, 3D ground truth

The pith

SAM 3D generates geometry, texture and layout of objects from one natural image.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SAM 3D as a generative model that reconstructs 3D objects with shape, surface appearance and spatial arrangement from a single photograph of a real scene. To overcome scarce training data, the authors built a pipeline that interleaves human judgment and model predictions to label object shape, texture and pose across many natural images containing occlusions and clutter. Training proceeds in stages: initial learning on synthetic examples followed by alignment to the real annotated set. The resulting model produces outputs that human evaluators prefer to those of earlier methods at a rate of at least five to one.

Core claim

SAM 3D is a generative model for visually grounded 3D object reconstruction that predicts geometry, texture, and layout from a single image. It is trained on a large collection of such reconstructions obtained through a human- and model-in-the-loop annotation pipeline, using synthetic pretraining followed by real-world alignment to break the previous data barrier for natural images.

What carries the argument

The human- and model-in-the-loop annotation pipeline that supplies accurate 3D ground truth for natural images with occlusion and clutter, which in turn supports the multi-stage training of the generative reconstruction model.

If this is right

  • Superior 3D reconstruction quality on cluttered natural images compared with earlier single-image methods.
  • At least a 5:1 preference margin in blind human evaluations on real-world objects and scenes.
  • Public release of code, trained weights, an interactive demo, and a new benchmark dataset for in-the-wild 3D reconstruction.
  • Demonstration that combining synthetic pretraining with large-scale real annotations overcomes the data limitation for 3D vision tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same annotation loop could be applied to video sequences to obtain consistent 3D models over time.
  • The released benchmark could become a standard test set that future single-image 3D methods must surpass.
  • Scaling the pipeline further might support training models that reconstruct entire scenes rather than isolated objects.

Load-bearing premise

The human- and model-in-the-loop annotation pipeline produces accurate, unbiased 3D ground truth at scale for natural images with occlusion and clutter.

What would settle it

If human preference tests on real-world objects and scenes show no clear advantage or a reversal of the reported 5:1 win rate for SAM 3D over recent prior work, the performance claim would not hold.
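
To make the threshold concrete: a 5:1 win rate corresponds to a win share of 5/6, about 83%. The sketch below is illustrative and not taken from the paper. It converts the odds to a share and computes a Wilson score interval, which indicates how tightly a given number of pairwise comparisons would pin that share down relative to the 50% no-preference line. The comparison counts (60 and 600) are hypothetical, and ties are assumed to be excluded.

```python
import math


def odds_to_share(wins_per_loss: float) -> float:
    """Convert a win:loss ratio (e.g. 5.0 for 5:1) into a win share in [0, 1]."""
    return wins_per_loss / (wins_per_loss + 1.0)


def wilson_interval(wins: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = wins / total
    denom = 1.0 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    share = odds_to_share(5.0)
    # Hypothetical study sizes; the paper's actual counts are not given here.
    for total in (60, 600):
        wins = round(share * total)
        lo, hi = wilson_interval(wins, total)
        print(f"n={total}: share={wins / total:.3f}, 95% interval=({lo:.3f}, {hi:.3f})")
```

A replication whose interval overlaps 0.5, or sits below it, would be the kind of result that undercuts the reported margin.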

Original abstract

We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents SAM 3D, a generative model for single-image 3D reconstruction of objects that outputs geometry, texture, and layout. It introduces a scalable human- and model-in-the-loop annotation pipeline to generate 3D ground-truth data for natural images with occlusion and clutter, trains via synthetic pretraining followed by real-world alignment, and reports at least a 5:1 win rate in human preference tests over prior work on real-world scenes. Code, weights, demo, and a new benchmark are to be released.

Significance. If the data quality and performance claims hold, the work would meaningfully advance in-the-wild 3D reconstruction by addressing the scarcity of accurate 3D annotations for natural images, enabling stronger context-aware models; the planned public releases would further support reproducibility and downstream applications in vision and graphics.

major comments (3)
  1. [Abstract] Abstract: the headline claim of 'at least a 5:1 win rate in human preference tests' is presented without any quantitative metrics (e.g., number of raters, total comparisons, confidence intervals, or inter-rater agreement), error analysis, or evaluation protocol details, which are required to assess whether the result supports the central performance assertion.
  2. [Annotation pipeline] Annotation pipeline description (likely §3–4): the human-model-in-the-loop process is asserted to produce accurate, unbiased 3D ground truth at scale for occluded natural images, yet no external validation (multi-view consistency checks, depth-sensor comparisons, or laser-scan ground truth on a held-out set) is reported; because downstream synthetic-pretrain + real-alignment stages and the 5:1 preference result depend directly on this data fidelity, the absence of such verification is load-bearing.
  3. [Evaluation] Evaluation section: the human preference tests are the sole quantitative evidence offered for superiority over recent work, but without objective metrics (e.g., Chamfer distance, IoU on reconstructed meshes, or pose accuracy on a standard benchmark) or ablation isolating the contribution of the new data versus the training schedule, it is difficult to determine whether the gains are robust or artifactual.
minor comments (2)
  1. [Abstract and Introduction] The abstract and introduction would benefit from a concise statement of the precise architectural differences from prior single-image 3D methods (e.g., explicit comparison to recent diffusion-based or NeRF-based baselines).
  2. [Figures] Figure captions and legends should include more detail on what is being visualized (e.g., input image, predicted mesh, texture map, and any overlaid ground-truth annotations) to aid readers in interpreting qualitative results.
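
Two of the objective metrics the referee names can be sketched in a few lines. The following is a minimal illustration, not the paper's evaluation code. It uses one common Chamfer convention (the sum of the two directional mean nearest-neighbour Euclidean distances; other works use squared distances or an average) and a voxel-occupancy IoU. The function names and the brute-force search are assumptions made for readability; a real evaluation would use a spatial index and a fixed point-sampling protocol.

```python
import math

Point = tuple[float, float, float]
Voxel = tuple[int, int, int]


def chamfer_distance(a: list[Point], b: list[Point]) -> float:
    """Symmetric Chamfer distance between two point sets (brute force, O(|a||b|))."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")

    def one_way(src: list[Point], dst: list[Point]) -> float:
        # Mean distance from each source point to its nearest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


def voxel_iou(a: set[Voxel], b: set[Voxel]) -> float:
    """Intersection over union of two sets of occupied voxel indices."""
    union = a | b
    if not union:
        return 1.0  # two empty grids are treated as identical
    return len(a & b) / len(union)


if __name__ == "__main__":
    predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    reference = [(0.0, 0.0, 0.0), (1.0, 0.5, 0.0)]
    print("Chamfer:", chamfer_distance(predicted, reference))
    print("IoU:", voxel_iou({(0, 0, 0), (1, 0, 0)}, {(1, 0, 0), (2, 0, 0)}))
```

Lower Chamfer distance and higher IoU indicate closer agreement with the reference shape; neither captures texture or perceptual quality, which is why they complement rather than replace preference tests.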

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, with clear indications of planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the headline claim of 'at least a 5:1 win rate in human preference tests' is presented without any quantitative metrics (e.g., number of raters, total comparisons, confidence intervals, or inter-rater agreement), error analysis, or evaluation protocol details, which are required to assess whether the result supports the central performance assertion.

    Authors: The abstract is intentionally concise and summarizes the primary result, while the full quantitative details—including the number of raters, total comparisons, confidence intervals, inter-rater agreement, error analysis, and evaluation protocol—are provided in the Evaluation section. To address the concern, we will revise the abstract to include a brief reference to the human study scale and key supporting statistics from the main text. revision: yes

  2. Referee: [Annotation pipeline] Annotation pipeline description (likely §3–4): the human-model-in-the-loop process is asserted to produce accurate, unbiased 3D ground truth at scale for occluded natural images, yet no external validation (multi-view consistency checks, depth-sensor comparisons, or laser-scan ground truth on a held-out set) is reported; because downstream synthetic-pretrain + real-alignment stages and the 5:1 preference result depend directly on this data fidelity, the absence of such verification is load-bearing.

    Authors: We agree that additional external validation would increase confidence in the data fidelity. The pipeline incorporates internal model-assisted consistency checks during annotation. In the revised manuscript, we will add a validation subsection that reports multi-view consistency metrics and comparisons to depth-sensor data on a held-out set to directly address this point. revision: yes

  3. Referee: [Evaluation] Evaluation section: the human preference tests are the sole quantitative evidence offered for superiority over recent work, but without objective metrics (e.g., Chamfer distance, IoU on reconstructed meshes, or pose accuracy on a standard benchmark) or ablation isolating the contribution of the new data versus the training schedule, it is difficult to determine whether the gains are robust or artifactual.

    Authors: Human preference evaluation is the most appropriate primary metric for assessing perceptual quality of 3D reconstructions in natural, cluttered scenes. We will nevertheless expand the Evaluation section to include objective metrics such as Chamfer distance and pose accuracy on standard benchmarks, along with ablations that separate the contributions of the new annotated data from the training schedule. revision: yes
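
Inter-rater agreement, one of the statistics requested in the first exchange, is commonly summarized with Cohen's kappa when two raters label the same items. The sketch below is an assumption about how such a check could look, not the authors' protocol. The labels "A" and "B" (which reconstruction a rater preferred) and the sample votes are hypothetical.

```python
import math
from collections import Counter


def cohens_kappa(rater1: list[str], rater2: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater1)
    # Observed agreement: fraction of items on which both raters chose the same label.
    observed = sum(x == y for x, y in zip(rater1, rater2)) / n
    # Chance agreement: product of each rater's marginal label frequencies, summed over labels.
    c1, c2 = Counter(rater1), Counter(rater2)
    expected = sum(c1[label] * c2[label] for label in c1.keys() | c2.keys()) / n**2
    if expected == 1.0:
        return math.nan  # degenerate case: both raters used a single identical label
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical votes on four pairwise comparisons.
    votes_1 = ["A", "A", "A", "B"]
    votes_2 = ["A", "A", "B", "B"]
    print("kappa:", cohens_kappa(votes_1, votes_2))
```

Kappa corrects raw agreement for chance, which matters when one method wins most comparisons: two raters who both pick the same method almost every time will agree often even if they are not judging consistently.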

Circularity Check

0 steps flagged

No significant circularity; claims rest on independent empirical data generation, training, and external human evaluation.

Full rationale

The paper presents an annotation pipeline to generate 3D data at scale from natural images, followed by multi-stage training of a generative model and evaluation via human preference tests. No equations, derivations, or self-referential definitions are present in the provided text that would reduce any prediction or result to fitted inputs or prior outputs by construction. The data pipeline, training framework, and preference-based evaluation are described as sequential and externally validated steps without load-bearing self-citations or ansatzes that collapse the logic. This is a standard empirical ML paper structure with no detectable circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Central claim depends on the annotation pipeline yielding high-quality 3D labels and on the assumption that synthetic pretraining plus real alignment suffices to overcome data scarcity.

free parameters (1)
  • model training hyperparameters
    Standard deep learning parameters tuned on the new dataset.
axioms (1)
  • domain assumption: single natural images contain sufficient visual cues for accurate 3D object reconstruction despite occlusion and clutter
    Invoked implicitly as the basis for the task and evaluation on real-world scenes.


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. One Video, One World: Turning Monocular Video into Physical 4D Scenes

    cs.CV 2026-06 unverdicted novelty 8.0

    OVOW reconstructs instance-level, simulation-ready 4D mesh scenes from monocular video via a four-stage training-free pipeline and introduces a new benchmark for structured Video-to-4D evaluation.

  2. Is It Real? Exploiting Virtual-Physical Discrimination Vulnerability in Mixed Reality

    cs.HC 2026-06 unverdicted novelty 8.0

    Mixed reality headsets have a virtual-physical discrimination vulnerability that can be exploited to alter user behavior, demonstrated through four proof-of-concept attacks on Apple Vision Pro with 85-100% success rat...

  3. ViPS: Video-informed Pose Spaces for Auto-Rigged Meshes

    cs.CV 2026-04 unverdicted novelty 8.0

    ViPS distills a compact, controllable distribution of valid joint configurations for any auto-rigged mesh from video diffusion priors, matching 4D-trained methods in plausibility while generalizing zero-shot to unseen...

  4. neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing

    cs.CV 2026-04 unverdicted novelty 8.0

    neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.

  5. WarpHammer: Densifying Scene Warps with 3D Object Priors for Extreme View Synthesis

    cs.CV 2026-06 unverdicted novelty 7.0

    WarpHammer densifies scene warps with 3D object priors from generative models and fuses pose-unknown auxiliary views via multi-view geometry to enable stable extreme novel view synthesis.

  6. DreamUV: Unwrap Artist-like UV by End-to-End Flow Matching

    cs.CV 2026-06 unverdicted novelty 7.0

    DreamUV uses end-to-end flow matching to generate UV parameterizations that match stylistic patterns in professionally authored layouts rather than purely minimizing geometric distortion.

  7. Thinking in Boxes: 3D Editing in Real Images Made Easy

    cs.CV 2026-06 unverdicted novelty 7.0

    A method that treats 3D box pairs as exact transformation specs, adds a depth-aware floor reference, and trains an image generator on synthetic scenes plus Objectron videos to perform large 3D edits on real photographs.

  8. Human Universal Grasping

    cs.RO 2026-06 unverdicted novelty 7.0

    HUG trains a flow-matching model on a new 1M-frame egocentric human grasp dataset to generate retargetable grasps from single RGB-D images, beating baselines by 23-34% on a new 90-object benchmark.

  9. World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible

    cs.CV 2026-06 unverdicted novelty 7.0

    World Tracing introduces a multi-layer pixel-aligned 3D point representation instantiated via a diffusion transformer (WT-DiT) trained with pixel-space flow matching to jointly reconstruct visible surfaces and generat...

  10. EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations

    cs.RO 2026-06 unverdicted novelty 7.0

    EgoEngine transforms egocentric human videos into high-fidelity robot data enabling zero-shot visuomotor dexterous policy learning without real-robot demonstrations.

  11. SymTRELLIS: Symmetry-Enforced Voxel Latents for 3D Generation

    cs.GR 2026-06 unverdicted novelty 7.0

    SymTRELLIS enforces finite point-group symmetries during TRELLIS.2 generation via a learned linear latent-space mapper and velocity symmetrization, reducing symmetry errors on a 266-object benchmark while preserving r...

  12. Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models

    cs.CV 2026-06 unverdicted novelty 7.0

    SEIG uses staged VLM prompting to output executable Blender programs that reconstruct editable 3D scenes from single images, showing improved fidelity over non-staged baselines.

  13. REST3D: Reconstructing Physically Stable 3D Scenes from a Single Image

    cs.CV 2026-05 unverdicted novelty 7.0

    REST3D reconstructs physically stable 3D scenes from single images via agentic scene-tree understanding and physics-constrained optimization.

  14. Geometry Matters: 3D Foundation Priors for Learning Semantic Correspondence

    cs.CV 2026-05 unverdicted novelty 7.0

    A 3D-aware framework uses SAM3D geometry and pose estimation plus geodesic filtering to supervise a lightweight adapter on DINO and Stable Diffusion features, improving semantic correspondence with less manual supervision.

  15. CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

    cs.CV 2026-05 unverdicted novelty 7.0

    CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.

  16. Stream3D: Sequential Multi-View 3D Generation via Evidential Memory

    cs.CV 2026-05 unverdicted novelty 7.0

    Stream3D is a training-free method that maintains temporal consistency in 3D generation from monocular streams by dynamically caching a fixed number of informative historical frames using an evidence score.

  17. SceneCode: Executable World Programs for Editable Indoor Scenes with Articulated Objects

    cs.AI 2026-05 unverdicted novelty 7.0

    SceneCode compiles natural language prompts into executable code programs that generate editable, articulated indoor scenes for physics simulation.

  18. MiXR: Harvesting and Recomposing Geometry from Real-World Objects for In-Situ 3D Design

    cs.HC 2026-05 unverdicted novelty 7.0

    MiXR enables in-situ 3D design by harvesting real-world geometry for user-defined compositions that generative AI then refines, outperforming text-only generative methods in control and fidelity per a 12-person study.

  19. OmniRobotHome: A Multi-Camera Platform for Real-Time Multiadic Human-Robot Interaction

    cs.RO 2026-04 unverdicted novelty 7.0

    A 48-camera residential platform delivers real-time occlusion-robust 3D perception and coordinated actuation for multi-human multi-robot interaction in a shared home workspace.

  20. DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation

    cs.CV 2026-04 unverdicted novelty 7.0

    DeVI enables zero-shot physically plausible dexterous control by imitating synthetic videos via a hybrid 3D-human plus 2D-object tracking reward.

  21. LEXIS: LatEnt ProXimal Interaction Signatures for 3D HOI from an Image

    cs.CV 2026-04 unverdicted novelty 7.0

    LEXIS-Flow uses VQ-VAE-learned interaction signatures to guide diffusion-based reconstruction of 3D human-object meshes and dense proximity fields from single RGB images, outperforming SOTA on benchmarks.

  22. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.

  23. RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 7.0

    RoboWM-Bench evaluates video world models by converting their manipulation video predictions into executable actions validated in simulation, showing that visual plausibility does not guarantee physical executability.

  24. ViPS: Video-informed Pose Spaces for Auto-Rigged Meshes

    cs.CV 2026-04 unverdicted novelty 7.0

    ViPS learns a universal, controllable pose space for auto-rigged meshes by transferring motion priors from video diffusion models, matching SOTA performance on plausibility and diversity while enabling zero-shot gener...

  25. Physically Grounded 3D Generative Reconstruction under Hand Occlusion using Proprioception and Multi-Contact Touch

    cs.CV 2026-04 unverdicted novelty 7.0

    A conditional diffusion model using proprioception and multi-contact touch produces metric-scale, physically consistent 3D object reconstructions under hand occlusion.

  26. WildDet3D: Scaling Promptable 3D Detection in the Wild

    cs.CV 2026-04 unverdicted novelty 7.0

    WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.

  27. Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    A training-free Visual Chain-of-Thought framework reconstructs high-fidelity 3D meshes from single images and iteratively synthesizes optimal novel views to enhance MLLM spatial comprehension on benchmarks like 3DSRBench.

  28. THOM: Generating Physically Plausible Hand-Object Meshes From Text

    cs.CV 2026-04 unverdicted novelty 7.0

    THOM is a training-free two-stage framework that generates physically plausible hand-object 3D meshes directly from text by combining text-guided Gaussians with contact-aware physics optimization and VLM refinement.

  29. Simulation-Ready Cluttered Scene Estimation via Physics-aware Joint Shape and Pose Optimization

    cs.RO 2026-02 unverdicted novelty 7.0

    SPARCS uses a differentiable contact model and sparse Hessian solver to jointly optimize shapes and poses of up to five interacting objects, producing physically valid simulation-ready reconstructions.

  30. Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning

    cs.CV 2026-01 conditional novelty 7.0

    VIGA introduces a training-free interleaved multimodal reasoning loop that improves vision-as-inverse-graphics accuracy over one-shot baselines on BlenderGym, SlideBench, and new BlenderBench.

  31. HandsOnWorld: Unconstrained Egocentric Video Generation with Camera-Disentangled Hand Control

    cs.CV 2026-07 unverdicted novelty 6.0

    HandsOnWorld creates a hand-controlled egocentric video generator from unconstrained monocular video via a new EgoVid-Pro dataset from monocular reconstruction and a Plücker Hand Map that disentangles camera and hand motion.

  32. ShellMaker: Language-Guided Exterior Completion under Structural Constraints

    cs.CV 2026-06 unverdicted novelty 6.0

    ShellMaker generates complete building exteriors from scaffolds and style prompts via parametric roofs, LLM prompt refinement, material retrieval, and geometry-aware assembly while preserving structural constraints.

  33. GROW$^2$: Grounding Which and Where for Robot Tool Use

    cs.RO 2026-06 unverdicted novelty 6.0

    GROW² hierarchically grounds open-world tool affordances by using VLMs for semantic selection of objects and parts followed by geometric localization with vision foundation models.

  34. SimFoundry: Modular and Automated Scene Generation for Policy Learning and Evaluation

    cs.RO 2026-06 unverdicted novelty 6.0

    SimFoundry automates zero-shot real-to-sim scene generation from video, producing digital twins and cousins that enable policy training with 0.911 mean Pearson correlation to real-world results and 17-40% success gain...

  35. HAT-4D: Lifting Monocular Video for 4D Multi-Object Interactions via Human-Agent Collaboration

    cs.CV 2026-06 unverdicted novelty 6.0

    HAT-4D presents an agentic VLM-plus-human-in-the-loop pipeline for monocular 4D multi-object interaction reconstruction and releases the MVOIK-4D benchmark.

  36. $\phi$-Scene: Physically Grounded Image-to-3D Scene Reconstruction

    cs.CV 2026-06 unverdicted novelty 6.0

    φ-Scene performs image-to-3D scene reconstruction via topology-driven physical assembly that resolves penetrations with SDF optimization and settles objects with rigid-body simulation.

  37. Do as I Do: Dexterous Manipulation Data from Everyday Human Videos

    cs.RO 2026-06 unverdicted novelty 6.0

    DO AS I DO reconstructs and retargets hand-object interactions from in-the-wild monocular RGB videos to produce dexterous robot manipulation trajectories, outperforming prior methods on ground-truth and online video datasets.

  38. EgoInfinity: A Web-Scale 4D Hand-Object Interaction Data Engine for Any-View Robot Retargeting and Video-to-Action Robot Learning

    cs.RO 2026-06 unverdicted novelty 6.0

    EgoInfinity is a modular pipeline that lifts in-the-wild RGB videos into agent-agnostic 4D hand-object data with interaction-aware refinement and retargets motions to diverse robot morphologies for video-to-action learning.

  39. Modality Forcing for Scalable Spatial Generation

    cs.CV 2026-06 unverdicted novelty 6.0

    Modality Forcing lets a single DiT produce image and depth outputs in any order after training on sparse real-world depth, with larger image-pretrained models yielding better depth accuracy and a 57% AbsRel reduction ...

  40. Surflo: Consistent 3D Surface Flow Model with Global State

    cs.CV 2026-06 unverdicted novelty 6.0

    Surflo compresses unposed RGB views into K global latent tokens and uses flow matching with photometric guidance to decode consistent arbitrary-resolution 3D surface points in one forward pass.

  41. Video2Sim2Real: Full-Stack Autonomous Dexterous Skill Acquisition from a Single Human Video

    cs.RO 2026-06 unverdicted novelty 6.0

    Video2Sim2Real turns a single human video into a deployable robot manipulation skill by reconstructing a digital twin, anchoring motions to object-centric simulator configurations, and bridging sim-to-real gaps with i...

  42. EgoAERO: Learning Dexterous Manipulation from a Single Egocentric Video without Object Assets

    cs.RO 2026-06 unverdicted novelty 6.0

    EgoAERO reconstructs contact-consistent hand-object trajectories from single egocentric RGB-D videos without object assets via asset-free tracking and adaptive optimization, then trains robot policies with two-stage r...

  43. HomeWorld: A Unified Floorplan-to-Furnished Framework for Generating Controllable, Densely Interactive Whole-Home Scenes

    cs.CV 2026-06 unverdicted novelty 6.0

    A hierarchical pipeline generates controllable whole-home 3D scenes from floorplans via LLMs, image models, and VLMs, releasing 300K floorplans and 5K scenes for embodied AI use.

  44. AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding

    cs.RO 2026-06 unverdicted novelty 6.0

    AffordanceVLA proposes a VLA model with affordance-aware modules (Which2Act, Where2Act, How2Act) in a Mixture-of-Transformer trained in three stages to improve robotic manipulation.

  45. SimuScene: Simulation-Ready Compositional 3D Scene Reconstruction from a Single Image

    cs.CV 2026-06 unverdicted novelty 6.0

    SimuScene feeds physics simulation diagnostics back into shape and layout estimation to correct geometric errors and output simulation-ready compositional scenes from single images.

  46. GARDEN: Gravity-Aligned Reconstruction of Disentangled ENvironments from RGB images

    cs.CV 2026-06 unverdicted novelty 6.0

    GARDEN uses gravity alignment and conditional 3D point classification to factorize RGB reconstructions into explicit rigid bodies plus decoupled background for direct physics simulation.

  47. MORPHOS: Autoregressive 4D Generation with Temporal Structured Latents

    cs.CV 2026-06 unverdicted novelty 6.0

    MORPHOS introduces an autoregressive 4D generation method with Temporal Structured Latents (T-SLAT) that produces dynamic 3D assets from videos while handling topological changes and long sequences.

  48. LEGS: Fine-Tuning Teleop-Free VLAs for Humanoid Loco-manipulation in an Embodied Gaussian Splatting World

    cs.RO 2026-05 unverdicted novelty 6.0

    LEGS shows synthetic data from a 3DGS-mesh hybrid simulator trains VLA policies for humanoid pick-and-place that match or exceed human teleoperation performance across multiple backbones and tasks while enabling low-c...

  49. AnyScene: Towards Highly Controllable Driving Scene Generation at Anywhere and Beyond

    cs.RO 2026-05 unverdicted novelty 6.0

    AnyScene is an occupancy-centric framework using a Spatial-Temporal Occupancy Diffusion Transformer and Geometry-Grounded View Expansion to generate controllable driving scenes and videos from BEV layouts.

  50. Perceive-then-Plan: Layout-as-Policy for Monocular 3D Scene Layout Estimation

    cs.CV 2026-05 unverdicted novelty 6.0

    Introduces Layout-as-Policy (LaP) to turn 3D layout estimation into an iterative policy-learning refinement process for better physical coherence.

  51. Learning to Evolve: Multi-modal Interactive Fields for Robust Humanoid Navigation in Dynamic Environments

    cs.RO 2026-05 unverdicted novelty 6.0

    MIF integrates appearance, spatial, and geometry fields with discrepancy detection to raise humanoid relocation success from 12% to 94% in dynamic offices while cutting memory use by 91.4%.

  52. Stream3D: Sequential Multi-View 3D Generation via Evidential Memory

    cs.CV 2026-05 unverdicted novelty 6.0

    Stream3D is a training-free method that maintains a fixed-size evidential memory of past frames to convert frozen view-conditioned 3D generators into consistent streaming generators.

  53. SUGAR: A Scalable Human-Video-Driven Generalizable Humanoid Loco-Manipulation Learning Framework

    cs.RO 2026-05 unverdicted novelty 6.0

    SUGAR turns diverse human videos into deployable humanoid loco-manipulation policies via automated prior extraction, physics refinement, and hierarchical distillation, showing scaling with data volume and zero-shot re...

  54. Focusable Monocular Depth Estimation

    cs.CV 2026-05 unverdicted novelty 6.0

    FocusDepth is a prompt-conditioned framework that fuses SAM3 features into Depth Anything models via Multi-Scale Spatial-Aligned Fusion to improve target-region depth accuracy on the new FDE-Bench.

  55. Pixal3D: Pixel-Aligned 3D Generation from Images

    cs.CV 2026-05 unverdicted novelty 6.0

    Pixal3D performs pixel-aligned 3D generation from images via back-projected multi-scale feature volumes, achieving fidelity close to reconstruction while supporting multi-view and scene synthesis.

  56. MiXR: Harvesting and Recomposing Geometry from Real-World Objects for In-Situ 3D Design

    cs.HC 2026-05 unverdicted novelty 6.0

    MiXR enables in-situ 3D compositional modeling by harvesting real-world geometry in XR and using generative AI to synthesize coherent models from user-defined assemblies.

  57. ClickSeg3D: Few-Click Interactive Segmentation via Semantic Embeddings

    cs.CV 2026-05 unverdicted novelty 6.0

    ClickSeg3D uses a point Transformer encoder and hierarchical mask decoder with semantic embeddings to enable single-pass multi-object 3D interactive segmentation from sparse points, reporting over 20% mIoU gains versu...

  58. ClickSeg3D: Few-Click Interactive Segmentation via Semantic Embeddings

    cs.CV 2026-05 unverdicted novelty 6.0

    A point-Transformer interactive 3D instance segmentation model handles multiple clicks jointly in one pass and reports over 20% mIoU gains versus baselines plus 8-10% cross-dataset improvement for one-click-per-instan...

  59. Creative Robot Tool Use by Counterfactual Reasoning

    cs.RO 2026-05 unverdicted novelty 6.0

    Robots discover causal tool features through VLM suggestions and physics-based counterfactual perturbations in simulation, then transfer manipulation skills via conditioned keypoint matching.

  60. Reconstruction by Generation: 3D Multi-Object Scene Reconstruction from Sparse Observations

    cs.CV 2026-04 unverdicted novelty 6.0

    RecGen achieves state-of-the-art 3D multi-object scene reconstruction from sparse RGB-D views by combining compositional synthetic scene generation with strong 3D shape priors, outperforming SAM3D by 30%+ in shape qua...

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · cited by 98 Pith papers · 17 internal anchors

  1. [1]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Marah Abdin et al. Phi-3 technical report: A highly capable language model locally on your phone, 2024. https://arxiv.org/abs/2404.14219. Thomas Anthony, Zheng Tian, and David Barber. Thinking fast and slow with deep learning and tree search. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 5366–53...

  2. [2]

    Curriculum learning

    Association for Computing Machinery. ISBN 9781605585161. doi: 10.1145/1553374.1553380. https://doi.org/10.1145/1553374.1553380. Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. In European Conference on Computer Vision (ECCV),

  3. [3]

    Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901,

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901,

  4. [4]

    ShapeNet: An Information-Rich 3D Model Repository

    Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012,

  5. [5]

    Annals of Operations Research 134, 19–67

    ISSN 0254-5330. doi: 10.1007/s10479-005-5724-z. Paul E Debevec, Camillo J Taylor, and Jitendra Malik. Modeling and rendering architecture from photographs: A hybrid geometry- and image-based approach. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2,

  6. [6]

    Emerging Properties in Unified Multimodal Pretraining

    Yuxuan Deng, Yujia Zhu, Jiahui Chen, Yuan Wang, Yifei Li, Haotian Li, Junnan Li, Jinsheng Zhang, Wenhui Liu, Yuzheng Zhang, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683,

  7. [7]

    Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, KaShun SHUM, and Tong Zhang

    doi: 10.1509/jmkr.47.2.312. https://doi.org/10.1509/jmkr.47.2.312. Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, KaShun SHUM, and Tong Zhang. RAFT: Reward ranked finetuning for generative foundation model alignment. Transactions on Machine Learning Research,

  8. [8]

    3d arena: An open platform for generative 3d evaluation. arXiv preprint arXiv:2506.18787, 2025

    Dylan Ebert. 3d arena: An open platform for generative 3d evaluation. arXiv preprint arXiv:2506.18787,

  9. [9]

    One Step Diffusion via Shortcut Models

    Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models. arXiv preprint arXiv:2410.12557,

  10. [10]

    arXiv preprint arXiv:2509.07978 (2025)

    Zheng Geng, Nan Wang, Shaocong Xu, Chongjie Ye, Bohan Li, Zhaoxi Chen, Sida Peng, and Hao Zhao. One view, many worlds: Single-image to 3d object meets generative domain randomization for one-shot 6d pose estimation. arXiv preprint arXiv:2509.07978,

  11. [11]

    Rich feature hierarchies for accurate object detection and semantic segmentation

    Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR, abs/1311.2524, 2013. http://arxiv.org/abs/1311.2524. Georgia Gkioxari, Jitendra Malik, and Justin Johnson. Mesh r-cnn. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9785–9795,

  12. [12]

    The Llama 3 Herd of Models

    Aaron Grattafiori et al. The llama 3 herd of models, 2024. https://arxiv.org/abs/2407.21783. Kristen Grauman, Andrew Westbury, et al. Ego4d: Around the world in 3,000 hours of egocentric video. International Journal of Computer Vision (IJCV),

  13. [13]

    Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives

    Kristen Grauman et al. Ego-exo4d: Understanding skilled human activity from first- and third-person perspectives. arXiv preprint arXiv:2401.10889,

  14. [14]

    Reinforced Self-Training (ReST) for Language Modeling

    Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, and Nando de Freitas. Reinforced self-training (rest) for language modeling, 2023. https://arxiv.org/abs/2308.08998. Agrim Gupta, Piotr Dollar, and Ross...

  15. [15]

    Scaling Laws for Transfer

    Danny Hernandez, Jared Kaplan, Tom Henighan, and Sam McCandlish. Scaling laws for transfer. arXiv preprint arXiv:2102.01293,

  16. [16]

    Hunyuan3D 2.1: From Images to High-Fidelity 3D Assets with Production-Ready PBR Material

    Team Hunyuan3D, Shuhui Yang, Mingxin Yang, Yifei Feng, Xin Huang, Sheng Zhang, Zebin He, Di Luo, Haolin Liu, Yunfei Zhao, et al. Hunyuan3d 2.1: From images to high-fidelity 3d assets with production-ready pbr material. arXiv preprint arXiv:2506.15442,

  17. [17]

    Category-specific object reconstruction from a single image

    Abhishek Kar, Shubham Tulsiani, Joao Carreira, and Jitendra Malik. Category-specific object reconstruction from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1966–1974,

  18. [18]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114,

  19. [19]

    arXiv preprint arXiv:2212.06870 (2022)

    Yann Labbé, Lucas Manuelli, Arsalan Mousavian, Stephen Tyree, Stan Birchfield, Jonathan Tremblay, Justin Carpentier, Mathieu Aubry, Dieter Fox, and Josef Sivic. Megapose: 6d pose estimation of novel objects via render & compare. arXiv preprint arXiv:2212.06870,

  20. [20]

    arXiv preprint arXiv:2403.13787

    Nathan Lambert. Reinforcement Learning from Human Feedback. Online, 2025. https://rlhfbook.com. Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi. Rewardbench: Evaluating reward models for language modeling, 2024. https://arxi...

  21. [21]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

    Yanghao Li, Haoqi Fan, Rohit Girdhar, and Alexander Kirillov. Segment anything in videos. arXiv preprint arXiv:2305.06500, 2023a. Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need ii: phi-1.5 technical report, 2023b. https://arxiv.org/abs/2309.05463. Weixin Liang, LILI YU, Liang Luo, S...

  22. [22]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003,

  23. [23]

    Aria Everyday Activities Dataset. arXiv preprint arXiv:2402.13349, 2024

    Zhaoyang Lv, Nicholas Charron, Pierre Moulon, Alexander Gamino, Cheng Peng, Chris Sweeney, Edward Miller, Huixuan Tang, Jeff Meissner, Jing Dong, et al. Aria everyday activities dataset. arXiv preprint arXiv:2402.13349,

  24. [24]

    Karen Liu, Ziwei Liu, Jakob Engel, Renzo De Nardi, and Richard Newcombe

    https://arxiv.org/abs/2406.09905. Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4040–4048,

  25. [25]

    Large Language Models: A Survey

    Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. Large language models: A survey, 2025. https://arxiv.org/abs/2402.06196. Kaixiang Mo, Yuxin Shi, Weiwei Weng, Zhiqiang Zhou, Shuman Liu, Haibo Zhang, and Anxiang Zeng. Mid-training of large language models: A survey, 2025. https://arxiv.org...

  26. [26]

    Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Allyson Ettinger, Michal Guerquin, David Heineman, Hamish Ivison, Pang Wei Koh, Ji...

  27. [27]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193,

  28. [28]

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback....

  29. [29]

    Code Llama: Open Foundation Models for Code

    Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nico...

  30. [31]

    arXiv preprint arXiv:2406.10224

    https://arxiv.org/abs/2406.10224. Xingyuan Sun, Jiajun Wu, Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Tianfan Xue, Joshua B Tenenbaum, and William T Freeman. Pix3d: Dataset and methods for single-image 3d shape modeling. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2974–2983,

  31. [32]

    arXiv preprint arXiv:2405.08448

    Yunhao Tang, Daniel Zhaohan Guo, Zeyu Zheng, Daniele Calandriello, Yuan Cao, Eugene Tarassov, Rémi Munos, Bernardo Ávila Pires, Michal Valko, Yong Cheng, and Will Dabney. Understanding the performance gap between online and offline alignment algorithms. arXiv preprint arXiv:2405.08448,

  32. [33]

    arXiv preprint arXiv:2506.20512

    Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5261–5271, 2025a. Zengzhi Wang, Fan Zhou, Xuefeng Li, and Pengfei Liu. O...

  33. [34]

    arXiv preprint arXiv:2505.17412 (2025)

    Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Yikang Yang, Yajie Bao, Jiachen Qian, Siyu Zhu, Philip Torr, Xun Cao, and Yao Yao. Direct3d-s2: Gigascale 3d generation made easy with spatial sparse attention. arXiv preprint arXiv:2505.17412,

  34. [35]

    Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

    Hu Xu, Nikhila Goyal, Mitchell Wortsman, Gabriel Ilharco, Ozan Sener, Aniruddha Kembhavi, Ali Farhadi, and Rohit Girdhar. Metaclip: How to make clip efficiently. arXiv preprint arXiv:2404.07143,

  35. [36]

    Xianghui Yang, Huiwen Shi, Bowen Zhang, Fan Yang, Jiacheng Wang, Hongxu Zhao, Xinhai Liu, Xinzhou Wang, Qingxiang Lin, Jiaao Yu, et al

    Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10371–10381, 2024a. Xianghui Yang, Huiwen Shi, Bowen Zhang, Fan Yang, Jiacheng Wang, Hongxu Zhao, Xinhai Liu, Xin...

  36. [37]

    Scaling Relationship on Learning Mathematical Reasoning with Large Language Models

    Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou, and Jingren Zhou. Scaling relationship on learning mathematical reasoning with large language models, 2023. https://arxiv.org/abs/2308.01825. Biao Zhang, Jiapeng Tang, Matthias Niessner, and Peter Wonka. 3dshape2vecset: A 3d shape representation for neural fields and...

  37. [38]

    arXiv preprint arXiv:2310.06773

    Junsheng Zhou, Jinsheng Wang, Baorui Ma, Yu-Shen Liu, Tiejun Huang, and Xinlong Wang. Uni3d: Exploring unified 3d representation at scale. arXiv preprint arXiv:2310.06773,

  38. [39]

    Left” (is better), “Right

    Appendix Outline. The appendix provides additional context to the main paper; it contains additional details about the method and the implementation in SAM 3D, as well as ablations. The structure of the appendix is as follows: (i) Data Engine details: A more detailed description of the data collection used in the collection step in Section 3.2.1. (ii) Pretrai...

  39. [40]

    speed of the data flywheel

    as a tool to assist in segmentation. • Stage 2: Annotators on average spend 80 seconds to select the best candidate shape/texture from 6-10 candidate meshes from variable sources. • Stage 3: Annotators on average spend 150 seconds to anchor and orient the matched 3D shape to the 2.5D point cloud. Algorithm 1: SAM 3D Basic Alignment (Texture, Shape) Requi...

  40. [41]

    push into the tail

    is to align the model to match human preference on the distribution of all possible real-world objects. The core algorithm in our data engine generates samples by asking humans to select viable samples from a set of candidate generations. Challenging inputs often result in no viable candidate generations and thus never get selected by humans. However, at an...

  41. [42]

    This linearly scales the annotation time of preference data collection, and the selections themselves become noisier and more random due to choice overload (Diehl and Poynor, 2010)

    However, the primary impediment to increasing N is that, at some point, there are too many choices for a human to compare. This linearly scales the annotation time of preference data collection, and the selections themselves become noisier and more random due to choice overload (Diehl and Poynor, 2010). Failure data Generate 50 seeds VLM tournament rank...

  42. [43]

    or related self-training methods. Under this interpretation, the generative model q is a policy and the data collection step is a policy evaluation, collecting demonstrations D+ and preferences D+/D− through interaction with the environment (annotators). The model improvement step simply updates the current policy using both finetuning and DPO. This re...

  43. [44]

    and RFT (Yuan et al., 2023), although the alignment algorithm in SAM 3D adds explicit expert policies/ensembles and leverages preference supervision. B Pretraining and Mid-Training Data Details B.1 Iso-3DO Data Filtering For the Iso-3DO data used for pretraining, the quality of the 3D meshes can vary substantially, and not all samples exhibit high-fidelit...

  44. [45]

    de-lighted

    and FlyingThings3D (Mayer et al., 2016), we name our first variant Flying Occlusions, reflecting its use of freely inserted synthetic objects. Each training example consists of a natural image onto which we composite two rendered 3D objects: an occluder and an occludee. For each pair, we also compute the final visible mask of the occludee after occlusion. T...
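The compositing step described in this excerpt (two rendered objects pasted onto a natural image, plus the occludee's visible mask) can be sketched in NumPy. This is a minimal illustration under assumptions, not the authors' pipeline: the function name `composite_pair`, the RGBA layer layout, the 0.5 alpha threshold, and the paste order are all hypothetical.

```python
import numpy as np

def composite_pair(background, occludee_rgba, occluder_rgba):
    """Paste an occludee, then an occluder, onto a background image.

    Assumed layout (not from the paper): all arrays are floats in [0, 1];
    the object layers are H x W x 4 (RGB + alpha) and already positioned
    in image coordinates.
    Returns the composite image and the occludee's visible mask.
    """
    out = background.astype(np.float64).copy()
    # Alpha-blend the layers back to front; the occluder is drawn last.
    for layer in (occludee_rgba, occluder_rgba):
        alpha = layer[..., 3:4]
        out = alpha * layer[..., :3] + (1.0 - alpha) * out
    occludee_mask = occludee_rgba[..., 3] > 0.5
    occluder_mask = occluder_rgba[..., 3] > 0.5
    # Visible part of the occludee: its own pixels minus the occluder's.
    visible_mask = occludee_mask & ~occluder_mask
    return out, visible_mask
```

Keeping the full occludee mask alongside the visible one would give both the amodal and the modal target for the same object, which is presumably the point of rendering the pair synthetically.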

  45. [46]

    shape-only), and when freezing shape capabilities and finetuning just for layout

    This proves helpful when training on datasets that contain labels for only one modality (e.g. shape-only), and when freezing shape capabilities and finetuning just for layout. At the same time, MoT still allows for information sharing during the forward pass, through the joint self-attention layers for cross-modal interaction. This shared context is criti...

  46. [47]

    Implementation details. We apply DPO on shape prediction in the Geometry model and the predictions of the Texture & Refinement model

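    $\log \sigma\left(-\beta T\, w(\tau) \cdot \Delta\right)$ (2) (3) where $\Delta = \left(\lVert v^w - v_\theta(x^w_\tau, c, \tau)\rVert_2^2 - \lVert v^w - v_{\mathrm{ref}}(x^w_\tau, c, \tau)\rVert_2^2\right) - \left(\lVert v^l - v_\theta(x^l_\tau, c, \tau)\rVert_2^2 - \lVert v^l - v_{\mathrm{ref}}(x^l_\tau, c, \tau)\rVert_2^2\right)$, where $v^w$ and $v^l$ are the target flow-matching velocities for $x^w_\tau$ and $x^l_\tau$, and $v_\theta$, $v_{\mathrm{ref}}$ are the learned and frozen reference velocity fields, respectively. Implementation details. We apply DPO on shape prediction...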
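To make the preference term concrete, here is a small NumPy sketch of a per-sample loss built from the quantities named in this excerpt. It rests on assumptions: the excerpt starts mid-equation, so the outer form (a negative log-sigmoid) and the grouping of Δ as "winner error gap minus loser error gap" follow the usual diffusion-style DPO objective rather than anything confirmed here. The name `flow_dpo_loss` and the scalar `weight` standing in for w(τ) are hypothetical.

```python
import numpy as np

def flow_dpo_loss(v_w, v_l, pred_w, pred_l, ref_w, ref_l, beta, T, weight):
    """Per-sample loss -log sigmoid(-beta * T * w(tau) * Delta).

    v_w, v_l       : target flow-matching velocities (winner / loser)
    pred_w, pred_l : velocities from the model being trained
    ref_w, ref_l   : velocities from the frozen reference model
    All arrays have shape (batch, dim); `weight` stands in for w(tau).
    """
    def sq_err(a, b):
        return np.sum((a - b) ** 2, axis=-1)

    # Assumed grouping: (winner gap) - (loser gap).
    delta = (sq_err(v_w, pred_w) - sq_err(v_w, ref_w)) - (
        sq_err(v_l, pred_l) - sq_err(v_l, ref_l)
    )
    z = -beta * T * weight * delta
    # -log(sigmoid(z)) equals log(1 + exp(-z)); logaddexp keeps it stable.
    return np.logaddexp(0.0, -z)
```

When the trained model matches the reference on both samples, Δ is zero and the loss is log 2; fitting the preferred sample better than the reference drives Δ negative and the loss down.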

  47. [48]

    de-lighted rendering

    The final model is fine-tuned for approximately 4K iterations using the same objective as in Frans et al. (2024): 75% flow matching and 25% shortcut. When shortcut mode is disabled, the model behaves identically to the original flow matching model. We initialize the step size embedder by setting the weights and bias of its final linear layer to zero, since, un...

  48. [49]

    C.6 Texture & Refinement VAE We make improvements over the original SLAT VAE design in Xiang et al

    We follow the same training objective described in Equation (4). C.6 Texture & Refinement VAE We make improvements over the original SLAT VAE design in Xiang et al. (2025), where features are back-projected to all voxels, including those that are not visible (i.e., occluded) from the current image. This original design choice leads to reduced sharpness in...

  49. [50]

    Many rely on synthetic datasets (Deitke et al., 2023; Chang et al.,

    D Evaluation Current evaluation benchmarks for visually grounded 3D object reconstruction fall short of capturing the complexity of the real world. Many rely on synthetic datasets (Deitke et al., 2023; Chang et al.,

  50. [51]

    This introduces a large visual gap with real-world evaluation conditions and the rich variation of real-world imagery

    where single objects are rendered in isolation, centered against a white background. This introduces a large visual gap with real-world evaluation conditions and the rich variation of real-world imagery. Efforts to move to real data mostly focus on indoor environments (Khanna et al., 2024; Sun et al., 2018; Pan et al., 2023), but these benchmarks heavily ...

  51. [52]

    A” vs. “B

    D.2 Human Preference Set We further expand our evaluation suite to support more rigorous and domain-targeted assessments. While SA-3DAO provides a general and standardized way to measure progress, we want to also capture the challenges of settings where 3D perception is most critical, such as robotic manipulation and egocentric vision. To address this, we...

  52. [53]

    and Uni3D (Zhou et al., 2023). For each generated mesh, we uniformly sample 8,192 surface points to form a point cloud representation, and compute cross-modal similarity between the point cloud features and image features. D.3.2 Layout Metrics Definitions To evaluate single-object pose and compare with existing methods, we employ standard 6D pose estimatio...
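One common way to draw points uniformly from a mesh surface is area-weighted triangle selection followed by uniform barycentric sampling. The sketch below assumes that approach; the excerpt does not say how the 8,192 points are drawn, and the function name `sample_surface_points` is hypothetical.

```python
import numpy as np

def sample_surface_points(vertices, faces, n_points=8192, seed=0):
    """Sample points uniformly over the surface of a triangle mesh.

    vertices: (V, 3) float array; faces: (F, 3) integer index array.
    """
    rng = np.random.default_rng(seed)
    tri = vertices[faces]  # (F, 3, 3): three corners per triangle
    cross = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    areas = 0.5 * np.linalg.norm(cross, axis=1)
    # Pick triangles with probability proportional to their area.
    face_idx = rng.choice(len(faces), size=n_points, p=areas / areas.sum())
    # Square-root trick gives uniform barycentric coordinates.
    r1 = np.sqrt(rng.random(n_points))
    r2 = rng.random(n_points)
    w = np.stack([1.0 - r1, r1 * (1.0 - r2), r1 * r2], axis=1)  # (N, 3)
    return np.einsum("nk,nkd->nd", w, tri[face_idx])
```

Without the area weighting, meshes with many small triangles in one region would be oversampled there, which would bias any point-cloud feature computed downstream.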

  53. [54]

    In this case, when the predicted and ground truth shape are the same, the asymmetric and symmetric versions of ADD-S coincide

    was designed for 6DoF pose estimation using a ground truth CAD model. In this case, when the predicted and ground truth shape are the same, the asymmetric and symmetric versions of ADD-S coincide. In SAM 3D we jointly estimate shape and pose, so we generalize the metric to the symmetric version. • ADD-S @ 0.1: A binary value per-sample indicating whether the ...
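A symmetric ADD-S and a thresholded variant might be computed as below. The details are assumptions, because the excerpt is cut off before the definition: the symmetric form is taken to be the average of the two directional mean nearest-neighbor distances, and the 0.1 threshold is taken to be a fraction of the ground-truth diameter, as is conventional in 6D pose benchmarks. The function names are hypothetical.

```python
import numpy as np

def _mean_nn_dist(src, dst):
    # Mean distance from each point in src to its nearest point in dst.
    d = np.linalg.norm(src[:, None, :] - dst[None, :, :], axis=-1)
    return d.min(axis=1).mean()

def symmetric_add_s(pred_pts, gt_pts):
    """Average of the pred->gt and gt->pred nearest-neighbor distances."""
    return 0.5 * (_mean_nn_dist(pred_pts, gt_pts) + _mean_nn_dist(gt_pts, pred_pts))

def add_s_at(pred_pts, gt_pts, frac=0.1):
    """True if symmetric ADD-S is below frac * ground-truth diameter."""
    pairwise = np.linalg.norm(gt_pts[:, None, :] - gt_pts[None, :, :], axis=-1)
    diameter = pairwise.max()
    return bool(symmetric_add_s(pred_pts, gt_pts) < frac * diameter)
```

Both point sets are expected in the same frame, already transformed by their respective poses. The dense pairwise distance matrices are quadratic in the number of points, so a KD-tree would be the practical choice for large clouds.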

  54. [55]

    and Hunyuan3D-2.1 (Hunyuan3D et al., 2025). We also conduct a texture-only comparison by providing SAM 3D geometry as input to the texture modules of the aforementioned baselines, with the addition of Unitex (Liang et al., 2025b), a model that performs texture prediction given paired image and shape input. We report human preference for SAM 3D over each b...

  55. [56]

    proposals

    using annotator preferences on the Pref Set. We benchmark each component against the alternative model without the change. We note a few themes here: • Augmentation is very important, with lighting augmentation being the most critical. This is expected, given the Mask and Blur augmentations primarily focus on specific challenging cases (poor mask qualit...

  56. [57]

    Further applying normalization to the 6D rotation vectors using the statistics over the training datasets leads to an additional improvement when training the flow matching models

    yields a notable reduction in oriented rotation error, confirming that the 6D formulation provides a smoother optimization landscape more suitable for generative modeling. Further applying normalization to the 6D rotation vectors using the statistics over the training datasets leads to an additional improvement when training the flow matching models. Repr...
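The 6D rotation formulation and its normalization can be illustrated with a short NumPy sketch. It assumes the widely used continuous 6D parameterization (two 3-vectors orthonormalized by Gram-Schmidt) and per-dimension mean/standard-deviation normalization; the excerpt confirms neither detail, and all function names are hypothetical.

```python
import numpy as np

def rot6d_to_matrix(x6):
    """Map a 6D vector to a 3x3 rotation matrix via Gram-Schmidt."""
    a1, a2 = x6[:3], x6[3:]
    b1 = a1 / np.linalg.norm(a1)
    a2 = a2 - np.dot(b1, a2) * b1        # remove the component along b1
    b2 = a2 / np.linalg.norm(a2)
    b3 = np.cross(b1, b2)                # right-handed third axis
    return np.stack([b1, b2, b3], axis=1)  # columns are b1, b2, b3

def matrix_to_rot6d(R):
    """Keep the first two columns of a rotation matrix."""
    return np.concatenate([R[:, 0], R[:, 1]])

def normalize_6d(x6, mean, std):
    # mean/std are assumed to be per-dimension training-set statistics.
    return (x6 - mean) / std

def denormalize_6d(z, mean, std):
    return z * std + mean
```

Because any 6D vector with linearly independent halves maps to a valid rotation, a generative model can predict in the normalized space and project back with `denormalize_6d` followed by `rot6d_to_matrix`.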