Objaverse: A Universe of Annotated 3D Objects
Massive data corpora like WebText, Wikipedia, Conceptual Captions, WebImageText, and LAION have propelled recent dramatic progress in AI. Large neural models trained on such datasets produce impressive results and top many of today's benchmarks. A notable omission within this family of large-scale datasets is 3D data. Despite considerable interest and potential applications in 3D vision, datasets of high-fidelity 3D models continue to be mid-sized with limited diversity of object categories. Addressing this gap, we present Objaverse 1.0, a large dataset of objects with 800K+ (and growing) 3D models with descriptive captions, tags, and animations. Objaverse improves upon present day 3D repositories in terms of scale, number of categories, and in the visual diversity of instances within a category. We demonstrate the large potential of Objaverse via four diverse applications: training generative 3D models, improving tail category segmentation on the LVIS benchmark, training open-vocabulary object-navigation models for Embodied AI, and creating a new benchmark for robustness analysis of vision models. Objaverse can open new directions for research and enable new applications across the field of AI.
Forward citations
Cited by 30 Pith papers
-
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
-
UnfoldArt: Zero-Shot Recovery of Full Articulated 3D Objects from Text or Image
UnfoldArt uses a two-round structured debate between high-level semantic agents and low-level parameter agents, grounded in generated video, to infer articulation and reconstruct full articulated 3D objects including ...
-
UnfoldArt: Zero-Shot Recovery of Full Articulated 3D Objects from Text or Image
UnfoldArt uses multi-agent debate grounded in vision-language and video models to infer articulation parameters and reconstruct full 3D objects including occluded parts from text or image inputs.
-
RigPAPR: Rig-Based Animation of Static Neural Point Clouds from a Fixed-Viewpoint Video
RigPAPR auto-rigs static PAPR point clouds and drives them via direct LBS from monocular fixed-view video, matching baselines at supervised views and exceeding them by 3+ dB PSNR at novel views with cleaner joints.
-
SurGe: Improved Surface Geometry in Point Maps
SurGe improves local surface geometry in feedforward point maps via gradient matching loss and Neighborhood Attention Decoder, topping average rank on eight zero-shot monocular geometry benchmarks for global AbsRel wh...
-
Feedforward 3D Editing Learns from Semantic-Part Transformation
Pxform provides 100K semantic-part 3D edit pairs; PartFlow uses them to deliver feedforward 3D editing with improved fidelity and preservation over prior methods.
-
Hylos: Operability Contracts for Model-Native Spatial Intelligence
Hylos proposes operability contracts and SpatialTransactions to maintain scene-scale state and validate changes in model-generated 3D, shifting evaluation from visual quality to practical operability.
-
MoCapAnything: Unified 3D Motion Capture for Arbitrary Skeletons from Monocular Videos
MoCapAnything reconstructs asset-specific BVH animations from monocular video by predicting 3D joint trajectories then applying constraint-aware inverse kinematics guided by a reference prompt encoder.
-
Objaverse-XL: A Universe of 10M+ 3D Objects
Objaverse-XL supplies over 10 million diverse 3D objects that, when used to render 100 million views, improve zero-shot novel-view synthesis in models such as Zero123.
-
PixGS: Pixel-Space Diffusion for Direct 3D Gaussian Splat Generation
A single-stage pixel-space diffusion model for direct 3D Gaussian Splat generation that bypasses latent compression and adds geometric supervisions to outperform prior multi-stage methods.
-
Feed-forward Motion In-betweening for Any 4D
Proposes a feed-forward keyframe-conditioned in-betweening method for arbitrary 4D meshes using a topology-agnostic VAE and MMDiT-based rectified flow model.
-
Variational Test-time Optimization for Diffusion Synchronization
Derives an optimal control-based variational optimization framework for test-time diffusion synchronization to enhance collaborative generation across modalities.
-
PerceptTwin: Semantic Scene Reconstruction for Iterative LLM Planning and Verification
PerceptTwin creates interactive simulations from open-vocabulary object maps for verifying and refining LLM robot plans, reporting ~39% higher success rates and up to 18% better human verification.
-
MORPHOS: Autoregressive 4D Generation with Temporal Structured Latents
MORPHOS introduces an autoregressive 4D generation method with Temporal Structured Latents (T-SLAT) that produces dynamic 3D assets from videos while handling topological changes and long sequences.
-
Feedforward 3D Editing Learns from Semantic-Part Transformation
Pxform dataset and PartFlow network enable feedforward 3D editing by learning from semantic-part transformations and achieve SOTA on geometric and appearance benchmarks.
-
Toward Visually Realistic Simulation: A Benchmark for Evaluating Robot Manipulation in Simulation
VISER is a new visually realistic simulation benchmark for robot manipulation tasks that uses PBR materials and MLLM-assisted asset generation, achieving 0.92 Pearson correlation with real-world policy performance.
-
Velox: Learning Representations of 4D Geometry and Appearance
Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth...
-
RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots
RoboCasa supplies a large-scale kitchen simulator, generative assets, 100 tasks, and automated data pipelines that produce a clear scaling trend in imitation learning for generalist robots.
-
SceneConductor: 3D Scene Generation from a Single Image with Multi-Agent Orchestration
SceneConductor decomposes single-image 3D scene generation into initialization, environment construction, and multi-agent refinement stages with a geometry-aware layout predictor trained on sparse geometric priors fro...
-
Automatically Improving Simulation Physics for Articulated Objects
A simulator-in-the-loop multi-modal method refines physical properties of incomplete 3D articulated objects to improve simulation stability and downstream robot policy performance.
-
EVA01: Unified Native 3D Understanding and Generation via Mixture-of-Transformers
EVA01 introduces a Mixture-of-Transformers model that natively adds 3D mesh understanding, generation, and multi-turn editing to MLLMs by decoupling understanding and generation experts with shared global self-attention.
-
Predicting 3D structure by latent posterior sampling
A two-stage latent-variable model uses diffusion-based score matching to sample 3D scenes from posteriors conditioned on varied observations via volumetric rendering likelihoods.
-
Predicting 3D structure by latent posterior sampling
A latent-variable approach uses diffusion models on NeRF-encoded scene representations to perform posterior sampling for 3D reconstruction from single-view, multi-view, noisy, sparse-pixel, or sparse-depth inputs.
-
Predicting 3D structure by latent posterior sampling
A two-stage method trains NeRF latents then a diffusion prior to sample posteriors for 3D reconstruction from varied observations including single-view, multi-view, noisy, sparse pixels, and sparse depth.
-
Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation
Asset Harvester converts sparse in-the-wild object observations from AV driving logs into complete simulation-ready 3D assets via data curation, geometry-aware preprocessing, and a SparseViewDiT model that couples spa...
-
UniMesh: Unifying 3D Mesh Understanding and Generation
UniMesh unifies 3D mesh generation and understanding in one model via a Mesh Head interface, Chain of Mesh iterative editing, and an Actor-Evaluator self-reflection loop.
-
STEP-Parts: Geometric Partitioning of Boundary Representations for Large-Scale CAD Processing
STEP-Parts produces tessellation-robust geometric part labels from STEP B-Reps by deterministic merging of same-primitive faces, enabling consistent supervision on 180k+ models.
-
A Survey on Deep Learning Architectures for Point Cloud Classification and Segmentation
A systematic literature survey that categorizes deep learning architectures for point cloud classification, part segmentation, and semantic segmentation, evaluates them on benchmarks, and discusses innovations, limita...
-
Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
Hunyuan3D 2.0 scales flow-based diffusion transformers and texture synthesis models to generate high-resolution textured 3D assets that outperform prior state-of-the-art in geometry, alignment, and texture quality.
-
A Survey on Deep Learning Architectures for Point Cloud Classification and Segmentation
A survey that categorizes deep learning models for point cloud tasks by backbone architecture, evaluates benchmark performance, and outlines challenges and future research directions.