CameraCtrl: Enabling Camera Control for Text-to-Video Generation
Pith reviewed 2026-05-13 02:00 UTC · model grok-4.3
The pith
A plug-and-play module adds precise camera pose control to existing text-to-video diffusion models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce CameraCtrl, enabling accurate camera pose control for video diffusion models. Our approach explores effective camera trajectory parameterization along with a plug-and-play camera pose control module that is trained on top of a video diffusion model, leaving other modules of the base model untouched. Moreover, a comprehensive study on the effect of various training datasets is conducted, suggesting that videos with diverse camera distributions and similar appearance to the base model indeed enhance controllability and generalization. Experimental results demonstrate the effectiveness of CameraCtrl in achieving precise camera control with different video generation models.
What carries the argument
The plug-and-play camera pose control module, which takes parameterized camera trajectories as input and injects the corresponding signals into a frozen video diffusion model during generation.
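To make "plug-and-play on a frozen model" concrete, here is a minimal sketch under stated assumptions. The names `frozen_base_block`, `CameraAdapter`, and `controlled_block` are hypothetical, and both the additive injection and the zero initialization are illustrative choices that this summary does not confirm as the paper's design. The sketch shows only the property the argument relies on: before training, the added module leaves the base output unchanged.

```python
from typing import List

Vector = List[float]


def frozen_base_block(features: Vector) -> Vector:
    """Stand-in for one frozen layer of the base video model (hypothetical)."""
    return [2.0 * x + 1.0 for x in features]


class CameraAdapter:
    """Hypothetical trainable module mapping a camera embedding to a feature offset."""

    def __init__(self, cam_dim: int, feat_dim: int) -> None:
        # Assumption: zero-initialised weights, so the adapter adds nothing before training.
        self.weights = [[0.0] * cam_dim for _ in range(feat_dim)]

    def __call__(self, camera_embedding: Vector) -> Vector:
        return [sum(w * c for w, c in zip(row, camera_embedding)) for row in self.weights]


def controlled_block(features: Vector, camera_embedding: Vector, adapter: CameraAdapter) -> Vector:
    """Add the camera offset to the features, then run the untouched base block."""
    offset = adapter(camera_embedding)
    conditioned = [f + o for f, o in zip(features, offset)]
    return frozen_base_block(conditioned)


adapter = CameraAdapter(cam_dim=6, feat_dim=3)
features = [1.0, 2.0, 3.0]
camera_embedding = [0.1] * 6

# With zero weights the controlled path reproduces the base model exactly.
assert controlled_block(features, camera_embedding, adapter) == frozen_base_block(features)
```

Only `adapter.weights` would be updated during training in this sketch; the base block has no trainable state at all, which is the sense in which the base model stays untouched.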
If this is right
- Precise camera control becomes available for multiple different video diffusion models without retraining them.
- Controllability and generalization improve when the training data includes wide ranges of camera paths and visual styles matching the base model.
- Users can combine text prompts with explicit camera pose sequences to produce videos that follow chosen cinematic movements.
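As an illustration of what an "explicit camera pose sequence" could look like next to a text prompt, the sketch below builds a simple horizontal pan as one rotation matrix per frame. The function names, the dictionary layout, and the prompt are hypothetical. The summary does not describe the paper's actual input format, so this is one plausible encoding rather than the paper's interface.

```python
import math
from typing import Dict, List

Matrix = List[List[float]]


def yaw_rotation(angle_rad: float) -> Matrix:
    """Rotation about the vertical (y) axis by angle_rad."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, 0.0, s],
            [0.0, 1.0, 0.0],
            [-s, 0.0, c]]


def pan_trajectory(num_frames: int, total_yaw_deg: float) -> List[Dict[str, list]]:
    """Evenly spaced yaw rotations from 0 to total_yaw_deg, with no translation."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    poses = []
    for i in range(num_frames):
        frac = i / (num_frames - 1) if num_frames > 1 else 0.0
        angle = math.radians(total_yaw_deg * frac)
        poses.append({"rotation": yaw_rotation(angle), "translation": [0.0, 0.0, 0.0]})
    return poses


# Hypothetical request pairing a prompt with one pose per frame.
request = {
    "prompt": "a lighthouse on a cliff at sunset",
    "camera_poses": pan_trajectory(num_frames=16, total_yaw_deg=30.0),
}
```

Other cinematic moves (dolly, orbit, tilt) would follow the same pattern with a different rotation or a non-zero translation per frame.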
Where Pith is reading between the lines
- The modular design opens the possibility of adding further independent control signals, such as object motion or lighting, on top of the same base model.
- This separation of concerns could support interactive applications where users adjust camera paths after an initial generation pass.
- The emphasis on dataset camera diversity suggests that future work might systematically catalog and release camera-annotated video collections to boost similar control methods.
Load-bearing premise
Two things are assumed: that videos with diverse camera distributions and an appearance similar to the base model can be collected in sufficient quantity, and that the added control module will not reduce the base model's visual quality.
What would settle it
A side-by-side comparison where the generated video frames do not exhibit the exact camera motion specified in the input trajectory, or where image quality metrics fall below those of the unmodified base model.
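One way to quantify "does not exhibit the exact camera motion" is a per-frame rotation error between the requested poses and poses estimated from the generated frames. The sketch below uses the standard geodesic angle between two rotation matrices. It assumes that some external pose estimator supplies `estimates`, and the function names are hypothetical. This summary does not state which metric the paper reports.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def rotation_error_rad(r_target: Matrix, r_estimated: Matrix) -> float:
    """Geodesic angle between two 3x3 rotations: arccos((trace(Rt^T Re) - 1) / 2)."""
    # trace(Rt^T Re) equals the element-wise product summed over all entries.
    trace = sum(r_target[i][j] * r_estimated[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp against rounding error
    return math.acos(cos_angle)


def trajectory_rotation_error(targets: List[Matrix], estimates: List[Matrix]) -> float:
    """Sum of per-frame rotation errors over a whole trajectory, in radians."""
    if len(targets) != len(estimates):
        raise ValueError("trajectories must have the same number of frames")
    return sum(rotation_error_rad(t, e) for t, e in zip(targets, estimates))


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
quarter_turn = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]  # 90 degree yaw

assert rotation_error_rad(identity, identity) == 0.0
assert abs(rotation_error_rad(identity, quarter_turn) - math.pi / 2) < 1e-12
```

A matching translation error, and a quality metric computed with and without the module, would be needed to cover the second half of the test described above.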
Original abstract
Controllability plays a crucial role in video generation, as it allows users to create and edit content more precisely. Existing models, however, lack control of camera pose that serves as a cinematic language to express deeper narrative nuances. To alleviate this issue, we introduce CameraCtrl, enabling accurate camera pose control for video diffusion models. Our approach explores effective camera trajectory parameterization along with a plug-and-play camera pose control module that is trained on top of a video diffusion model, leaving other modules of the base model untouched. Moreover, a comprehensive study on the effect of various training datasets is conducted, suggesting that videos with diverse camera distributions and similar appearance to the base model indeed enhance controllability and generalization. Experimental results demonstrate the effectiveness of CameraCtrl in achieving precise camera control with different video generation models, marking a step forward in the pursuit of dynamic and customized video storytelling from textual and camera pose inputs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CameraCtrl, a plug-and-play camera pose control module for text-to-video diffusion models. It proposes a camera trajectory parameterization, trains the module atop frozen base video diffusion models, and conducts a dataset study claiming that videos with diverse camera distributions and similar appearance to the base model improve controllability and generalization. Experiments are said to demonstrate accurate camera control across different video generation models from text and pose inputs.
Significance. If the central claims hold, this would represent a meaningful advance in controllable video generation by adding precise cinematic camera control without full model retraining. The plug-and-play design and dataset ablation study are positive elements that could facilitate adoption if supported by rigorous evidence of preserved base-model quality.
major comments (2)
- [Abstract] The central claim that the control module enables accurate camera control 'leaving other modules of the base model untouched' is load-bearing, yet the abstract provides no quantitative metrics (FVD, CLIP score, or similar) comparing base-model quality before versus after module insertion and training. This omission prevents verification that distribution shift has not occurred.
- [Abstract / dataset study] The claim that 'videos with diverse camera distributions and similar appearance to the base model indeed enhance controllability and generalization' is presented as a finding, but no ablation tables, quantitative controllability scores, or error analysis are referenced to support the cross-dataset conclusions.
minor comments (1)
- The camera trajectory parameterization is described at a high level; a dedicated subsection with explicit equations for pose encoding and injection into the diffusion process would improve reproducibility.
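For orientation, the kind of explicit equation this comment asks for might look like the following per-pixel ray encoding in Plücker coordinates. This is an illustrative assumption, since the review above does not state which parameterization the paper adopts. The symbols are introduced here: $\mathbf{K}$ is the intrinsic matrix, $\mathbf{R}$ the camera-to-world rotation, $\mathbf{o}$ the camera center in world coordinates, and $(u, v)$ a pixel location.

```latex
\mathbf{d}_{u,v} = \mathbf{R}\,\mathbf{K}^{-1}\,[u,\; v,\; 1]^{\top},
\qquad
\hat{\mathbf{d}}_{u,v} = \frac{\mathbf{d}_{u,v}}{\lVert \mathbf{d}_{u,v} \rVert},
\qquad
\mathbf{p}_{u,v} = \left( \mathbf{o} \times \hat{\mathbf{d}}_{u,v},\; \hat{\mathbf{d}}_{u,v} \right) \in \mathbb{R}^{6}
```

Stacking $\mathbf{p}_{u,v}$ over all pixels and frames would give a dense tensor with six channels per pixel, a form that a convolutional or attention-based control module can consume directly.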
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and constructive suggestions. We address the major comments below and have revised the manuscript accordingly to strengthen the presentation of our results.
point-by-point responses
-
Referee: [Abstract] The central claim that the control module enables accurate camera control 'leaving other modules of the base model untouched' is load-bearing, yet the abstract provides no quantitative metrics (FVD, CLIP score, or similar) comparing base-model quality before versus after module insertion and training. This omission prevents verification that distribution shift has not occurred.
Authors: We agree that including quantitative metrics in the abstract would better support this central claim. In the revised manuscript, we will update the abstract to reference the FVD and CLIP score comparisons from our experiments, which demonstrate that the base model quality is largely preserved after inserting and training the control module. These metrics are reported in detail in Section 4.1 of the paper. Revision: yes.
-
Referee: [Abstract / dataset study] The claim that 'videos with diverse camera distributions and similar appearance to the base model indeed enhance controllability and generalization' is presented as a finding, but no ablation tables, quantitative controllability scores, or error analysis are referenced to support the cross-dataset conclusions.
Authors: The dataset ablation study with quantitative results, including controllability scores and error analysis across different datasets, is provided in Section 4.3 with supporting tables. To address this comment, we will revise the abstract to more explicitly summarize the key quantitative findings from this study, such as improved generalization on diverse camera trajectories. This will help readers connect the claim to the evidence without requiring them to immediately consult the full text. Revision: yes.
Circularity Check
No circularity in claimed derivation or results
full rationale
The paper presents an empirical engineering contribution: a trainable plug-and-play control module inserted into a frozen video diffusion backbone and trained via standard supervised learning on external video datasets. No equations, predictions, or first-principles claims are offered that reduce to fitted parameters or self-citations by construction. The central statements (accurate camera control, dataset effects on generalization) are supported by experimental comparisons rather than definitional or self-referential loops. This is the common case of a self-contained applied ML paper.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: A separate control module can be trained to steer camera pose while leaving the base video diffusion model parameters untouched.
- domain assumption: Dataset characteristics (diverse camera distributions and appearance similarity to base model) directly determine controllability and generalization.
Forward citations
Cited by 60 Pith papers
-
MemLearner: Learning to Query Context memory for Video World Models
MemLearner introduces a learning-based adaptive context query method using query tokens in video world models to improve long-term scene consistency over rule-based retrieval.
-
WarpHammer: Densifying Scene Warps with 3D Object Priors for Extreme View Synthesis
WarpHammer densifies scene warps with 3D object priors from generative models and fuses pose-unknown auxiliary views via multi-view geometry to enable stable extreme novel view synthesis.
-
MemoBench: Benchmarking World Modeling in Dynamically Changing Environments
MemoBench is a new diagnostic benchmark with automated and VQA metrics that evaluates memory consistency in video models under disappear-and-reappear in dynamic environments.
-
MemoBench: Benchmarking World Modeling in Dynamically Changing Environments
MemoBench curates 360 ground-truth clips and an evaluation suite to diagnose memory consistency failures in video models when objects change state while out of view.
-
MemoBench: Benchmarking World Modeling in Dynamically Changing Environments
MemoBench is a new diagnostic benchmark with 360 synthetic and real clips plus VQA evaluation that tests memory consistency in video models under the disappear-and-reappear paradigm in dynamically changing environments.
-
Look-Before-Move: Narrative-Grounded World Visual Attention in Dynamic 3D Story Worlds
Look-Before-Move separates narrative observation specification from camera motion via semantic contracts, Monte Carlo viewpoint search, and trajectory grounding, tested on a new 50-story 3D benchmark.
-
Look-Before-Move: Narrative-Grounded World Visual Attention in Dynamic 3D Story Worlds
Look-Before-Move is a framework that converts narrative intent into Semantic Observation Contracts, uses Monte Carlo Viewpoint Search for feasible viewpoints, and applies Semantic Trajectory Grounding for coherent cam...
-
TryOnCrafter: Unleashing Camera Trajectories for Realistic Video Virtual Try-on via a Renderable 4D Try-on Proxy
TryOnCrafter is the first DiT-based framework for camera-controllable video virtual try-on via a renderable 4D try-on proxy distilled from 2D priors into 3DGS avatar animated with SMPL-X.
-
GeoT2V-Bench: Benchmarking 3D Consistency in Text-to-Video Models via 3D Reconstruction
GeoT2V-Bench is a reconstruction-based benchmark that reveals disagreements among multiple metrics for 3D consistency in text-to-video models.
-
Geo-Align: Video Generation Alignment via Metric Geometry Reward
Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.
-
Preserve, Reveal, Expand: Faithful 4D Video Editing with Region-Aware Conditioning
PREX decomposes target 4D video volumes into Preserve, Reveal, and Expand roles with a region-aware adapter on a frozen diffusion backbone, trained via proxy tasks, and introduces the PREBench benchmark to reduce regi...
-
DEVIS-GRPO: Unleashing GRPO on Dynamic Extreme View Synthesis
DEVIS-GRPO applies online policy gradients with an accumulative small-to-large view sampling strategy and multi-level rewards to improve trajectory-controlled extreme view video generation, reporting gains on Kubric-4...
-
Probing into Camera Control of Video Models
A training-free method reformulates camera control as geometric displacement fields applied via differentiable latent resampling, enabling control and bias probing in video diffusion models.
-
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.
-
GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion
GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.
-
MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics
MoCam unifies static and dynamic novel view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion denoising process.
-
MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics
MoCam uses structured denoising dynamics in diffusion models to temporally decouple geometric alignment from appearance refinement, enabling unified novel view synthesis that outperforms prior methods on imperfect poi...
-
$h$-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement
h-control introduces block-conditional pseudo-Gibbs refinement for training-free camera control in flow-matching video generators, achieving superior FVD scores on RealEstate10K and DAVIS benchmarks.
-
Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting
Reshoot-Anything trains a diffusion transformer on pseudo multi-view triplets created by cropping and warping monocular videos to achieve temporally consistent video reshooting with robust camera control on dynamic scenes.
-
WorldMark: A Unified Benchmark Suite for Interactive Video World Models
WorldMark is the first public benchmark that standardizes scenes, trajectories, and control interfaces across heterogeneous interactive image-to-video world models.
-
MultiWorld: Scalable Multi-Agent Multi-View Video World Models
MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.
-
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.
-
Geometrically Consistent Multi-View Scene Generation from Freehand Sketches
A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in re...
-
Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation
Prompt Relay is an inference-time plug-and-play method that penalizes cross-attention to enforce temporal prompt alignment and reduce semantic entanglement in multi-event video generation.
-
Novel View Synthesis as Video Completion
Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.
-
MotionScape: A Large-Scale Real-World Highly Dynamic UAV Video Dataset for World Models
MotionScape is a large-scale UAV video dataset with highly dynamic 6-DoF motions, geometric trajectories, and semantic annotations to train world models that better simulate complex 3D dynamics under large viewpoint changes.
-
MoRight: Motion Control Done Right
MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...
-
OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control
OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.
-
SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras
SparseCam4D achieves spatio-temporally consistent high-fidelity 4D reconstruction from sparse cameras via a Spatio-Temporal Distortion Field that corrects inconsistencies in generative observations.
-
Setting the Stage: Text-Driven Scene-Consistent Image Generation
A new data pipeline using real photos, entity removal, and image-to-video models plus a cross-view attention loss enables text-driven generation of actors in reference scenes with improved alignment.
-
StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space
A viewpoint-conditioned diffusion model generates stereo image pairs from monocular input in a canonical rectified space without using depth or explicit warping.
-
GimbalDiffusion: Gravity-Aware Camera Control for Video Generation
GimbalDiffusion adds gravity-referenced absolute camera control and null-pitch conditioning to text-to-video diffusion models, trained on full-sphere panoramic data, to support extreme trajectories and reduce prompt e...
-
HandsOnWorld: Unconstrained Egocentric Video Generation with Camera-Disentangled Hand Control
HandsOnWorld creates a hand-controlled egocentric video generator from unconstrained monocular video via a new EgoVid-Pro dataset from monocular reconstruction and a Plücker Hand Map that disentangles camera and hand motion.
-
NeoMap: Training-free Novel-View Synthesis from Single Images and Videos
NeoMap introduces a training-free framework using convergent manifold alternating projection iterations to extract high-fidelity novel views from pre-trained video models, outperforming prior methods on standard benchmarks.
-
World from Motion: Generative Dynamic Gaussian Reconstruction from Monocular Video
A generative video model conditioned on pixel-aligned 3D renderings produces consistent dynamic 3D Gaussian splats from monocular video and sets new SOTA in 4D reconstruction.
-
SIFT: Self-Imagination Fine-Tuning for Physically Plausible Motion in Video Diffusion Models
SIFT fine-tunes video diffusion models on self-generated videos using motion-aware supervision to reduce motion entanglement and improve physical plausibility.
-
MemoBench: Benchmarking World Modeling in Dynamically Changing Environments
MemoBench curates 360 clips and an evaluation suite to test video models on recovering updated object states after disappear-and-reappear in changing environments.
-
Neural Voxel Dynamics: Learning Implicit 3D Physics via Volumetric Feature Advection
A self-supervised framework learns implicit 3D physics by lifting V-JEPA features into voxels and performing volumetric feature advection conditioned on actions.
-
NavWM: A Unified Navigation World Model for Foresight-Driven Planning
NavWM unifies latent world tokens and anchor-based multimodal trajectory forecasting into a closed-loop planner that improves future state generation and zero-shot navigation.
-
Current World Models Lack a Persistent State Core
Current world models fail to evolve internal state when unobserved and instead resume scenes at the last observed state, as diagnosed by the new WRBench benchmark across 23 models and 9600 videos.
-
TriMotion: Modality-Agnostic Camera Control for Video Generation
TriMotion is a modality-agnostic framework that maps video, pose, and text descriptions of the same camera trajectory into a shared motion embedding space, trained with a new triplet dataset and latent consistency obj...
-
PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory
PermaVid disentangles spatial context into semantic appearance and geometric structure via multi-modal memory banks and edit-aware updates to maintain long-term consistency in video generation after edits.
-
OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data
OmniDirector introduces a grid-based camera representation and hierarchical prompt agent for multi-shot camera cloning in video diffusion models trained on million-scale unpaired data.
-
Latent Spatial Memory for Video World Models
Mirage stores and queries 3D scene information in diffusion latent space via depth-guided lifting and warping, yielding 10.57× faster generation and 55× smaller memory than explicit RGB point-cloud baselines while rea...
-
Prisma-World: Camera-Controllable Multi-Agent Video World Model
Prisma-World is a diffusion-based multi-agent video model that uses joint full-attention, multi-agent RoPE, and relative camera geometry injection plus curriculum training to produce consistent cross-view videos from ...
-
Streaming Video Generation with Streaming Force Control
StreamForce presents a unified causal model for force-controllable streaming video generation using a new force representation and distillation pipeline, claiming SOTA force adherence and 16.6 FPS performance.
-
Cosmos 3: Omnimodal World Models for Physical AI
Cosmos 3 presents a unified omnimodal world model family based on mixture-of-transformers that processes language, vision, audio, and action for Physical AI applications.
-
Geometry-Aware Implicit Memory for Video World Models
GIM-World adds a camera-queryable geometry distillation head and pruning rule to implicit memory in video world models, claiming better long-horizon geometric consistency on the MIND benchmark than explicit and implic...
-
AlbedoEdit: Unified Instance-Level Video Editing with Albedo Guidance
AlbedoEdit fine-tunes video foundation models to translate RGB videos into edited versions conditioned on user-edited first-frame albedo maps, trained on a new synthetic paired dataset for insertion, removal, and text...
-
E$^3$C: Video Generation with 3D Environmental Memory and Ego-Exo Human Pose Control
E³C is a video diffusion model that disentangles persistent 3D scene structure via point-cloud memory from human dynamics via ego-exo pose controls for improved egocentric video generation on the Nymeria dataset.
-
GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation
GeoFlow adds a geometry-consistency reward based on rigid camera flow and object appearance preservation, integrated via reinforcement fine-tuning to improve geometric coherence in video generation.
-
Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video
Warp-as-History enables zero-shot camera trajectory following in frozen video models by supplying camera-warped pseudo-history, with single-video LoRA fine-tuning improving generalization to unseen videos.
-
ReactiveGWM: Steering NPC in Reactive Game World Models
ReactiveGWM introduces a decoupled diffusion architecture for player-NPC interactions that learns game-agnostic response logic for zero-shot strategy transfer across games.
-
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
R-DMesh proposes a VAE-based disentanglement of base mesh, motion trajectories, and rectification offset plus Triflow Attention and rectified-flow diffusion to produce 4D meshes aligned to video despite initial pose mismatch.
-
UniFixer: A Universal Reference-Guided Fixer for Diffusion-Based View Synthesis
UniFixer is a universal reference-guided framework that fixes spatial, temporal, and backbone-related degradations in diffusion-based view synthesis via coarse-to-fine modules and achieves zero-shot SOTA results on no...
-
$h$-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement
h-control augments hard-replacement guidance with block-conditional pseudo-Gibbs refinement on unobserved latent sites and adaptive 3D patch freezing to achieve superior FVD on RealEstate10K and DAVIS.
-
RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control
RealCam is a causal autoregressive model for real-time camera-controlled video-to-video generation, using cross-frame in-context teacher distillation and loop-closed data augmentation to achieve high fidelity and consistency.
-
Vista4D: Video Reshooting with 4D Point Clouds
Vista4D re-synthesizes dynamic videos from new viewpoints by grounding them in a 4D point cloud built with static segmentation and multiview training.
-
Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation
A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.
-
UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.
Reference graph
Works this paper leans on
-
[3]
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
Glide: Towards photorealistic image generation and editing with text-guided diffusion models , author=. arXiv preprint arXiv:2112.10741 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Hierarchical text-conditional image generation with clip latents , author=. arXiv preprint arXiv:2204.06125 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
Advances in Neural Information Processing Systems , volume=
Photorealistic text-to-image diffusion models with deep language understanding , author=. Advances in Neural Information Processing Systems , volume=
-
[6]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
High-resolution image synthesis with latent diffusion models , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[7]
eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
ediffi: Text-to-image diffusion models with an ensemble of expert denoisers , author=. arXiv preprint arXiv:2211.01324 , year=
work page internal anchor Pith review arXiv
-
[8]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Vector quantized diffusion model for text-to-image synthesis , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[9]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Multi-concept customization of text-to-image diffusion , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[11]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
Versatile diffusion: Text, images and variations all in one diffusion model , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
-
[12]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[13]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Advancing high-resolution video-language representation with large-scale video transcriptions , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[14]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
Frozen in time: A joint video and image encoder for end-to-end retrieval , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
-
[16]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Align your latents: High-resolution video synthesis with latent diffusion models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[17]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
-
[22]
Dreampose: Fashion image-to-video synthesis via stable diffusion,
Dreampose: Fashion image-to-video synthesis via stable diffusion , author=. arXiv preprint arXiv:2304.06025 , year=
-
[25]
Advances in Neural Information Processing Systems , volume=
Light field networks: Neural scene representations with single-evaluation rendering , author=. Advances in Neural Information Processing Systems , volume=
-
[26]
arXiv preprint arXiv:2304.13681 , year=
Ray Conditioning: Trading Photo-consistency for Photo-realism in Multi-view Image Generation , author=. arXiv preprint arXiv:2304.13681 , year=
-
[27]
arXiv preprint arXiv:2312.04551 , year=
Free3D: Consistent Novel View Synthesis without 3D Representation , author=. arXiv preprint arXiv:2312.04551 , year=
-
[29]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Objaverse: A universe of annotated 3d objects , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[30]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Mvimgnet: A large-scale dataset of multi-view images , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[32]
Advances in Neural Information Processing Systems , volume=
Videocomposer: Compositional video synthesis with motion controllability , author=. Advances in Neural Information Processing Systems , volume=
-
[34]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
Adding conditional control to text-to-image diffusion models , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
-
[35]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
Infinite nature: Perpetual view generation of natural scenes from a single image , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
-
[36]
Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
Scannet: Richly-annotated 3d reconstructions of indoor scenes , author=. Proceedings of the IEEE conference on computer vision and pattern recognition , pages=
-
[37]
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
Learning the depths of moving people by watching frozen people , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=
-
[38]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
DreamPose: Fashion Video Synthesis with Stable Diffusion , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
-
[39]
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
Mm-diffusion: Learning multi-modal diffusion models for joint audio and video generation , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
-
[41]
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation , author=. 2023 , eprint=
work page 2023
-
[44]
Advances in Neural Information Processing Systems , volume=
Denoising diffusion probabilistic models , author=. Advances in Neural Information Processing Systems , volume=
-
[45]
Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
Scalable diffusion models with transformers , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
-
[46]
MagicVideo: Efficient Video Generation With Latent Diffusion Models. arXiv preprint arXiv:2211.11018.
-
[54]
Motiondirector: Motion customization of text-to-video diffusion models. arXiv preprint arXiv:2310.08465.
-
[60]
Text2video-zero: Text-to-image diffusion models are zero-shot video generators. In IEEE International Conference on Computer Vision (ICCV).
-
[61]
AADiff: Audio-Aligned Video Synthesis with Text-to-Image Diffusion. arXiv preprint arXiv:2305.04001.
-
[62]
Generative Disco: Text-to-Video Generation for Music Visualization. arXiv preprint arXiv:2304.08551.
-
[63]
Sch\". Structure-from-Motion Revisited.
-
[64]
MovieFactory: Automatic Movie Creation from Text using Large Generative Models for Language and Images. arXiv preprint arXiv:2306.07257.
-
[65]
Conditional Image-to-Video Generation with Latent Flow Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
-
[68]
SG 161222 civitai.
-
[69]
Boris Belousov. SO3 rotation distance.
-
[71]
LAVIS: A One-stop Library for Language-Vision Intelligence
Dongxu Li, Junnan Li, Hung Le, Guangsen Wang, Silvio Savarese, and Steven C.H. Hoi. LAVIS: A One-stop Library for Language-Vision Intelligence. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), 2023
-
[73]
Seine: Short-to-long video diffusion model for generative transition and prediction. In The Twelfth International Conference on Learning Representations.
-
[74]
Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
-
[76]
FVD: A new metric for video generation. https://openreview.net/forum?id=rylgEULtdN
-
[77]
Learning transferable visual models from natural language supervision. In International conference on machine learning, 2021.
-
[79]
Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems.
-
[80]
Compositional 3D Scene Generation using Locally Conditioned Diffusion. ArXiv.
-
[81]
Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition.
-
[83]
Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks. 2024.
-
[84]
Raft: Recurrent all-pairs field transforms for optical flow. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, 2020.
-
[85]
CAT3D: Create Anything in 3D with Multi-View Diffusion Models. Advances in Neural Information Processing Systems.
-
[87]
Training-free Camera Control for Video Generation. arXiv preprint arXiv:2406.10126, 2024.
-
[93]
Snap video: Scaled spatiotemporal transformers for text-to-video synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
-
[95]
Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. 2024.
-
[96]
Vd3d: Taming large video diffusion transformers for 3d camera control
Sherwin Bahmani, Ivan Skorokhodov, Aliaksandr Siarohin, Willi Menapace, Guocheng Qian, Michael Vasilkovsky, Hsin-Ying Lee, Chaoyang Wang, Jiaxu Zou, Andrea Tagliasacchi, et al. Vd3d: Taming large video diffusion transformers for 3d camera control. arXiv preprint arXiv:2407.12781, 2024
-
[97]
Frozen in time: A joint video and image encoder for end-to-end retrieval
Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1728–1738, 2021
-
[98]
Lumiere: A space-time diffusion model for video generation
Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, et al. Lumiere: A space-time diffusion model for video generation. arXiv preprint arXiv:2401.12945, 2024
-
[99]
Boris Belousov. SO3 rotation distance. http://www.boris-belousov.net/2016/12/01/quat-dist/
-
[100]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023a
-
[101]
Align your latents: High-resolution video synthesis with latent diffusion models
Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22563–22575, 2023b
-
[102]
BradCatt. Toonyou. https://civitai.com/models/30240/toonyou
-
[103]
Video generation models as world simulators
Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. https://openai.com/index/video-generation-models-as-world-simulators/, 2024. URL https://openai.com/research/video-generation-models-as-world-...
-
[104]
Videocrafter1: Open diffusion models for high-quality video generation
Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter1: Open diffusion models for high-quality video generation, 2023a
-
[105]
Tsai-Shien Chen, Chieh Hubert Lin, Hung-Yu Tseng, Tsung-Yi Lin, and Ming-Hsuan Yang. Motion-conditioned diffusion model for controllable video synthesis. arXiv preprint arXiv:2304.14404, 2023b
-
[106]
Weifeng Chen, Jie Wu, Pan Xie, Hefeng Wu, Jiashi Li, Xin Xia, Xuefeng Xiao, and Liang Lin. Control-a-video: Controllable text-to-video generation with diffusion models. arXiv preprint arXiv:2305.13840, 2023c
-
[107]
Seine: Short-to-long video diffusion model for generative transition and prediction
Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Seine: Short-to-long video diffusion model for generative transition and prediction. In The Twelfth International Conference on Learning Representations, 2023d
-
[108]
Boosting camera motion control for video diffusion transformers
Soon Yau Cheong, Duygu Ceylan, Armin Mustafa, Andrew Gilbert, and Chun-Hao Paul Huang. Boosting camera motion control for video diffusion transformers. arXiv preprint arXiv:2410.10802, 2024
-
[109]
SG 161222 civitai. Realistic vision. https://civitai.com/models/4201/realistic-vision-v60-b1
-
[110]
Objaverse: A universe of annotated 3d objects
Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13142–13153, 2023
-
[111]
Structure and content-guided video synthesis with diffusion models
Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7346–7356, 2023
-
[112]
Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Sparsectrl: Adding sparse controls to text-to-video diffusion models. arXiv preprint arXiv:2311.16933, 2023a
-
[113]
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023b
-
[114]
Photorealistic video generation with diffusion models
Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models. arXiv preprint arXiv:2312.06662, 2023
-
[115]
Latent Video Diffusion Models for High-Fidelity Long Video Generation
Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity video generation with arbitrary lengths. arXiv preprint arXiv:2211.13221, 2022
-
[116]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022
-
[117]
Denoising diffusion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020
-
[118]
Imagen Video: High Definition Video Generation with Diffusion Models
Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a
-
[119]
Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv:2204.03458, 2022b
-
[120]
CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022