
arxiv: 2404.02101 · v2 · submitted 2024-04-02 · 💻 cs.CV

CameraCtrl: Enabling Camera Control for Text-to-Video Generation

Pith reviewed 2026-05-13 02:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-video generation · camera pose control · video diffusion models · plug-and-play module · camera trajectory · cinematic video · controllable generation

The pith

A plug-and-play module adds precise camera pose control to existing text-to-video diffusion models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CameraCtrl to give text-to-video AI models control over camera movements such as pans, tilts, and zooms. Current diffusion-based video generators produce content from text but cannot follow user-specified camera paths, which limits their use for narrative or cinematic results. The solution adds a separate control module trained on video data that already contains varied camera trajectories while keeping the original model weights unchanged. Experiments show that datasets with wide camera variety and visual styles close to the base model improve both accuracy and how well the control transfers to new prompts. This setup lets users supply both text and a camera trajectory to generate videos with directed motion.

Core claim

We introduce CameraCtrl, enabling accurate camera pose control for video diffusion models. Our approach explores effective camera trajectory parameterization along with a plug-and-play camera pose control module that is trained on top of a video diffusion model, leaving other modules of the base model untouched. Moreover, a comprehensive study on the effect of various training datasets is conducted, suggesting that videos with diverse camera distributions and similar appearance to the base model indeed enhance controllability and generalization. Experimental results demonstrate the effectiveness of CameraCtrl in achieving precise camera control with different video generation models.

What carries the argument

The plug-and-play camera pose control module, which takes parameterized camera trajectories as input and injects the corresponding signals into a frozen video diffusion model during generation.
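
The plug-and-play idea can be illustrated with a toy sketch. Everything here is an assumption for illustration: the names `FrozenBaseLayer`, `CameraAdapter`, and `controlled_forward` are hypothetical, the additive injection is one possible design, and the zero initialization is a common adapter convention rather than something this page confirms for CameraCtrl.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]


def linear(weights, x: Vector) -> Vector:
    """Multiply a weight matrix (one row per output) by an input vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


@dataclass(frozen=True)
class FrozenBaseLayer:
    """Stand-in for one pretrained layer; frozen, so it cannot be reassigned."""
    weights: Tuple[Tuple[float, ...], ...]

    def __call__(self, x: Vector) -> Vector:
        return linear(self.weights, x)


@dataclass
class CameraAdapter:
    """Hypothetical trainable module mapping camera features to a residual."""
    weights: List[Vector]

    def __call__(self, camera_features: Vector) -> Vector:
        return linear(self.weights, camera_features)


def controlled_forward(base: FrozenBaseLayer, adapter: CameraAdapter,
                       x: Vector, camera_features: Vector) -> Vector:
    """Add the adapter's camera residual to the unchanged base output."""
    base_out = base(x)
    residual = adapter(camera_features)
    return [b + r for b, r in zip(base_out, residual)]


base = FrozenBaseLayer(weights=((1.0, 0.0), (0.0, 2.0)))
# Assumption: a zero-initialized adapter, so the base output is reproduced at the start.
adapter = CameraAdapter(weights=[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
features = [3.0, 4.0]
camera = [0.1, 0.2, 0.3]
initial_output = controlled_forward(base, adapter, features, camera)

# Only the adapter is updated during "training"; the base weights stay as they were.
adapter.weights[0][0] = 1.0
updated_output = controlled_forward(base, adapter, features, camera)
```

In this sketch the base layer's behavior is recoverable at any time by dropping the adapter, which is the property the "plug-and-play" label suggests.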

If this is right

  • Precise camera control becomes available for multiple different video diffusion models without retraining them.
  • Controllability and generalization improve when the training data includes wide ranges of camera paths and visual styles matching the base model.
  • Users can combine text prompts with explicit camera pose sequences to produce videos that follow chosen cinematic movements.
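
To make "explicit camera pose sequences" concrete, here is a minimal sketch of how a user might describe a slow pan as one pose per frame. The helper names `yaw_pose` and `pan_trajectory`, the camera-to-world convention, the rotation sign, and the `request` dictionary layout are all assumptions for illustration; the page does not specify the input format the model expects.

```python
import math
from typing import List

Matrix = List[List[float]]


def yaw_pose(angle_rad: float) -> Matrix:
    """4x4 pose for a rotation about the vertical (y) axis, with no translation."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [
        [c, 0.0, s, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [-s, 0.0, c, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]


def pan_trajectory(num_frames: int, total_degrees: float) -> List[Matrix]:
    """Evenly spaced yaw poses from 0 to total_degrees, one per frame."""
    if num_frames < 2:
        raise ValueError("need at least two frames")
    step = math.radians(total_degrees) / (num_frames - 1)
    return [yaw_pose(i * step) for i in range(num_frames)]


# Hypothetical request pairing a text prompt with a 16-frame, 30-degree pan.
request = {
    "prompt": "a lighthouse at dusk",
    "camera_poses": pan_trajectory(16, 30.0),
}
```

A tilt or a dolly move would follow the same pattern with a different rotation axis or a non-zero translation column.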

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The modular design opens the possibility of adding further independent control signals, such as object motion or lighting, on top of the same base model.
  • This separation of concerns could support interactive applications where users adjust camera paths after an initial generation pass.
  • The emphasis on dataset camera diversity suggests that future work might systematically catalog and release camera-annotated video collections to boost similar control methods.

Load-bearing premise

Videos with diverse camera distributions and appearance similar to the base model can be collected in sufficient quantity and the added control module will not reduce the base model's visual quality.

What would settle it

A side-by-side comparison where the generated video frames do not exhibit the exact camera motion specified in the input trajectory, or where image quality metrics fall below those of the unmodified base model.
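
One way such a comparison could be quantified is with per-frame rotation and translation errors between the requested trajectory and poses recovered from the generated video. The sketch below assumes the recovered poses are already available (for example, from a structure-from-motion tool) and have been brought to the same scale; the function names are hypothetical and this is not claimed to be the paper's exact metric.

```python
import math
from typing import List, Sequence, Tuple

Matrix3 = Sequence[Sequence[float]]
Pose = Tuple[Matrix3, Sequence[float]]  # (3x3 rotation, 3-vector translation)


def rotation_angle_error(r_target: Matrix3, r_estimated: Matrix3) -> float:
    """Geodesic angle in radians between two 3x3 rotation matrices."""
    # trace(R_target^T R_estimated) equals the sum of elementwise products.
    trace = sum(r_target[i][j] * r_estimated[i][j]
                for i in range(3) for j in range(3))
    # Clamp to guard against values slightly outside [-1, 1] from rounding.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_angle)


def translation_error(t_target: Sequence[float], t_estimated: Sequence[float]) -> float:
    """Euclidean distance between two camera translations."""
    return math.dist(t_target, t_estimated)


def trajectory_errors(target: List[Pose], estimated: List[Pose]) -> Tuple[float, float]:
    """Sum rotation and translation errors over all frames."""
    if len(target) != len(estimated):
        raise ValueError("trajectories must have the same number of frames")
    rot = sum(rotation_angle_error(rt, re) for (rt, _), (re, _) in zip(target, estimated))
    trans = sum(translation_error(tt, te) for (_, tt), (_, te) in zip(target, estimated))
    return rot, trans
```

Low errors alone would not settle the question; the visual-quality half of the test still needs a separate comparison against the unmodified base model.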

Original abstract

Controllability plays a crucial role in video generation, as it allows users to create and edit content more precisely. Existing models, however, lack control of camera pose that serves as a cinematic language to express deeper narrative nuances. To alleviate this issue, we introduce CameraCtrl, enabling accurate camera pose control for video diffusion models. Our approach explores effective camera trajectory parameterization along with a plug-and-play camera pose control module that is trained on top of a video diffusion model, leaving other modules of the base model untouched. Moreover, a comprehensive study on the effect of various training datasets is conducted, suggesting that videos with diverse camera distributions and similar appearance to the base model indeed enhance controllability and generalization. Experimental results demonstrate the effectiveness of CameraCtrl in achieving precise camera control with different video generation models, marking a step forward in the pursuit of dynamic and customized video storytelling from textual and camera pose inputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces CameraCtrl, a plug-and-play camera pose control module for text-to-video diffusion models. It proposes a camera trajectory parameterization, trains the module atop frozen base video diffusion models, and conducts a dataset study claiming that videos with diverse camera distributions and similar appearance to the base model improve controllability and generalization. Experiments are said to demonstrate accurate camera control across different video generation models from text and pose inputs.

Significance. If the central claims hold, this would represent a meaningful advance in controllable video generation by adding precise cinematic camera control without full model retraining. The plug-and-play design and dataset ablation study are positive elements that could facilitate adoption if supported by rigorous evidence of preserved base-model quality.

major comments (2)
  1. [Abstract] Abstract: The central claim that the control module enables accurate camera control 'leaving other modules of the base model untouched' is load-bearing, yet the abstract provides no quantitative metrics (FVD, CLIP score, or similar) comparing base-model quality before versus after module insertion and training. This omission prevents verification that distribution shift has not occurred.
  2. [Abstract] Abstract / dataset study: The claim that 'videos with diverse camera distributions and similar appearance to the base model indeed enhance controllability and generalization' is presented as a finding, but no ablation tables, quantitative controllability scores, or error analysis are referenced to support the cross-dataset conclusions.
minor comments (1)
  1. The camera trajectory parameterization is described at a high level; a dedicated subsection with explicit equations for pose encoding and injection into the diffusion process would improve reproducibility.
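
As an illustration of the kind of explicit equation this comment asks for, one widely used per-pixel pose encoding is the Plücker ray embedding. The formulation below is a sketch under stated assumptions: the symbols are introduced here, and $\mathbf{R}$ and $\mathbf{t}$ are taken to be the camera-to-world rotation and translation, so the camera center is $\mathbf{t}$ itself. This page does not establish whether it matches the paper's exact equations.

```latex
% K: camera intrinsics, (R, t): camera-to-world extrinsics (assumed convention),
% (u, v): pixel coordinates, o: camera center, d: unit ray direction in world space.
\[
\mathbf{o} = \mathbf{t}, \qquad
\mathbf{d}_{u,v} = \frac{\mathbf{R}\,\mathbf{K}^{-1}\,[u,\ v,\ 1]^{\top}}
                        {\bigl\lVert \mathbf{R}\,\mathbf{K}^{-1}\,[u,\ v,\ 1]^{\top} \bigr\rVert}, \qquad
\mathbf{p}_{u,v} = \bigl(\mathbf{o} \times \mathbf{d}_{u,v},\ \mathbf{d}_{u,v}\bigr) \in \mathbb{R}^{6}
\]
```

Stacking $\mathbf{p}_{u,v}$ over all pixels gives a six-channel map per frame, which has the same spatial layout as an image and can therefore be fed to a convolutional or attention-based encoder.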

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive suggestions. We address the major comments below and have revised the manuscript accordingly to strengthen the presentation of our results.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that the control module enables accurate camera control 'leaving other modules of the base model untouched' is load-bearing, yet the abstract provides no quantitative metrics (FVD, CLIP score, or similar) comparing base-model quality before versus after module insertion and training. This omission prevents verification that distribution shift has not occurred.

    Authors: We agree that including quantitative metrics in the abstract would better support this central claim. In the revised manuscript, we will update the abstract to reference the FVD and CLIP score comparisons from our experiments, which demonstrate that the base model quality is largely preserved after inserting and training the control module. These metrics are reported in detail in Section 4.1 of the paper.
    revision: yes

  2. Referee: [Abstract] Abstract / dataset study: The claim that 'videos with diverse camera distributions and similar appearance to the base model indeed enhance controllability and generalization' is presented as a finding, but no ablation tables, quantitative controllability scores, or error analysis are referenced to support the cross-dataset conclusions.

    Authors: The dataset ablation study with quantitative results, including controllability scores and error analysis across different datasets, is provided in Section 4.3 with supporting tables. To address this comment, we will revise the abstract to more explicitly summarize the key quantitative findings from this study, such as improved generalization on diverse camera trajectories. This will help readers connect the claim to the evidence without requiring them to immediately consult the full text.
    revision: yes

Circularity Check

0 steps flagged

No circularity in claimed derivation or results

Full rationale

The paper presents an empirical engineering contribution: a trainable plug-and-play control module inserted into a frozen video diffusion backbone and trained via standard supervised learning on external video datasets. No equations, predictions, or first-principles claims are offered that reduce to fitted parameters or self-citations by construction. The central statements (accurate camera control, dataset effects on generalization) are supported by experimental comparisons rather than definitional or self-referential loops. This is the common case of a self-contained applied ML paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that camera pose can be decoupled from appearance and content generation in diffusion models, and that training data properties (camera diversity and appearance similarity) causally improve control without side effects.

axioms (2)
  • domain assumption: A separate control module can be trained to steer camera pose while leaving the base video diffusion model parameters untouched.
    Stated in the abstract as the training strategy for the plug-and-play module.
  • domain assumption: Dataset characteristics (diverse camera distributions and appearance similarity to base model) directly determine controllability and generalization.
    Presented as the outcome of the comprehensive study on training datasets.

pith-pipeline@v0.9.0 · 5464 in / 1242 out tokens · 41730 ms · 2026-05-13T02:00:07.850613+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MemLearner: Learning to Query Context memory for Video World Models

    cs.CV 2026-06 unverdicted novelty 7.0

    MemLearner introduces a learning-based adaptive context query method using query tokens in video world models to improve long-term scene consistency over rule-based retrieval.

  2. WarpHammer: Densifying Scene Warps with 3D Object Priors for Extreme View Synthesis

    cs.CV 2026-06 unverdicted novelty 7.0

    WarpHammer densifies scene warps with 3D object priors from generative models and fuses pose-unknown auxiliary views via multi-view geometry to enable stable extreme novel view synthesis.

  3. MemoBench: Benchmarking World Modeling in Dynamically Changing Environments

    cs.CV 2026-06 unverdicted novelty 7.0

    MemoBench is a new diagnostic benchmark with automated and VQA metrics that evaluates memory consistency in video models under the disappear-and-reappear paradigm in dynamic environments.

  4. MemoBench: Benchmarking World Modeling in Dynamically Changing Environments

    cs.CV 2026-06 unverdicted novelty 7.0

    MemoBench curates 360 ground-truth clips and an evaluation suite to diagnose memory consistency failures in video models when objects change state while out of view.

  5. MemoBench: Benchmarking World Modeling in Dynamically Changing Environments

    cs.CV 2026-06 unverdicted novelty 7.0

    MemoBench is a new diagnostic benchmark with 360 synthetic and real clips plus VQA evaluation that tests memory consistency in video models under the disappear-and-reappear paradigm in dynamically changing environments.

  6. Look-Before-Move: Narrative-Grounded World Visual Attention in Dynamic 3D Story Worlds

    cs.AI 2026-06 unverdicted novelty 7.0

    Look-Before-Move separates narrative observation specification from camera motion via semantic contracts, Monte Carlo viewpoint search, and trajectory grounding, tested on a new 50-story 3D benchmark.

  7. Look-Before-Move: Narrative-Grounded World Visual Attention in Dynamic 3D Story Worlds

    cs.AI 2026-06 unverdicted novelty 7.0

    Look-Before-Move is a framework that converts narrative intent into Semantic Observation Contracts, uses Monte Carlo Viewpoint Search for feasible viewpoints, and applies Semantic Trajectory Grounding for coherent cam...

  8. TryOnCrafter: Unleashing Camera Trajectories for Realistic Video Virtual Try-on via a Renderable 4D Try-on Proxy

    cs.CV 2026-06 unverdicted novelty 7.0

    TryOnCrafter is the first DiT-based framework for camera-controllable video virtual try-on, built on a renderable 4D try-on proxy that is distilled from 2D priors into a 3DGS avatar animated with SMPL-X.

  9. GeoT2V-Bench: Benchmarking 3D Consistency in Text-to-Video Models via 3D Reconstruction

    cs.CV 2026-06 unverdicted novelty 7.0

    GeoT2V-Bench is a reconstruction-based benchmark that reveals disagreements among multiple metrics for 3D consistency in text-to-video models.

  10. Geo-Align: Video Generation Alignment via Metric Geometry Reward

    cs.CV 2026-05 unverdicted novelty 7.0

    Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.

  11. Preserve, Reveal, Expand: Faithful 4D Video Editing with Region-Aware Conditioning

    cs.CV 2026-05 unverdicted novelty 7.0

    PREX decomposes target 4D video volumes into Preserve, Reveal, and Expand roles with a region-aware adapter on a frozen diffusion backbone, trained via proxy tasks, and introduces the PREBench benchmark to reduce regi...

  12. DEVIS-GRPO: Unleashing GRPO on Dynamic Extreme View Synthesis

    cs.CV 2026-05 unverdicted novelty 7.0

    DEVIS-GRPO applies online policy gradients with an accumulative small-to-large view sampling strategy and multi-level rewards to improve trajectory-controlled extreme view video generation, reporting gains on Kubric-4...

  13. Probing into Camera Control of Video Models

    cs.CV 2026-05 unverdicted novelty 7.0

    A training-free method reformulates camera control as geometric displacement fields applied via differentiable latent resampling, enabling control and bias probing in video diffusion models.

  14. R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow

    cs.CV 2026-05 unverdicted novelty 7.0

    R-DMesh generates high-fidelity 4D meshes aligned to video by disentangling base mesh, motion, and a learned rectification jump offset inside a VAE, then using Triflow Attention and rectified-flow diffusion.

  15. GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion

    cs.CV 2026-05 unverdicted novelty 7.0

    GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.

  16. MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics

    cs.CV 2026-05 unverdicted novelty 7.0

    MoCam unifies static and dynamic novel view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion denoising process.

  17. MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics

    cs.CV 2026-05 unverdicted novelty 7.0

    MoCam uses structured denoising dynamics in diffusion models to temporally decouple geometric alignment from appearance refinement, enabling unified novel view synthesis that outperforms prior methods on imperfect poi...

  18. $h$-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement

    cs.CV 2026-05 unverdicted novelty 7.0

    h-control introduces block-conditional pseudo-Gibbs refinement for training-free camera control in flow-matching video generators, achieving superior FVD scores on RealEstate10K and DAVIS benchmarks.

  19. Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting

    cs.CV 2026-04 unverdicted novelty 7.0

    Reshoot-Anything trains a diffusion transformer on pseudo multi-view triplets created by cropping and warping monocular videos to achieve temporally consistent video reshooting with robust camera control on dynamic scenes.

  20. WorldMark: A Unified Benchmark Suite for Interactive Video World Models

    cs.CV 2026-04 unverdicted novelty 7.0

    WorldMark is the first public benchmark that standardizes scenes, trajectories, and control interfaces across heterogeneous interactive image-to-video world models.

  21. MultiWorld: Scalable Multi-Agent Multi-View Video World Models

    cs.CV 2026-04 unverdicted novelty 7.0

    MultiWorld is a scalable framework for multi-agent multi-view video world models that improves controllability and consistency over single-agent baselines in game and robot tasks.

  22. UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

    cs.CV 2026-04 unverdicted novelty 7.0

    UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.

  23. Geometrically Consistent Multi-View Scene Generation from Freehand Sketches

    cs.CV 2026-04 unverdicted novelty 7.0

    A framework generates consistent multi-view scenes from one freehand sketch via a ~9k-sample dataset, Parallel Camera-Aware Attention Adapters, and Sparse Correspondence Supervision Loss, outperforming baselines in re...

  24. Prompt Relay: Inference-Time Temporal Control for Multi-Event Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Prompt Relay is an inference-time plug-and-play method that penalizes cross-attention to enforce temporal prompt alignment and reduce semantic entanglement in multi-event video generation.

  25. Novel View Synthesis as Video Completion

    cs.CV 2026-04 unverdicted novelty 7.0

    Video diffusion models can be adapted into permutation-invariant generators for sparse novel view synthesis by treating the problem as video completion and removing temporal order cues.

  26. MotionScape: A Large-Scale Real-World Highly Dynamic UAV Video Dataset for World Models

    cs.CV 2026-04 unverdicted novelty 7.0

    MotionScape is a large-scale UAV video dataset with highly dynamic 6-DoF motions, geometric trajectories, and semantic annotations to train world models that better simulate complex 3D dynamics under large viewpoint changes.

  27. MoRight: Motion Control Done Right

    cs.CV 2026-04 unverdicted novelty 7.0

    MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...

  28. OmniCamera: A Unified Framework for Multi-task Video Generation with Arbitrary Camera Control

    cs.CV 2026-04 unverdicted novelty 7.0

    OmniCamera disentangles video content and camera motion for multi-task generation with arbitrary camera control via the OmniCAM hybrid dataset and Dual-level Curriculum Co-Training.

  29. SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras

    cs.CV 2026-03 unverdicted novelty 7.0

    SparseCam4D achieves spatio-temporally consistent high-fidelity 4D reconstruction from sparse cameras via a Spatio-Temporal Distortion Field that corrects inconsistencies in generative observations.

  30. Setting the Stage: Text-Driven Scene-Consistent Image Generation

    cs.CV 2025-12 conditional novelty 7.0

    A new data pipeline using real photos, entity removal, and image-to-video models plus a cross-view attention loss enables text-driven generation of actors in reference scenes with improved alignment.

  31. StereoSpace: Depth-Free Synthesis of Stereo Geometry via End-to-End Diffusion in a Canonical Space

    cs.CV 2025-12 unverdicted novelty 7.0

    A viewpoint-conditioned diffusion model generates stereo image pairs from monocular input in a canonical rectified space without using depth or explicit warping.

  32. GimbalDiffusion: Gravity-Aware Camera Control for Video Generation

    cs.CV 2025-12 conditional novelty 7.0

    GimbalDiffusion adds gravity-referenced absolute camera control and null-pitch conditioning to text-to-video diffusion models, trained on full-sphere panoramic data, to support extreme trajectories and reduce prompt e...

  33. HandsOnWorld: Unconstrained Egocentric Video Generation with Camera-Disentangled Hand Control

    cs.CV 2026-07 unverdicted novelty 6.0

    HandsOnWorld creates a hand-controlled egocentric video generator from unconstrained monocular video via a new EgoVid-Pro dataset from monocular reconstruction and a Plücker Hand Map that disentangles camera and hand motion.

  34. NeoMap: Training-free Novel-View Synthesis from Single Images and Videos

    cs.CV 2026-07 unverdicted novelty 6.0

    NeoMap introduces a training-free framework using convergent manifold alternating projection iterations to extract high-fidelity novel views from pre-trained video models, outperforming prior methods on standard benchmarks.

  35. World from Motion: Generative Dynamic Gaussian Reconstruction from Monocular Video

    cs.CV 2026-07 unverdicted novelty 6.0

    A generative video model conditioned on pixel-aligned 3D renderings produces consistent dynamic 3D Gaussian splats from monocular video and sets new SOTA in 4D reconstruction.

  36. SIFT: Self-Imagination Fine-Tuning for Physically Plausible Motion in Video Diffusion Models

    cs.CV 2026-06 unverdicted novelty 6.0

    SIFT fine-tunes video diffusion models on self-generated videos using motion-aware supervision to reduce motion entanglement and improve physical plausibility.

  37. MemoBench: Benchmarking World Modeling in Dynamically Changing Environments

    cs.CV 2026-06 unverdicted novelty 6.0

    MemoBench curates 360 clips and an evaluation suite to test video models on recovering updated object states after disappear-and-reappear in changing environments.

  38. Neural Voxel Dynamics: Learning Implicit 3D Physics via Volumetric Feature Advection

    cs.CV 2026-06 unverdicted novelty 6.0

    A self-supervised framework learns implicit 3D physics by lifting V-JEPA features into voxels and performing volumetric feature advection conditioned on actions.

  39. NavWM: A Unified Navigation World Model for Foresight-Driven Planning

    cs.RO 2026-06 unverdicted novelty 6.0

    NavWM unifies latent world tokens and anchor-based multimodal trajectory forecasting into a closed-loop planner that improves future state generation and zero-shot navigation.

  40. Current World Models Lack a Persistent State Core

    cs.CV 2026-06 unverdicted novelty 6.0

    Current world models fail to evolve internal state when unobserved and instead resume scenes at the last observed state, as diagnosed by the new WRBench benchmark across 23 models and 9600 videos.

  41. TriMotion: Modality-Agnostic Camera Control for Video Generation

    cs.CV 2026-06 unverdicted novelty 6.0

    TriMotion is a modality-agnostic framework that maps video, pose, and text descriptions of the same camera trajectory into a shared motion embedding space, trained with a new triplet dataset and latent consistency obj...

  42. PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory

    cs.CV 2026-06 unverdicted novelty 6.0

    PermaVid disentangles spatial context into semantic appearance and geometric structure via multi-modal memory banks and edit-aware updates to maintain long-term consistency in video generation after edits.

  43. OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data

    cs.CV 2026-06 unverdicted novelty 6.0

    OmniDirector introduces a grid-based camera representation and hierarchical prompt agent for multi-shot camera cloning in video diffusion models trained on million-scale unpaired data.

  44. Latent Spatial Memory for Video World Models

    cs.CV 2026-06 unverdicted novelty 6.0

    Mirage stores and queries 3D scene information in diffusion latent space via depth-guided lifting and warping, yielding 10.57× faster generation and 55× smaller memory than explicit RGB point-cloud baselines while rea...

  45. Prisma-World: Camera-Controllable Multi-Agent Video World Model

    cs.CV 2026-06 unverdicted novelty 6.0

    Prisma-World is a diffusion-based multi-agent video model that uses joint full-attention, multi-agent RoPE, and relative camera geometry injection plus curriculum training to produce consistent cross-view videos from ...

  46. Streaming Video Generation with Streaming Force Control

    cs.CV 2026-06 unverdicted novelty 6.0

    StreamForce presents a unified causal model for force-controllable streaming video generation using a new force representation and distillation pipeline, claiming SOTA force adherence and 16.6 FPS performance.

  47. Cosmos 3: Omnimodal World Models for Physical AI

    cs.CV 2026-06 unverdicted novelty 6.0

    Cosmos 3 presents a unified omnimodal world model family based on mixture-of-transformers that processes language, vision, audio, and action for Physical AI applications.

  48. Geometry-Aware Implicit Memory for Video World Models

    cs.CV 2026-06 unverdicted novelty 6.0

    GIM-World adds a camera-queryable geometry distillation head and pruning rule to implicit memory in video world models, claiming better long-horizon geometric consistency on the MIND benchmark than explicit and implic...

  49. AlbedoEdit: Unified Instance-Level Video Editing with Albedo Guidance

    cs.GR 2026-05 unverdicted novelty 6.0

    AlbedoEdit fine-tunes video foundation models to translate RGB videos into edited versions conditioned on user-edited first-frame albedo maps, trained on a new synthetic paired dataset for insertion, removal, and text...

  50. E$^3$C: Video Generation with 3D Environmental Memory and Ego-Exo Human Pose Control

    cs.CV 2026-05 unverdicted novelty 6.0

    E³C is a video diffusion model that disentangles persistent 3D scene structure via point-cloud memory from human dynamics via ego-exo pose controls for improved egocentric video generation on the Nymeria dataset.

  51. GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    GeoFlow adds a geometry-consistency reward based on rigid camera flow and object appearance preservation, integrated via reinforcement fine-tuning to improve geometric coherence in video generation.

  52. Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video

    cs.CV 2026-05 unverdicted novelty 6.0

    Warp-as-History enables zero-shot camera trajectory following in frozen video models by supplying camera-warped pseudo-history, with single-video LoRA fine-tuning improving generalization to unseen videos.

  53. ReactiveGWM: Steering NPC in Reactive Game World Models

    cs.CV 2026-05 unverdicted novelty 6.0

    ReactiveGWM introduces a decoupled diffusion architecture for player-NPC interactions that learns game-agnostic response logic for zero-shot strategy transfer across games.

  54. R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow

    cs.CV 2026-05 unverdicted novelty 6.0

    R-DMesh proposes a VAE-based disentanglement of base mesh, motion trajectories, and rectification offset plus Triflow Attention and rectified-flow diffusion to produce 4D meshes aligned to video despite initial pose mismatch.

  55. UniFixer: A Universal Reference-Guided Fixer for Diffusion-Based View Synthesis

    cs.CV 2026-05 unverdicted novelty 6.0

    UniFixer is a universal reference-guided framework that fixes spatial, temporal, and backbone-related degradations in diffusion-based view synthesis via coarse-to-fine modules and achieves zero-shot SOTA results on no...

  56. $h$-control: Training-Free Camera Control via Block-Conditional Gibbs Refinement

    cs.CV 2026-05 unverdicted novelty 6.0

    h-control augments hard-replacement guidance with block-conditional pseudo-Gibbs refinement on unobserved latent sites and adaptive 3D patch freezing to achieve superior FVD on RealEstate10K and DAVIS.

  57. RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control

    cs.CV 2026-05 unverdicted novelty 6.0

    RealCam is a causal autoregressive model for real-time camera-controlled video-to-video generation, using cross-frame in-context teacher distillation and loop-closed data augmentation to achieve high fidelity and consistency.

  58. Vista4D: Video Reshooting with 4D Point Clouds

    cs.CV 2026-04 unverdicted novelty 6.0

    Vista4D re-synthesizes dynamic videos from new viewpoints by grounding them in a 4D point cloud built with static segmentation and multiview training.

  59. Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    A decoupled memory branch with hybrid cues, cross-attention, and gating improves spatial consistency and data efficiency in long-horizon camera-trajectory video generation.

  60. UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

    cs.CV 2026-04 unverdicted novelty 6.0

    UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.

Reference graph

Works this paper leans on

130 extracted references · 130 canonical work pages · cited by 84 Pith papers · 25 internal anchors

  1. [3]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Glide: Towards photorealistic image generation and editing with text-guided diffusion models , author=. arXiv preprint arXiv:2112.10741 , year=

  2. [4]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Hierarchical text-conditional image generation with clip latents , author=. arXiv preprint arXiv:2204.06125 , year=

  3. [5]

    Advances in Neural Information Processing Systems , volume=

    Photorealistic text-to-image diffusion models with deep language understanding , author=. Advances in Neural Information Processing Systems , volume=

  4. [6]

    Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

    High-resolution image synthesis with latent diffusion models , author=. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages=

  5. [7]

    eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

    ediffi: Text-to-image diffusion models with an ensemble of expert denoisers , author=. arXiv preprint arXiv:2211.01324 , year=

  6. [8]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

    Vector quantized diffusion model for text-to-image synthesis , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

  7. [9]

    Multi-concept customization of text-to-image diffusion

    Multi-concept customization of text-to-image diffusion. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

  8. [11]

    Versatile diffusion: Text, images and variations all in one diffusion model

    Versatile diffusion: Text, images and variations all in one diffusion model. Proceedings of the IEEE/CVF International Conference on Computer Vision.

  9. [12]

    Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

    Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

  10. [13]

    Advancing high-resolution video-language representation with large-scale video transcriptions

    Advancing high-resolution video-language representation with large-scale video transcriptions. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

  11. [14]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Frozen in time: A joint video and image encoder for end-to-end retrieval. Proceedings of the IEEE/CVF International Conference on Computer Vision.

  12. [16]

    Align your latents: High-resolution video synthesis with latent diffusion models

    Align your latents: High-resolution video synthesis with latent diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

  13. [17]

    Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation

    Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. Proceedings of the IEEE/CVF International Conference on Computer Vision.

  14. [22]

    Dreampose: Fashion image-to-video synthesis via stable diffusion

    Dreampose: Fashion image-to-video synthesis via stable diffusion. arXiv preprint arXiv:2304.06025.

  15. [25]

    Light field networks: Neural scene representations with single-evaluation rendering

    Light field networks: Neural scene representations with single-evaluation rendering. Advances in Neural Information Processing Systems.

  16. [26]

    Ray Conditioning: Trading Photo-consistency for Photo-realism in Multi-view Image Generation

    Ray Conditioning: Trading Photo-consistency for Photo-realism in Multi-view Image Generation. arXiv preprint arXiv:2304.13681.

  17. [27]

    Free3D: Consistent Novel View Synthesis without 3D Representation

    Free3D: Consistent Novel View Synthesis without 3D Representation. arXiv preprint arXiv:2312.04551.

  18. [29]

    Objaverse: A universe of annotated 3d objects

    Objaverse: A universe of annotated 3d objects. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

  19. [30]

    Mvimgnet: A large-scale dataset of multi-view images

    Mvimgnet: A large-scale dataset of multi-view images. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

  20. [32]

    Videocomposer: Compositional video synthesis with motion controllability

    Videocomposer: Compositional video synthesis with motion controllability. Advances in Neural Information Processing Systems.

  21. [34]

    Adding conditional control to text-to-image diffusion models

    Adding conditional control to text-to-image diffusion models. Proceedings of the IEEE/CVF International Conference on Computer Vision.

  22. [35]

    Infinite nature: Perpetual view generation of natural scenes from a single image

    Infinite nature: Perpetual view generation of natural scenes from a single image. Proceedings of the IEEE/CVF International Conference on Computer Vision.

  23. [36]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Scannet: Richly-annotated 3d reconstructions of indoor scenes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

  24. [37]

    Learning the depths of moving people by watching frozen people

    Learning the depths of moving people by watching frozen people. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

  25. [38]

    DreamPose: Fashion Video Synthesis with Stable Diffusion

    DreamPose: Fashion Video Synthesis with Stable Diffusion. Proceedings of the IEEE/CVF International Conference on Computer Vision.

  26. [39]

    Mm-diffusion: Learning multi-modal diffusion models for joint audio and video generation

    Mm-diffusion: Learning multi-modal diffusion models for joint audio and video generation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

  27. [41]

    VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

    VideoCrafter1: Open Diffusion Models for High-Quality Video Generation. 2023.

  28. [44]

    Denoising diffusion probabilistic models

    Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems.

  29. [45]

    Scalable diffusion models with transformers

    Scalable diffusion models with transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision.

  30. [46]

    MagicVideo: Efficient Video Generation With Latent Diffusion Models

    Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018.

  31. [52]

    Video generation models as world simulators

    Video generation models as world simulators. 2024.

  32. [54]

    Motiondirector: Motion customization of text-to-video diffusion models

    Motiondirector: Motion customization of text-to-video diffusion models. arXiv preprint arXiv:2310.08465.

  33. [60]

    Text2video-zero: Text-to-image diffusion models are zero-shot video generators

    Text2video-zero: Text-to-image diffusion models are zero-shot video generators. IEEE International Conference on Computer Vision (ICCV).

  34. [61]

    AADiff: Audio-Aligned Video Synthesis with Text-to-Image Diffusion

    AADiff: Audio-Aligned Video Synthesis with Text-to-Image Diffusion. arXiv preprint arXiv:2305.04001.

  35. [62]

    Generative Disco: Text-to-Video Generation for Music Visualization

    Generative Disco: Text-to-Video Generation for Music Visualization. arXiv preprint arXiv:2304.08551.

  36. [63]

    Structure-from-Motion Revisited

    Sch\". Structure-from-Motion Revisited.

  37. [64]

    Moviefactory: Automatic movie creation from text using large generative models for language and images

    MovieFactory: Automatic Movie Creation from Text using Large Generative Models for Language and Images. arXiv preprint arXiv:2306.07257.

  38. [65]

    Conditional Image-to-Video Generation with Latent Flow Diffusion Models

    Conditional Image-to-Video Generation with Latent Flow Diffusion Models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

  39. [67]

    ToonYou

    BradCatt. ToonYou.

  40. [68]

    SG 161222 civitai.

  41. [69]

    SO3 rotation distance

    Boris Belousov. SO3 rotation distance.

  42. [71]

    LAVIS: A One-stop Library for Language-Vision Intelligence

    Dongxu Li, Junnan Li, Hung Le, Guangsen Wang, Silvio Savarese, and Steven C.H. Hoi. LAVIS: A One-stop Library for Language-Vision Intelligence. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), 2023

  43. [73]

    Seine: Short-to-long video diffusion model for generative transition and prediction

    Seine: Short-to-long video diffusion model for generative transition and prediction. The Twelfth International Conference on Learning Representations.

  44. [74]

    Structure and content-guided video synthesis with diffusion models

    Structure and content-guided video synthesis with diffusion models. Proceedings of the IEEE/CVF International Conference on Computer Vision.

  45. [76]

    FVD: A new metric for video generation

    FVD: A new metric for video generation. https://openreview.net/forum?id=rylgEULtdN

  46. [77]

    Learning transferable visual models from natural language supervision

    Learning transferable visual models from natural language supervision. International Conference on Machine Learning, 2021.

  47. [79]

    Elucidating the design space of diffusion-based generative models

    Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems.

  48. [80]

    Compositional 3D Scene Generation using Locally Conditioned Diffusion

    Compositional 3D Scene Generation using Locally Conditioned Diffusion. ArXiv.

  49. [81]

    Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network

    Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

  50. [83]

    Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

    Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks. 2024.

  51. [84]

    Raft: Recurrent all-pairs field transforms for optical flow

    Raft: Recurrent all-pairs field transforms for optical flow. Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part II 16, 2020.

  52. [85]

    CAT3D: Create Anything in 3D with Multi-View Diffusion Models

    CAT3D: Create Anything in 3D with Multi-View Diffusion Models. Advances in Neural Information Processing Systems.

  53. [87]

    Training-free Camera Control for Video Generation

    Training-free Camera Control for Video Generation. arXiv preprint arXiv:2406.10126, 2024

  54. [93]

    Snap video: Scaled spatiotemporal transformers for text-to-video synthesis

    Snap video: Scaled spatiotemporal transformers for text-to-video synthesis. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

  55. [95]

    Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. 2024.

  56. [96]

    Vd3d: Taming large video diffusion transformers for 3d camera control

    Sherwin Bahmani, Ivan Skorokhodov, Aliaksandr Siarohin, Willi Menapace, Guocheng Qian, Michael Vasilkovsky, Hsin-Ying Lee, Chaoyang Wang, Jiaxu Zou, Andrea Tagliasacchi, et al. Vd3d: Taming large video diffusion transformers for 3d camera control. arXiv preprint arXiv:2407.12781, 2024

  57. [97]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1728-1738, 2021

  58. [98]

    Lumiere: A space-time diffusion model for video generation

    Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, et al. Lumiere: A space-time diffusion model for video generation. arXiv preprint arXiv:2401.12945, 2024

  59. [99]

    So3 rotation distance

    Boris Belousov. So3 rotation distance. http://www.boris-belousov.net/2016/12/01/quat-dist/

  60. [100]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023a

  61. [101]

    Align your latents: High-resolution video synthesis with latent diffusion models

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22563-22575, 2023b

  62. [102]

    BradCatt. Toonyou. https://civitai.com/models/30240/toonyou

  63. [103]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. https://openai.com/index/video-generation-models-as-world-simulators/, 2024. URL https://openai.com/research/video-generation-models-as-world-...

  64. [104]

    Videocrafter1: Open diffusion models for high-quality video generation

    Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter1: Open diffusion models for high-quality video generation, 2023a

  65. [105]

    Motion-conditioned diffusion model for controllable video synthesis

    Tsai-Shien Chen, Chieh Hubert Lin, Hung-Yu Tseng, Tsung-Yi Lin, and Ming-Hsuan Yang. Motion-conditioned diffusion model for controllable video synthesis. arXiv preprint arXiv:2304.14404, 2023b

  66. [106]

    Control-A-Video: controllable text-to-video generation with diffusion models

    Weifeng Chen, Jie Wu, Pan Xie, Hefeng Wu, Jiashi Li, Xin Xia, Xuefeng Xiao, and Liang Lin. Control-a-video: Controllable text-to-video generation with diffusion models. arXiv preprint arXiv:2305.13840, 2023c

  67. [107]

    Seine: Short-to-long video diffusion model for generative transition and prediction

    Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Seine: Short-to-long video diffusion model for generative transition and prediction. In The Twelfth International Conference on Learning Representations, 2023d

  68. [108]

    Boosting camera motion control for video diffusion transformers

    Soon Yau Cheong, Duygu Ceylan, Armin Mustafa, Andrew Gilbert, and Chun-Hao Paul Huang. Boosting camera motion control for video diffusion transformers. arXiv preprint arXiv:2410.10802, 2024

  69. [109]

    Realistic vision

    SG 161222 civitai. Realistic vision. https://civitai.com/models/4201/realistic-vision-v60-b1

  70. [110]

    Objaverse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13142-13153, 2023

  71. [111]

    Structure and content-guided video synthesis with diffusion models

    Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7346-7356, 2023

  72. [112]

    Sparsectrl: Adding sparse controls to text-to-video diffusion models

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Sparsectrl: Adding sparse controls to text-to-video diffusion models. arXiv preprint arXiv:2311.16933, 2023a

  73. [113]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023b

  74. [114]

    Photorealistic video generation with diffusion models

    Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, and José Lezama. Photorealistic video generation with diffusion models. arXiv preprint arXiv:2312.06662, 2023

  75. [115]

    Latent Video Diffusion Models for High-Fidelity Long Video Generation

    Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity video generation with arbitrary lengths. arXiv preprint arXiv:2211.13221, 2022

  76. [116]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

  77. [117]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840-6851, 2020

  78. [118]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a

  79. [119]

    Video Diffusion Models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv:2204.03458, 2022b

  80. [120]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022

Showing first 80 references.