Do generative video models understand physical principles?

Saman Motamed, Laura Culp, Kevin Swersky, Priyank Jaini, Robert Geirhos · 2025 · cs.CV · arXiv 2501.09038

Mixed citation behavior. Most common role is background (57%).

28 Pith papers citing it

Background 57% of classified citations

abstract

AI video generation is undergoing a revolution, with quality and realism advancing rapidly. These advances have led to a passionate scientific debate: Do video models learn "world models" that discover laws of physics -- or, alternatively, are they merely sophisticated pixel predictors that achieve visual realism without understanding the physical principles of reality? We address this question by developing Physics-IQ, a comprehensive benchmark dataset that can only be solved by acquiring a deep understanding of various physical principles, like fluid dynamics, optics, solid mechanics, magnetism and thermodynamics. We find that across a range of current models (Sora, Runway, Pika, Lumiere, Stable Video Diffusion, and VideoPoet), physical understanding is severely limited, and unrelated to visual realism. At the same time, some test cases can already be successfully solved. This indicates that acquiring certain physical principles from observation alone may be possible, but significant challenges remain. While we expect rapid advances ahead, our work demonstrates that visual realism does not imply physical understanding. Our project page is at https://physics-iq.github.io; code at https://github.com/google-deepmind/physics-IQ-benchmark.

citation-role summary

background: 5 · baseline: 1 · dataset: 1

citation-polarity summary

background: 4 · baseline: 1 · support: 1 · use dataset: 1

representative citing papers

PhysInOne: Visual Physics Learning and Reasoning in One Suite

cs.CV · 2026-04-10 · unverdicted · novelty 8.0

PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and motion transfer.

Show Me Examples: Inferring Visual Concepts from Image Sets

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

Introduces VICIS task and training framework for inferring visual concepts from image sets, with experiments showing better accuracy, diversity, and generalization than standard VLMs on synthetic and ImageNet data.

PhysEditWorld: A Large-Scale Dataset Toward Physics-Editable World Models

cs.CV · 2026-06-25 · unverdicted · novelty 7.0 · 2 refs

PhysEditWorld is a new dataset of over 60 million frames from 12 UE5 cinematic scenes with synchronized multimodal signals and explicit gravity labels, built via replay to support physics-editable world models.

Benchmarking Single-Factor Physical Video-to-Audio Generation

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

FlatSounds benchmark shows state-of-the-art V2A models rely more on text captions than visual input for physical and semantic accuracy, with captions improving correctness but degrading temporal alignment.

CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.

Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Aero-World adapts a pretrained latent diffusion transformer for action-conditioned aerial video generation by injecting inertial action tokens and using a frozen latent-space Physics Probe for inertial consistency supervision during LoRA finetuning, with a new AeroBench benchmark showing improved AA

MechVerse: Evaluating Physical Motion Consistency in Video Generation Models

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

MechVerse benchmark shows current video generation models preserve appearance but fail at mechanically admissible motion, with errors rising as coupling complexity increases.

OSCBench: Benchmarking Object State Change in Text-to-Video Generation

cs.CV · 2026-03-12 · unverdicted · novelty 7.0

OSCBench demonstrates that text-to-video models produce inaccurate and temporally inconsistent object state changes, with performance dropping sharply on novel and compositional action scenarios.

Physical Commonsense Reasoning for Lower-Resourced Languages and Dialects: a Study on Basque

cs.CL · 2026-02-16 · conditional · novelty 7.0

BasPhyCo is the first physical commonsense reasoning dataset for Basque and dialects, showing LLMs have limited performance on verifiability tasks especially with dialects.

DreamGen: Unlocking Generalization in Robot Learning through Video World Models

cs.RO · 2025-05-19 · unverdicted · novelty 7.0

DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperation dataset.

Streaming Video Generation with Streaming Force Control

cs.CV · 2026-06-05 · unverdicted · novelty 6.0

StreamForce presents a unified causal model for force-controllable streaming video generation using a new force representation and distillation pipeline, claiming SOTA force adherence and 16.6 FPS performance.

NewtPhys: Do Foundation Models Understand Newtonian Physics?

cs.CV · 2026-06-02 · unverdicted · novelty 6.0

NewtPhys dataset with real-scene physics annotations reveals limitations in low-level Newtonian reasoning across 56 VLMs and 10 VFMs.

NEWTON: Agentic Planning for Physically Grounded Video Generation

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

NEWTON improves physical accuracy in video generation by deploying a trainable planner that coordinates physics-aware tools and a verifier, raising joint accuracy on VideoPhy-2 without altering the base generators.

PH-Dreamer: A Physics-Driven World Model via Port-Hamiltonian Generative Dynamics

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

PH-Dreamer integrates a port-Hamiltonian framework into generative world models to enforce physical priors, yielding tighter imagined-real reward alignment and reduced latent space volume on visual control benchmarks.

Learning Long-term Motion Embeddings for Efficient Kinematics Generation

cs.CV · 2026-04-13 · unverdicted · novelty 6.0

A 64x temporally compressed motion embedding learned from trackers enables efficient conditional flow-matching generation of long-term motions that outperform video models and task-specific methods.

ProPhy: Progressive Physical Alignment for Dynamic World Simulation

cs.CV · 2025-12-05 · unverdicted · novelty 6.0

ProPhy adds explicit physics-aware conditioning via semantic and refinement experts plus VLM knowledge transfer to produce more physically coherent dynamic videos than prior methods.

Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility

cs.CV · 2025-09-29 · unverdicted · novelty 6.0

A training-free framework uses physics-violating counterfactual prompts and Synchronized Decoupled Guidance to suppress implausible motions in diffusion-based video generation while preserving photorealism.

Video models are zero-shot learners and reasoners

cs.LG · 2025-09-24 · unverdicted · novelty 6.0

Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.

Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations

cs.RO · 2025-07-01 · unverdicted · novelty 6.0

RIGVid shows that filtered AI-generated videos can serve as effective supervision for complex robotic manipulation tasks without any real demonstrations.

FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

cs.CV · 2025-05-23 · conditional · novelty 6.0

FSDrive uses a generated future scene frame as visual spatio-temporal CoT to improve VLA models for safer autonomous driving trajectory prediction.

MAGI-1: Autoregressive Video Generation at Scale

cs.CV · 2025-05-19 · unverdicted · novelty 6.0

MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.

Tempered Self-Similarity Alignment for Physically Plausible Video Generation

cs.CV · 2026-05-24 · unverdicted · novelty 5.0

Tempered Self-similarity Alignment transfers relational structure from foundation-model STSS into video generators via probabilistic correspondence alignment, yielding reported gains in physical plausibility on VideoPhy benchmarks.

Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics

cs.CV · 2026-04-09 · unverdicted · novelty 5.0 · 2 refs

Phantom jointly models visual content and latent physical dynamics via a physics-aware video representation to generate physically consistent videos.

MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models

cs.CV · 2025-11-23 · unverdicted · novelty 5.0

MASS adds spatiotemporal motion signals and 3D grounding to VLMs and releases MASS-Bench, yielding physics-reasoning performance within 2% of Gemini-2.5-Flash after reinforcement fine-tuning.

citing papers explorer

Showing 28 of 28 citing papers.

PhysInOne: Visual Physics Learning and Reasoning in One Suite · cs.CV · 2026-04-10 · unverdicted · none · ref 63 · internal anchor
PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and motion transfer.
Show Me Examples: Inferring Visual Concepts from Image Sets · cs.CV · 2026-07-02 · unverdicted · none · ref 35 · internal anchor
Introduces VICIS task and training framework for inferring visual concepts from image sets, with experiments showing better accuracy, diversity, and generalization than standard VLMs on synthetic and ImageNet data.
PhysEditWorld: A Large-Scale Dataset Toward Physics-Editable World Models · cs.CV · 2026-06-25 · unverdicted · none · ref 19 · 2 links · internal anchor
PhysEditWorld is a new dataset of over 60 million frames from 12 UE5 cinematic scenes with synchronized multimodal signals and explicit gravity labels, built via replay to support physics-editable world models.
Benchmarking Single-Factor Physical Video-to-Audio Generation · cs.CV · 2026-05-28 · unverdicted · none · ref 47 · internal anchor
FlatSounds benchmark shows state-of-the-art V2A models rely more on text captions than visual input for physical and semantic accuracy, with captions improving correctness but degrading temporal alignment.
CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models · cs.CV · 2026-05-22 · unverdicted · none · ref 37 · internal anchor
CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.
Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls · cs.CV · 2026-05-19 · unverdicted · none · ref 17 · internal anchor
Aero-World adapts a pretrained latent diffusion transformer for action-conditioned aerial video generation by injecting inertial action tokens and using a frozen latent-space Physics Probe for inertial consistency supervision during LoRA finetuning, with a new AeroBench benchmark showing improved AA
MechVerse: Evaluating Physical Motion Consistency in Video Generation Models · cs.CV · 2026-05-14 · unverdicted · none · ref 30 · internal anchor
MechVerse benchmark shows current video generation models preserve appearance but fail at mechanically admissible motion, with errors rising as coupling complexity increases.
OSCBench: Benchmarking Object State Change in Text-to-Video Generation · cs.CV · 2026-03-12 · unverdicted · none · ref 2 · internal anchor
OSCBench demonstrates that text-to-video models produce inaccurate and temporally inconsistent object state changes, with performance dropping sharply on novel and compositional action scenarios.
Physical Commonsense Reasoning for Lower-Resourced Languages and Dialects: a Study on Basque · cs.CL · 2026-02-16 · conditional · none · ref 9 · internal anchor
BasPhyCo is the first physical commonsense reasoning dataset for Basque and dialects, showing LLMs have limited performance on verifiability tasks especially with dialects.
DreamGen: Unlocking Generalization in Robot Learning through Video World Models · cs.RO · 2025-05-19 · unverdicted · none · ref 27 · internal anchor
DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperation dataset.
Streaming Video Generation with Streaming Force Control · cs.CV · 2026-06-05 · unverdicted · none · ref 42 · internal anchor
StreamForce presents a unified causal model for force-controllable streaming video generation using a new force representation and distillation pipeline, claiming SOTA force adherence and 16.6 FPS performance.
NewtPhys: Do Foundation Models Understand Newtonian Physics? · cs.CV · 2026-06-02 · unverdicted · none · ref 38 · internal anchor
NewtPhys dataset with real-scene physics annotations reveals limitations in low-level Newtonian reasoning across 56 VLMs and 10 VFMs.
NEWTON: Agentic Planning for Physically Grounded Video Generation · cs.CV · 2026-05-18 · unverdicted · none · ref 23 · internal anchor
NEWTON improves physical accuracy in video generation by deploying a trainable planner that coordinates physics-aware tools and a verifier, raising joint accuracy on VideoPhy-2 without altering the base generators.
PH-Dreamer: A Physics-Driven World Model via Port-Hamiltonian Generative Dynamics · cs.LG · 2026-05-18 · unverdicted · none · ref 25 · internal anchor
PH-Dreamer integrates a port-Hamiltonian framework into generative world models to enforce physical priors, yielding tighter imagined-real reward alignment and reduced latent space volume on visual control benchmarks.
Learning Long-term Motion Embeddings for Efficient Kinematics Generation · cs.CV · 2026-04-13 · unverdicted · none · ref 29 · internal anchor
A 64x temporally compressed motion embedding learned from trackers enables efficient conditional flow-matching generation of long-term motions that outperform video models and task-specific methods.
ProPhy: Progressive Physical Alignment for Dynamic World Simulation · cs.CV · 2025-12-05 · unverdicted · none · ref 20 · internal anchor
ProPhy adds explicit physics-aware conditioning via semantic and refinement experts plus VLM knowledge transfer to produce more physically coherent dynamic videos than prior methods.
Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility · cs.CV · 2025-09-29 · unverdicted · none · ref 18 · internal anchor
A training-free framework uses physics-violating counterfactual prompts and Synchronized Decoupled Guidance to suppress implausible motions in diffusion-based video generation while preserving photorealism.
Video models are zero-shot learners and reasoners · cs.LG · 2025-09-24 · unverdicted · none · ref 47 · internal anchor
Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.
Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations · cs.RO · 2025-07-01 · unverdicted · none · ref 82 · internal anchor
RIGVid shows that filtered AI-generated videos can serve as effective supervision for complex robotic manipulation tasks without any real demonstrations.
FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving · cs.CV · 2025-05-23 · conditional · none · ref 46 · internal anchor
FSDrive uses a generated future scene frame as visual spatio-temporal CoT to improve VLA models for safer autonomous driving trajectory prediction.
MAGI-1: Autoregressive Video Generation at Scale · cs.CV · 2025-05-19 · unverdicted · none · ref 33 · internal anchor
MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
Tempered Self-Similarity Alignment for Physically Plausible Video Generation · cs.CV · 2026-05-24 · unverdicted · none · ref 36 · internal anchor
Tempered Self-similarity Alignment transfers relational structure from foundation-model STSS into video generators via probabilistic correspondence alignment, yielding reported gains in physical plausibility on VideoPhy benchmarks.
Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics · cs.CV · 2026-04-09 · unverdicted · none · ref 29 · 2 links · internal anchor
Phantom jointly models visual content and latent physical dynamics via a physics-aware video representation to generate physically consistent videos.
MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models · cs.CV · 2025-11-23 · unverdicted · none · ref 38 · internal anchor
MASS adds spatiotemporal motion signals and 3D grounding to VLMs and releases MASS-Bench, yielding physics-reasoning performance within 2% of Gemini-2.5-Flash after reinforcement fine-tuning.
DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving · cs.CV · 2026-05-11 · unverdicted · none · ref 136 · internal anchor
DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.
World Simulation with Video Foundation Models for Physical AI · cs.CV · 2025-10-28 · unverdicted · none · ref 53 · internal anchor
Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.
Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI · cs.AI · 2025-10-06 · unverdicted · none · ref 165 · internal anchor
A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-grounded world models.
Evolution of Video Generative Foundations · cs.CV · 2026-04-07 · unverdicted · none · ref 169 · internal anchor
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.
