PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and motion transfer.
hub Mixed citations
Do generative video models understand physical principles?
Mixed citation behavior. Most common role is background (57%).
abstract
AI video generation is undergoing a revolution, with quality and realism advancing rapidly. These advances have led to a passionate scientific debate: Do video models learn "world models" that discover laws of physics -- or, alternatively, are they merely sophisticated pixel predictors that achieve visual realism without understanding the physical principles of reality? We address this question by developing Physics-IQ, a comprehensive benchmark dataset that can only be solved by acquiring a deep understanding of various physical principles, like fluid dynamics, optics, solid mechanics, magnetism and thermodynamics. We find that across a range of current models (Sora, Runway, Pika, Lumiere, Stable Video Diffusion, and VideoPoet), physical understanding is severely limited, and unrelated to visual realism. At the same time, some test cases can already be successfully solved. This indicates that acquiring certain physical principles from observation alone may be possible, but significant challenges remain. While we expect rapid advances ahead, our work demonstrates that visual realism does not imply physical understanding. Our project page is at https://physics-iq.github.io; code at https://github.com/google-deepmind/physics-IQ-benchmark.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Introduces VICIS task and training framework for inferring visual concepts from image sets, with experiments showing better accuracy, diversity, and generalization than standard VLMs on synthetic and ImageNet data.
PhysEditWorld is a new dataset of over 60 million frames from 12 UE5 cinematic scenes with synchronized multimodal signals and explicit gravity labels, built via replay to support physics-editable world models.
FlatSounds benchmark shows state-of-the-art V2A models rely more on text captions than visual input for physical and semantic accuracy, with captions improving correctness but degrading temporal alignment.
CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.
Aero-World adapts a pretrained latent diffusion transformer for action-conditioned aerial video generation by injecting inertial action tokens and using a frozen latent-space Physics Probe for inertial consistency supervision during LoRA finetuning, with a new AeroBench benchmark showing improved AA
MechVerse benchmark shows current video generation models preserve appearance but fail at mechanically admissible motion, with errors rising as coupling complexity increases.
OSCBench demonstrates that text-to-video models produce inaccurate and temporally inconsistent object state changes, with performance dropping sharply on novel and compositional action scenarios.
BasPhyCo is the first physical commonsense reasoning dataset for Basque and dialects, showing LLMs have limited performance on verifiability tasks especially with dialects.
DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperation dataset.
StreamForce presents a unified causal model for force-controllable streaming video generation using a new force representation and distillation pipeline, claiming SOTA force adherence and 16.6 FPS performance.
NewtPhys dataset with real-scene physics annotations reveals limitations in low-level Newtonian reasoning across 56 VLMs and 10 VFMs.
NEWTON improves physical accuracy in video generation by deploying a trainable planner that coordinates physics-aware tools and a verifier, raising joint accuracy on VideoPhy-2 without altering the base generators.
PH-Dreamer integrates a port-Hamiltonian framework into generative world models to enforce physical priors, yielding tighter imagined-real reward alignment and reduced latent space volume on visual control benchmarks.
A 64x temporally compressed motion embedding learned from trackers enables efficient conditional flow-matching generation of long-term motions that outperform video models and task-specific methods.
ProPhy adds explicit physics-aware conditioning via semantic and refinement experts plus VLM knowledge transfer to produce more physically coherent dynamic videos than prior methods.
A training-free framework uses physics-violating counterfactual prompts and Synchronized Decoupled Guidance to suppress implausible motions in diffusion-based video generation while preserving photorealism.
Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.
RIGVid shows that filtered AI-generated videos can serve as effective supervision for complex robotic manipulation tasks without any real demonstrations.
FSDrive uses a generated future scene frame as visual spatio-temporal CoT to improve VLA models for safer autonomous driving trajectory prediction.
MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
Tempered Self-similarity Alignment transfers relational structure from foundation-model STSS into video generators via probabilistic correspondence alignment, yielding reported gains in physical plausibility on VideoPhy benchmarks.
Phantom jointly models visual content and latent physical dynamics via a physics-aware video representation to generate physically consistent videos.
MASS adds spatiotemporal motion signals and 3D grounding to VLMs and releases MASS-Bench, yielding physics-reasoning performance within 2% of Gemini-2.5-Flash after reinforcement fine-tuning.
citing papers explorer
-
PhysInOne: Visual Physics Learning and Reasoning in One Suite
PhysInOne is a new dataset of 2 million videos across 153,810 dynamic 3D scenes covering 71 physical phenomena, shown to improve AI performance on physics-aware video generation, prediction, property estimation, and motion transfer.
-
Show Me Examples: Inferring Visual Concepts from Image Sets
Introduces VICIS task and training framework for inferring visual concepts from image sets, with experiments showing better accuracy, diversity, and generalization than standard VLMs on synthetic and ImageNet data.
-
PhysEditWorld: A Large-Scale Dataset Toward Physics-Editable World Models
PhysEditWorld is a new dataset of over 60 million frames from 12 UE5 cinematic scenes with synchronized multimodal signals and explicit gravity labels, built via replay to support physics-editable world models.
-
Benchmarking Single-Factor Physical Video-to-Audio Generation
FlatSounds benchmark shows state-of-the-art V2A models rely more on text captions than visual input for physical and semantic accuracy, with captions improving correctness but degrading temporal alignment.
-
CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models
CRONOS benchmark shows recent open-source video generators fail to preserve physical consistency under controlled changes to viewpoint, scene, object category, and appearance.
-
Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls
Aero-World adapts a pretrained latent diffusion transformer for action-conditioned aerial video generation by injecting inertial action tokens and using a frozen latent-space Physics Probe for inertial consistency supervision during LoRA finetuning, with a new AeroBench benchmark showing improved AA
-
MechVerse: Evaluating Physical Motion Consistency in Video Generation Models
MechVerse benchmark shows current video generation models preserve appearance but fail at mechanically admissible motion, with errors rising as coupling complexity increases.
-
OSCBench: Benchmarking Object State Change in Text-to-Video Generation
OSCBench demonstrates that text-to-video models produce inaccurate and temporally inconsistent object state changes, with performance dropping sharply on novel and compositional action scenarios.
-
Physical Commonsense Reasoning for Lower-Resourced Languages and Dialects: a Study on Basque
BasPhyCo is the first physical commonsense reasoning dataset for Basque and dialects, showing LLMs have limited performance on verifiability tasks especially with dialects.
-
DreamGen: Unlocking Generalization in Robot Learning through Video World Models
DreamGen trains robot policies on synthetic trajectories from adapted video world models, enabling a humanoid robot to perform 22 new behaviors in seen and unseen environments from a single pick-and-place teleoperation dataset.
-
Streaming Video Generation with Streaming Force Control
StreamForce presents a unified causal model for force-controllable streaming video generation using a new force representation and distillation pipeline, claiming SOTA force adherence and 16.6 FPS performance.
-
NewtPhys: Do Foundation Models Understand Newtonian Physics?
NewtPhys dataset with real-scene physics annotations reveals limitations in low-level Newtonian reasoning across 56 VLMs and 10 VFMs.
-
NEWTON: Agentic Planning for Physically Grounded Video Generation
NEWTON improves physical accuracy in video generation by deploying a trainable planner that coordinates physics-aware tools and a verifier, raising joint accuracy on VideoPhy-2 without altering the base generators.
-
PH-Dreamer: A Physics-Driven World Model via Port-Hamiltonian Generative Dynamics
PH-Dreamer integrates a port-Hamiltonian framework into generative world models to enforce physical priors, yielding tighter imagined-real reward alignment and reduced latent space volume on visual control benchmarks.
-
Learning Long-term Motion Embeddings for Efficient Kinematics Generation
A 64x temporally compressed motion embedding learned from trackers enables efficient conditional flow-matching generation of long-term motions that outperform video models and task-specific methods.
-
ProPhy: Progressive Physical Alignment for Dynamic World Simulation
ProPhy adds explicit physics-aware conditioning via semantic and refinement experts plus VLM knowledge transfer to produce more physically coherent dynamic videos than prior methods.
-
Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility
A training-free framework uses physics-violating counterfactual prompts and Synchronized Decoupled Guidance to suppress implausible motions in diffusion-based video generation while preserving photorealism.
-
Video models are zero-shot learners and reasoners
Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.
-
Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations
RIGVid shows that filtered AI-generated videos can serve as effective supervision for complex robotic manipulation tasks without any real demonstrations.
-
FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving
FSDrive uses a generated future scene frame as visual spatio-temporal CoT to improve VLA models for safer autonomous driving trajectory prediction.
-
MAGI-1: Autoregressive Video Generation at Scale
MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
-
Tempered Self-Similarity Alignment for Physically Plausible Video Generation
Tempered Self-similarity Alignment transfers relational structure from foundation-model STSS into video generators via probabilistic correspondence alignment, yielding reported gains in physical plausibility on VideoPhy benchmarks.
-
Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics
Phantom jointly models visual content and latent physical dynamics via a physics-aware video representation to generate physically consistent videos.
-
MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models
MASS adds spatiotemporal motion signals and 3D grounding to VLMs and releases MASS-Bench, yielding physics-reasoning performance within 2% of Gemini-2.5-Flash after reinforcement fine-tuning.
-
DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving
DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.
-
World Simulation with Video Foundation Models for Physical AI
Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.
-
Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI
A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-grounded world models.
-
Evolution of Video Generative Foundations
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.