pith. sign in

arxiv: 2607.00047 · v1 · pith:52XLFEHJnew · submitted 2026-06-29 · 💻 cs.MM · cs.CV· cs.GR

Vertigo Vertigo: Reconstructing a Cinematic Ideal through its Predictive AI Double

Pith reviewed 2026-07-02 20:24 UTC · model grok-4.3

classification 💻 cs.MM cs.CVcs.GR
keywords Vertigo reconstructionvideo diffusion modelkeyframe interpolationcinematic normsgenerative mediaHitchcockAI authenticitylatent priors
0
0 comments X

The pith

A video diffusion model reconstructs Vertigo from 2.78% keyframes, producing 73.1% recognizable frames.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reconstructs Hitchcock's Vertigo scene-by-scene by interpolating between a sparse set of 2.78% keyframe anchors using a large video diffusion model. It reports that 73.1% of the resulting frames read as plausible versions of the original film while only 3.6% fail outright. These numbers are offered as evidence that the normative conventions of classical cinema are already compressed inside the model's latent priors. The reconstruction is shown as an unstable overlay that extends the film's own theme of obsessive ideal-making, leading to the claim that generative media accelerates rather than replaces cinema's logic of desire and false authenticity.

Core claim

Using only 2.78% of the original frames as keyframe anchors, first-last frame interpolation via a large video diffusion model generates the intervening sequences of Vertigo. Computational analysis shows 73.1% of frames are recognizable as plausible renditions of the film and only 3.6% fail catastrophically. This fidelity indicates that cinematic norms are deeply compressed within the model's latent priors. The reconstruction is rendered as an unstable overlay between the original and its predictive shadow, extending the film's theme of obsessive reconstruction of an artificial ideal to the film itself and arguing that generative media accelerates cinema's logic of desire and false authentici

What carries the argument

Sparse keyframe-based first-last frame interpolation with a video diffusion model that predicts intervening sequences from the 2.78% anchors.

If this is right

  • Cinematic norms of classical Hollywood are encoded in the latent space of large video diffusion models.
  • The generated artifact creates an unstable overlay that induces doubt about authenticity in viewers.
  • Generative media represents an acceleration of cinema's logic of desire and false authenticity rather than a paradigm shift.
  • The method can serve as a probe for normative conventions encoded in generative systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applying the same sparse-keyframe method to films with unconventional editing could reveal which conventions are most strongly represented in current models.
  • The perceptual doubt created by the overlay may parallel viewer experiences with deepfakes and other AI-generated media.
  • Measuring fidelity across a range of films could map the degree to which film grammar has been compressed in existing diffusion systems.

Load-bearing premise

High recognizability of the interpolated frames demonstrates that the model has internalized specific cinematic norms rather than simply generating generic plausible motion.

What would settle it

Re-running the interpolation on the same keyframes with a diffusion model trained only on non-classical footage and measuring whether the recognizable-frame rate falls below 50%.

Figures

Figures reproduced from arXiv: 2607.00047 by Adam Cole, Mick Grierson.

Figure 1
Figure 1. Figure 1: Vertigo Vertigo (2026): an AI reconstruction of Hitchcock’s Vertigo built from 2.78% of the original film’s frames, here showing the original overlaid with its predictive shadow. Abstract Vertigo Vertigo is a scene-for-scene AI reconstruction of Hitchcock’s Vertigo (1958), generated from only 2.78% of the original film’s frames. Using this sparse set of keyframe anchors, we perform first￾last frame interpo… view at source ↗
Figure 2
Figure 2. Figure 2: Computational Analysis Suite. Left: Composite divergence across all 122,668 frames of [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Representative frames for each tier of reconstruction quality using frame-wise pixel difference, where darker regions [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Frame-level reconstruction quality across all frames [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative failure in emotional rendering. In this [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative success in high-fidelity reconstruction [PITH_FULL_IMAGE:figures/full_fig_p005_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Representative frames from across the artwork, where the original [PITH_FULL_IMAGE:figures/full_fig_p006_7.png] view at source ↗
read the original abstract

Vertigo Vertigo is a scene-for-scene AI reconstruction of Hitchcock's Vertigo (1958), generated from only 2.78% of the original film's frames. Using this sparse set of keyframe anchors, we perform first-last frame interpolation via a large video diffusion model to predict the intervening sequences. Vertigo is itself a film about the obsessive reconstruction of an artificial ideal; Vertigo Vertigo extends this logic to the material of the film, treating the canonical text as a probe for the normative conventions of classical cinema encoded within generative systems. Evaluated through computational analysis and critical feedback from media theorists (Lev Manovich, Shane Denson, Kevin L. Ferguson), the artifact demonstrates remarkable structural fidelity: 73.1% of frames are recognizable as plausible renditions of Vertigo and only 3.6% fail catastrophically. This fidelity suggests that cinematic norms are deeply compressed within the model's latent priors. Aesthetically, the reconstruction is rendered as an unstable overlay between the original film and its predictive shadow, fueling a persistent doubt in the viewer's perception of authenticity -- a 21st-century vertigo. The work argues that generative media is not a paradigm shift from cinema but an acceleration of its logic of desire and false authenticity, extending from classical Hollywood through to the predictive media environments now reshaping contemporary perception.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper presents 'Vertigo Vertigo,' a scene-for-scene AI reconstruction of Hitchcock's 1958 film Vertigo generated via first-last frame interpolation from a sparse 2.78% keyframe set using a large video diffusion model. It reports 73.1% of frames as recognizable as plausible renditions of the original and 3.6% as catastrophic failures, interpreting these figures as evidence that cinematic norms are deeply compressed in the model's latent priors; the work is framed as both a technical artifact and a critical extension of the film's themes of obsessive reconstruction and authenticity into generative media.

Significance. If the quantitative claims were supported by a documented protocol and controls, the project could contribute to multimedia research on how diffusion models encode filmic conventions and to interdisciplinary discussions between generative AI and media theory. The artistic framing and cited feedback from theorists add a novel angle, but the absence of methodological detail currently prevents assessing whether the result holds or generalizes.

major comments (3)
  1. [Abstract] Abstract: the reported figures of 73.1% recognizable frames and 3.6% catastrophic failures are presented without any description of the computational analysis method, definition of 'recognizable' or 'catastrophic,' evaluation protocol, or inter-rater reliability measures, rendering the central fidelity claim unevaluable.
  2. [Abstract] Abstract: the interpretive leap that the observed fidelity demonstrates compression of Vertigo-specific cinematic norms in latent priors is unsupported by any ablation, baseline (e.g., random keyframes or cross-film conditioning), or external validation, leaving open the possibility that results reflect only generic plausible-motion synthesis.
  3. [Abstract] Abstract: no details are given on how the 2.78% keyframe selection was performed or whether the interpolation procedure was tuned specifically to Vertigo, which is required to substantiate the claim that the reconstruction probes normative conventions rather than model defaults.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments highlight important gaps in methodological transparency that we will address through revision. Below we respond point by point to the three major comments.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the reported figures of 73.1% recognizable frames and 3.6% catastrophic failures are presented without any description of the computational analysis method, definition of 'recognizable' or 'catastrophic,' evaluation protocol, or inter-rater reliability measures, rendering the central fidelity claim unevaluable.

    Authors: We agree that the abstract omits these details. The full manuscript contains an evaluation section, but it does not sufficiently foreground the protocol. In revision we will expand both the abstract and methods to define 'recognizable' (frames preserving core visual motifs, character positions, and scene composition as judged by automated perceptual metrics plus theorist review) and 'catastrophic' (frames with major structural collapse or identity loss), describe the analysis pipeline, and report any inter-rater statistics. revision: yes

  2. Referee: [Abstract] Abstract: the interpretive leap that the observed fidelity demonstrates compression of Vertigo-specific cinematic norms in latent priors is unsupported by any ablation, baseline (e.g., random keyframes or cross-film conditioning), or external validation, leaving open the possibility that results reflect only generic plausible-motion synthesis.

    Authors: The interpretation rests on the combination of quantitative fidelity and direct commentary from the cited media theorists who viewed the reconstruction and identified its adherence to classical cinematic conventions. We nevertheless accept that ablations would strengthen the claim. In revision we will add an explicit limitations subsection that discusses alternative explanations (generic motion synthesis) and outlines planned baseline experiments, while retaining the current framing as a theoretically motivated probe rather than a controlled causal demonstration. revision: partial

  3. Referee: [Abstract] Abstract: no details are given on how the 2.78% keyframe selection was performed or whether the interpolation procedure was tuned specifically to Vertigo, which is required to substantiate the claim that the reconstruction probes normative conventions rather than model defaults.

    Authors: We will revise the methods section to specify the keyframe selection protocol (uniform sampling of one keyframe per shot boundary plus additional frames at major scene transitions, yielding the reported 2.78 % density) and to state that the diffusion model was used in its publicly released form with no Vertigo-specific fine-tuning or prompt engineering beyond the first-last frame conditioning. revision: yes

Circularity Check

1 steps flagged

Fidelity of model-generated frames cited as evidence that the model internalized cinematic norms

specific steps
  1. fitted input called prediction [Abstract]
    "Using this sparse set of keyframe anchors, we perform first-last frame interpolation via a large video diffusion model to predict the intervening sequences. [...] the artifact demonstrates remarkable structural fidelity: 73.1% of frames are recognizable as plausible renditions of Vertigo and only 3.6% fail catastrophically. This fidelity suggests that cinematic norms are deeply compressed within the model's latent priors."

    The 73.1% figure is obtained by running the diffusion model on the chosen keyframes; that same figure is then invoked as evidence that the model has internalized the relevant norms. Without a control condition (random keyframes, cross-film anchors, or non-Vertigo footage) the rate cannot distinguish specific compression from generic interpolation capability.

full rationale

The paper generates the reconstruction via diffusion interpolation from 2.78% keyframes, then treats the resulting 73.1% recognizability rate as direct proof that 'cinematic norms are deeply compressed within the model's latent priors.' No baseline, ablation, or external validation is described that would separate Vertigo-specific capture from the model's generic ability to synthesize plausible motion. This makes the interpretive claim reduce to the output of the same process it seeks to explain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on the unstated assumption that recognizability metrics from a diffusion model can serve as evidence for cultural encoding without additional controls; no free parameters or invented entities are explicitly listed because the text is abstract-only.

axioms (1)
  • domain assumption The output of a large video diffusion model on sparse film keyframes can be treated as a probe for normative cinematic conventions encoded in its training data.
    Invoked in the abstract when moving from fidelity percentages to the claim about latent priors.

pith-pipeline@v0.9.1-grok · 5771 in / 1333 out tokens · 21120 ms · 2026-07-02T20:24:39.808419+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

18 extracted references · 6 canonical work pages · 1 internal anchor

  1. [1]

    British Film Institute. 2022. The Greatest Films of All Time. https://www.bfi.org.uk/sight-and-sound/polls/greatest-films-all-time

  2. [2]

    Terence Broad and Mick Grierson. 2017. Autoencoding Blade Runner: Recon- structing Films with Artificial Neural Networks. InACM SIGGRAPH 2017 Art Gallery (SIGGRAPH ’17). Association for Computing Machinery, New York, NY, USA, 376–383. doi:10.1145/3072940.3072964

  3. [3]

    2024.Cinema and Machine Vision

    Daniel Chávez Heras. 2024.Cinema and Machine Vision. Edinburgh University Press

  4. [4]

    2020.Discorrelated Images

    Shane Denson. 2020.Discorrelated Images. Duke University Press

  5. [5]

    Ferguson

    Kevin L. Ferguson. 2015. Volumetric Cinema.[in]Transition2, 1 (March 2015). doi:10.16995/intransition.11331

  6. [6]

    Google. 2024. State-of-the-Art Video and Image Generation with Veo 2 and Imagen 3

  7. [7]

    Douglas Gordon. 1993. 24 Hour Psycho

  8. [8]

    Grégory Chatonsky. 2007. Vertigo@home

  9. [9]

    Alfred Hitchcock. 1958. Vertigo

  10. [10]

    Rachel Maclean. 2024. DUCK

  11. [11]

    2001.The Language of New Media

    Lev Manovich. 2001.The Language of New Media. MIT Press

  12. [12]

    Chris Marker. 1983. Sans Soleil. Vertigo Vertigo: Reconstructing a Cinematic Ideal through its Predictive AI Double

  13. [13]

    Laura Mulvey. 1975. Visual Pleasure and Narrative Cinema.Screen16, 3 (Oct. 1975), 6–18. doi:10.1093/screen/16.3.6

  14. [14]

    Gus Van Sant. 1998. Psycho

  15. [15]

    WanTeam, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, Ti...

  16. [16]

    Asuka Yamazaki. 2024. Digital Replicas and Democracy: Issues Raised by the Hollywood Actors’ Strike.Humanities and Social Sciences Communications11, 1 (Dec. 2024), 1661. doi:10.1057/s41599-024-04204-w

  17. [17]

    1970.Expanded Cinema

    Gene Youngblood. 1970.Expanded Cinema. Dutton

  18. [18]

    Ruihan Zhang, Borou Yu, Jiajian Min, Yetong Xin, Zheng Wei, Juncheng Nemo Shi, Mingzhen Huang, Xianghao Kong, Nix Liu Xin, Shanshan Jiang, Praagya Bahuguna, Mark Chan, Khushi Hora, Lijian Yang, Yongqi Liang, Runhe Bian, Yunlei Liu, Isabela Campillo Valencia, Patricia Morales Tredinick, Ilia Kozlov, Sijia Jiang, Peiwen Huang, Na Chen, Xuanxuan Liu, and Any...