A global predicted-fMRI drive signal from TRIBE does not predict YouTube replay heatmaps

Barada Sahu; Shivesh Pandey

arxiv: 2607.01400 · v1 · pith:N6BF2J6Fnew · submitted 2026-07-01 · 💻 cs.SE · cs.LG · q-bio.NC

Pith reviewed 2026-07-03 19:17 UTC · model grok-4.3

classification 💻 cs.SE · cs.LG · q-bio.NC

keywords brain encoding · fMRI prediction · YouTube replay heatmaps · behavioral engagement · null result · multimodal models · global field power

The pith

A top brain-encoding model's predicted fMRI signals show no link to YouTube replay heatmaps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether predicted cortical responses from the TRIBE model, which won a 2025 brain-encoding challenge, can forecast which moments viewers re-watch on YouTube videos. It reduces the model's output to a single per-second engagement curve called global field power and correlates it against replay heatmaps from 48 videos. The result is a near-zero correlation that does not exceed simple audio loudness or motion baselines. A sympathetic reader would care because successful brain models are often assumed to capture the neural basis of real behavioral engagement, yet here that link fails to appear.

Core claim

The global field power derived from TRIBE-predicted cortical responses across six networks shows no evidence of predicting YouTube most-replayed heatmaps, with a pooled position-controlled partial correlation of +0.058 that is statistically indistinguishable from zero and not above baseline measures.
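
One way to read "position-controlled partial correlation" is as a first-order partial correlation between the predicted curve and the heatmap with temporal position held fixed. The sketch below assumes position enters as a single linear covariate (for example, the second index within the video); the paper's actual position model is not specified in this summary, and the function names and toy series are illustrative only.

```python
import math
from statistics import fmean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    sxy = sum(a * b for a, b in zip(dx, dy))
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z from both."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


# Toy series (not the paper's data): both curves share a linear trend over
# position, so the raw correlation is sizeable, but they share nothing beyond
# that trend, so the partial correlation vanishes.
position = [0, 1, 2, 3]
gfp_curve = [-2, -2, 0, 4]
heatmap = [-2, -4, 4, 2]
raw_r = pearson(gfp_curve, heatmap)
partial_r = partial_corr(gfp_curve, heatmap, position)
```

The toy case illustrates why the control matters: a shared drift across the video can produce a raw correlation that says nothing about moment-to-moment content.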

What carries the argument

The global field power, a per-second scalar curve obtained by reducing the model's predicted cortical response volume.

If this is right

The null result is not explained by autocorrelation and holds under permutation testing.
Moderate correlations seen only in music videos trace to an onset artifact rather than true content prediction.
The predicted signal performs no better than simple loudness or motion features.
The finding is consistent across six different cortical-network readouts.
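
The first point above refers to an autocorrelation-preserving permutation test. A minimal sketch of one common scheme, circular time-shifting, is given below; this summary does not say which surrogate method the authors used, so the shifting scheme, the function name `circular_shift_test`, and its parameters are assumptions made for illustration.

```python
import math
import random
from statistics import fmean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    sxy = sum(a * b for a, b in zip(dx, dy))
    return sxy / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def circular_shift_test(signal, target, n_perm=1000, min_shift=1, seed=0):
    """Two-sided p-value for corr(signal, target) against circular shifts.

    Rotating `signal` keeps its internal autocorrelation intact while
    breaking its alignment with `target`.
    """
    rng = random.Random(seed)
    n = len(signal)
    observed = pearson(signal, target)
    exceed = 0
    for _ in range(n_perm):
        # Draw a shift in [min_shift, n - min_shift] so the null never
        # includes the unshifted alignment.
        k = rng.randrange(min_shift, n - min_shift + 1)
        shifted = signal[k:] + signal[:k]
        if abs(pearson(shifted, target)) >= abs(observed):
            exceed += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)


# Toy check (not the paper's data): a smooth random walk tested against
# itself should come out as clearly significant.
walk_rng = random.Random(1)
walk, level = [], 0.0
for _ in range(120):
    level += walk_rng.gauss(0, 1)
    walk.append(level)
observed_r, p_value = circular_shift_test(walk, walk, n_perm=200)
```

A plain shuffle of time points would destroy the smoothness of both curves and understate the null spread; shifting avoids that, at the cost of a wrap-around discontinuity at the splice point.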

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If replay heatmaps truly index engagement, models may need finer-grained or behavior-tuned readouts instead of a single global drive signal.
The result raises the possibility that current encoding models capture sensory processing but miss higher-order factors that drive repeated viewing.
The released acquisition method, which works despite YouTube's platform restrictions, could enable similar tests with other behavioral proxies.

Load-bearing premise

YouTube most-replayed heatmaps serve as a valid proxy for content-driven behavioral engagement that cortical responses should be able to predict.

What would settle it

A statistically significant positive correlation between the global field power and replay heatmaps on an independent set of videos that survives the same position-controlled and permutation controls.

Figures

Figures reproduced from arXiv: 2607.01400 by Barada Sahu, Shivesh Pandey.

**Figure 1.** No content-level prediction of re-watch behavior. (a) Per-video raw and position-controlled correlations with most-replayed; the partial correlation (mean ± 95% CI) is centered on zero and the CI crosses it. (b) Pooled partial correlation: TRIBE is statistically indistinguishable from the loudness baseline and near zero. (c) Per-category partial correlations are small, sign-inconsistent, and dominated by n…

Original abstract

Deep multimodal brain-encoding models now predict fMRI responses to naturalistic video with high accuracy. Whether their predicted neural signals also forecast behavioral engagement is unknown. We run TRIBE, the winning model of the 2025 Algonauts brain-encoding challenge (Llama-3.2 + V-JEPA2 + Wav2Vec-BERT), on 48 YouTube videos and reduce its predicted cortical response to a per-second engagement curve, the global field power. Correlated against each video's "most replayed" heatmap, a passively-collected proxy for which moments viewers return to, the curve shows no evidence of predicting re-watch behavior. The pooled position-controlled partial correlation is +0.058 (95% CI [-0.04, 0.15]; one-sample t(47)=1.21, p=0.23), indistinguishable from zero and not significantly above simple loudness and motion baselines (loudness +0.04, paired p=0.74). The raw correlation is also near zero; the moderate values reported for music videos reflect a genre-specific intro/onset-replay artifact rather than content prediction, and do not generalize. The null holds across six cortical-network readouts and under an autocorrelation-preserving permutation test. We release the code, the video-ID manifest, and an acquisition method that works despite YouTube's SABR-only streaming.
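
The reported statistics can be checked against each other with a few lines of arithmetic. The sketch below assumes the confidence interval is a standard t-interval around the mean of the 48 per-video partial correlations, which the abstract does not state explicitly; the critical value is taken from a Student's t table rather than computed.

```python
import math

n_videos = 48
mean_r = 0.058   # pooled partial correlation, from the abstract
t_stat = 1.21    # one-sample t(47), from the abstract
t_crit = 2.012   # two-sided 95% critical value for 47 df (table value)

std_err = mean_r / t_stat                    # because t = mean / SE
ci_low = mean_r - t_crit * std_err
ci_high = mean_r + t_crit * std_err
implied_sd = std_err * math.sqrt(n_videos)   # spread of per-video correlations

print(f"SE = {std_err:.3f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}], SD = {implied_sd:.2f}")
```

Under that assumption the interval rounds to the reported [-0.04, 0.15], and the implied per-video standard deviation of roughly 0.33 indicates that individual videos vary far more than the pooled mean suggests.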

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper reports a clean null: TRIBE's predicted fMRI curve does not track YouTube replay heatmaps after position control.

The core finding is straightforward. Across 48 videos the position-controlled partial correlation between the TRIBE-derived global field power and YouTube most-replayed heatmaps sits at +0.058 (CI crosses zero), indistinguishable from loudness or motion baselines. The null survives six network readouts and an autocorrelation-preserving permutation test, and the authors flag a music-video onset artifact that explains earlier moderate correlations.

What the work actually adds is a direct, out-of-sample check of a 2025 Algonauts-winning multimodal model against an independently collected behavioral dataset. Releasing the video manifest and code is helpful for anyone who wants to rerun or extend the test.

The soft spot is the proxy itself. The claim that a null result narrows expectations for applied uses rests on replay heatmaps being a content-driven engagement signal that cortical responses should predict. The abstract gives no external validation against eye-tracking, self-report, or physiological measures once low-level features are controlled, so the result could reflect either model failure or heatmap variance driven by recommendation algorithms and thumbnails. Video selection criteria and exact preprocessing steps are also absent from the abstract, which leaves the usual questions about post-hoc choices.

This is for labs already working on brain-encoding transfer to behavior. A reader who cares about whether these models generalize beyond fMRI will get a useful data point, but only if the proxy assumption is accepted. The paper is honest about its null and does not overclaim, so it deserves a serious referee to examine the methods and the proxy discussion in full.

Referee Report

2 major / 1 minor

Summary. The paper reports a null result: a global field power time series derived from fMRI responses predicted by TRIBE (Llama-3.2 + V-JEPA2 + Wav2Vec-BERT) for 48 YouTube videos shows no significant correlation with the videos' YouTube 'most replayed' heatmaps. The key statistic is a pooled position-controlled partial correlation of +0.058 (95% CI [-0.04, 0.15]; t(47)=1.21, p=0.23), indistinguishable from zero and not superior to the loudness (+0.04) or motion baselines. The null survives six cortical-network readouts and an autocorrelation-preserving permutation test. The authors attribute the moderate correlations in music videos to a genre-specific intro artifact, and they release the code, the video-ID manifest, and an acquisition workaround.

Significance. If the null result is robust, the finding is significant as a direct test of whether high-accuracy brain-encoding models' predicted cortical signals carry information about behavioral engagement (re-watch behavior). It provides a falsifiable negative outcome with explicit statistical controls and data release, which are strengths. The result would constrain claims that neural prediction accuracy implies downstream behavioral utility, particularly for passively collected engagement proxies.

major comments (2)

[Abstract] Abstract (paragraph on replay heatmaps as proxy): The central null claim that the TRIBE-derived signal 'does not predict behavioral engagement' is load-bearing on the assumption that YouTube replay heatmaps are a valid, content-sensitive proxy whose variance should be captured by predicted cortical responses after position and low-level controls. No external validation (e.g., correlation with eye-tracking dwell time, self-reported engagement, or physiological measures after the same controls) is reported; if heatmap variance is dominated by recommendation algorithms or thumbnail effects, the null does not test the model's engagement prediction capacity.
[Methods] Methods (video selection, preprocessing, and readout definitions): Full details on video selection criteria, exact computation of the global field power from TRIBE outputs, fMRI prediction preprocessing steps, and how the six network readouts are extracted are absent. This prevents evaluation of whether the null could arise from post-hoc choices, inadequate baseline matching, or video sampling biases, directly affecting the soundness of the central statistical claim.

minor comments (1)

[Abstract] Abstract: The statement that 'the moderate values reported for music videos reflect a genre-specific intro/onset-replay artifact' would benefit from a quantitative breakdown (e.g., separate correlations for music vs. non-music videos) to support the generalization claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below. We agree that expanding methodological details will improve the manuscript and will do so in revision. We also acknowledge the proxy limitation and will add explicit discussion of it.

Referee: [Abstract] Abstract (paragraph on replay heatmaps as proxy): The central null claim that the TRIBE-derived signal 'does not predict behavioral engagement' is load-bearing on the assumption that YouTube replay heatmaps are a valid, content-sensitive proxy whose variance should be captured by predicted cortical responses after position and low-level controls. No external validation (e.g., correlation with eye-tracking dwell time, self-reported engagement, or physiological measures after the same controls) is reported; if heatmap variance is dominated by recommendation algorithms or thumbnail effects, the null does not test the model's engagement prediction capacity.

Authors: We agree that replay heatmaps constitute an indirect proxy and that the study does not include external validation against eye-tracking, self-report, or physiological measures. The manuscript frames them as a standard passively collected engagement signal. The reported null (after position and low-level controls) remains informative even under this limitation, as it indicates the predicted signal explains no additional variance in the proxy. We will revise the abstract and add a limitations paragraph noting potential confounds from recommendation algorithms and thumbnails, without claiming stronger validation than the data support. revision: partial
Referee: [Methods] Methods (video selection, preprocessing, and readout definitions): Full details on video selection criteria, exact computation of the global field power from TRIBE outputs, fMRI prediction preprocessing steps, and how the six network readouts are extracted are absent. This prevents evaluation of whether the null could arise from post-hoc choices, inadequate baseline matching, or video sampling biases, directly affecting the soundness of the central statistical claim.

Authors: We accept that the submitted version lacked sufficient methodological detail. The 48 videos were selected for genre diversity from publicly available sources used in the Algonauts challenge; global field power is computed as the per-second mean absolute value across predicted voxels; preprocessing follows standard fMRI pipelines, with the six readouts taken from the challenge-defined cortical networks. We will expand the Methods section with explicit selection criteria, the precise GFP formula, preprocessing steps, and readout extraction procedures to enable full evaluation and replication. revision: yes
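
The GFP definition given in the simulated response above (per-second mean absolute value across predicted voxels) can be sketched as follows. This follows the wording of the simulated rebuttal, not a formula confirmed by the paper; the function name and the `[seconds][voxels]` input layout are assumptions.

```python
from statistics import fmean


def global_field_power(predicted):
    """Collapse a [seconds][voxels] array into one value per second.

    Uses the mean absolute predicted response across voxels, as described
    in the simulated rebuttal (an assumption, not the paper's stated formula).
    """
    return [fmean([abs(v) for v in frame]) for frame in predicted]


# Toy input (not model output): three seconds, four voxels each.
predicted = [
    [1.0, -1.0, 2.0, -2.0],
    [0.0, 0.0, 0.0, 0.0],
    [3.0, -3.0, 3.0, -3.0],
]
curve = global_field_power(predicted)
```

Note that in the EEG literature global field power conventionally denotes the spatial standard deviation across channels, so a mean-absolute-value readout is a related but distinct reduction; which one the paper uses would need to be confirmed from its Methods.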

Circularity Check

0 steps flagged

No circularity: null result from direct external validation

The paper derives a global field power curve from the pre-trained TRIBE model applied to 48 YouTube videos and performs a direct statistical correlation against independently collected replay heatmaps. No parameters are fitted to the target heatmaps, no equations reduce the reported partial correlation to a self-defined quantity, and the null result (including controls and permutation tests) stands as an external test rather than a re-expression of inputs. The analysis is self-contained against the external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields minimal ledger entries; the central test rests on one domain assumption about the validity of replay heatmaps as engagement proxy and on standard statistical assumptions for partial correlation and permutation testing.

axioms (1)

domain assumption: YouTube 'most replayed' heatmaps constitute a valid proxy for content-driven behavioral engagement that cortical signals should predict.
Invoked when the authors treat lack of correlation as evidence that the model does not predict re-watch behavior.

Reference graph

Works this paper leans on

7 extracted references · 7 canonical work pages · 2 internal anchors

[1] A. Bardes, Q. Garrido, J. Ponce, X. Chen, M. G. Rabbat, Y. LeCun, M. Assran, and N. Ballas. Revisiting feature prediction for learning visual representations from video. arXiv:2404.08471, 2024

[2] G. S. Berns and S. E. Moore. A neural predictor of cultural popularity. Journal of Consumer Psychology, 22(1):154–160, 2012

[3] S. d'Ascoli, J. Rapin, Y. Benchetrit, H. Banville, and J.-R. King. TRIBE: TRImodal Brain Encoder for whole-brain fMRI response prediction. arXiv:2507.22229, 2025

[4] J. P. Dmochowski, M. A. Bezdek, B. P. Abelson, J. S. Johnson, E. H. Schumacher, and L. C. Parra. Audience preferences are predicted by temporal reliability of neural processing. Nature Communications, 5:4567, 2014

[5] A. Genevsky, C. Yoon, and B. Knutson. When brain beats behavior: Neuroforecasting crowdfunding outcomes. Journal of Neuroscience, 37(36):8625–8634, 2017

[6] A. Grattafiori et al. The Llama 3 herd of models. arXiv:2407.21783, 2024

[7] U. Hasson, Y. Nir, I. Levy, G. Fuhrmann, and R. Malach. Intersubject synchronization of cortical activity during natural vision. Science, 303(5664):1634–1640, 2004