A global predicted-fMRI drive signal from TRIBE does not predict YouTube replay heatmaps
Pith reviewed 2026-07-03 19:17 UTC · model grok-4.3
The pith
A top brain-encoding model's predicted fMRI signals show no link to YouTube replay heatmaps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The global field power derived from TRIBE-predicted cortical responses across six networks shows no evidence of predicting YouTube most-replayed heatmaps, with a pooled position-controlled partial correlation of +0.058 that is statistically indistinguishable from zero and not above baseline measures.
What carries the argument
The global field power, a per-second scalar curve obtained by reducing the model's predicted cortical response volume.
If this is right
- The null result is not explained by autocorrelation and holds under permutation testing.
- Moderate correlations seen only in music videos trace to an onset artifact rather than true content prediction.
- The predicted signal performs no better than simple loudness or motion features.
- The finding is consistent across six different cortical-network readouts.
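The autocorrelation-preserving permutation test mentioned above can be illustrated with a circular-shift null. This is a minimal sketch under stated assumptions, not the authors' released code: the function names, the number of permutations, and the minimum shift are all hypothetical choices made for illustration.

```python
import random
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / sqrt(va * vb)

def circular_shift_pvalue(x, y, n_perm=1000, min_shift=5, seed=0):
    """Two-sided p-value for corr(x, y) against circularly shifted copies of x.

    Assumption (hypothetical): a circular shift is used as the null because it
    keeps x's autocorrelation while breaking its alignment with y.
    """
    x, y = list(x), list(y)
    rng = random.Random(seed)
    observed = pearson(x, y)
    exceed = 0
    for _ in range(n_perm):
        s = rng.randrange(min_shift, len(x) - min_shift)
        shifted = x[-s:] + x[:-s]  # rotate x by s samples
        if abs(pearson(shifted, y)) >= abs(observed):
            exceed += 1
    # Add-one correction so the p-value is never exactly zero.
    return observed, (1 + exceed) / (1 + n_perm)
```

A per-video call would pass the predicted curve as `x` and the replay heatmap as `y`, both sampled once per second. A smooth signal produces a wider null distribution than shuffling would, which is the reason for shifting rather than shuffling.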
Where Pith is reading between the lines
- If replay heatmaps truly index engagement, models may need finer-grained or behavior-tuned readouts instead of a single global drive signal.
- The result raises the possibility that current encoding models capture sensory processing but miss higher-order factors that drive repeated viewing.
- The released acquisition method, which works despite YouTube's SABR-only streaming, could enable similar tests with other behavioral proxies.
Load-bearing premise
YouTube most-replayed heatmaps serve as a valid proxy for content-driven behavioral engagement that cortical responses should be able to predict.
What would settle it
A statistically significant positive correlation between the global field power and replay heatmaps on an independent set of videos that survives the same position-controlled and permutation controls.
Original abstract
Deep multimodal brain-encoding models now predict fMRI responses to naturalistic video with high accuracy. Whether their predicted neural signals also forecast behavioral engagement is unknown. We run TRIBE, the winning model of the 2025 Algonauts brain-encoding challenge (Llama-3.2 + V-JEPA2 + Wav2Vec-BERT), on 48 YouTube videos and reduce its predicted cortical response to a per-second engagement curve, the global field power. Correlated against each video's "most replayed" heatmap, a passively-collected proxy for which moments viewers return to, the curve shows no evidence of predicting re-watch behavior. The pooled position-controlled partial correlation is +0.058 (95% CI [-0.04, 0.15]; one-sample t(47)=1.21, p=0.23), indistinguishable from zero and not significantly above simple loudness and motion baselines (loudness +0.04, paired p=0.74). The raw correlation is also near zero; the moderate values reported for music videos reflect a genre-specific intro/onset-replay artifact rather than content prediction, and do not generalize. The null holds across six cortical-network readouts and under an autocorrelation-preserving permutation test. We release the code, the video-ID manifest, and an acquisition method that works despite YouTube's SABR-only streaming.
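One way to read "position-controlled partial correlation" is sketched below. The abstract does not spell out the control, so this assumes a single covariate, normalized linear position within the video, regressed out of both curves before correlating the residuals. The pooling step is a plain one-sample t statistic over per-video values. All function names are hypothetical.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / sqrt(va * vb)

def residualize(v, pos):
    """Residuals of v after a least-squares fit on pos (with intercept)."""
    n = len(v)
    mv, mp = sum(v) / n, sum(pos) / n
    sxy = sum((p - mp) * (u - mv) for p, u in zip(pos, v))
    sxx = sum((p - mp) ** 2 for p in pos)
    slope = sxy / sxx
    return [u - (mv + slope * (p - mp)) for p, u in zip(pos, v)]

def position_controlled_r(signal, heatmap):
    """Partial correlation of signal and heatmap given linear position.

    Assumption: position is the only covariate and enters linearly.
    """
    n = len(signal)
    pos = [i / (n - 1) for i in range(n)]
    return pearson(residualize(signal, pos), residualize(heatmap, pos))

def one_sample_t(values):
    """Mean and t statistic (df = n - 1) for testing the mean against zero."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return mean, mean / (sd / sqrt(n))
```

The control matters because two curves that both drift across the video correlate strongly even when their moment-to-moment structure is unrelated or opposed. Residualizing removes the shared drift first.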
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports a null result: a global field power time series, derived from the fMRI responses that TRIBE (Llama-3.2 + V-JEPA2 + Wav2Vec-BERT) predicts for 48 YouTube videos, shows no significant correlation with the videos' 'most replayed' heatmaps. The key statistic is a pooled position-controlled partial correlation of +0.058 (95% CI [-0.04, 0.15]; t(47)=1.21, p=0.23), indistinguishable from zero and not superior to loudness (+0.04) or motion baselines. The null holds across six cortical-network readouts and under an autocorrelation-preserving permutation test. The authors attribute the moderate correlations in music videos to a genre-specific intro artifact, and they release the code, the video-ID manifest, and an acquisition workaround.
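As a quick consistency check on the reported statistics, the interval can be recovered from the mean and the t value alone. This assumes the confidence interval is a standard t-based interval computed from the same 48 per-video values; the critical value of roughly 2.01 for 47 degrees of freedom is taken from standard tables, and the implied standard deviation in the last line is a derived quantity, not one the paper reports.

```latex
\begin{align*}
\mathrm{SE} &= \frac{\bar r}{t} = \frac{0.058}{1.21} \approx 0.048,\\
\text{half-width} &= t_{0.975,\,47}\,\mathrm{SE} \approx 2.01 \times 0.048 \approx 0.096,\\
\text{95\% CI} &\approx [\,0.058 - 0.096,\; 0.058 + 0.096\,] = [-0.038,\; 0.154],\\
s &= \mathrm{SE}\,\sqrt{48} \approx 0.048 \times 6.93 \approx 0.33 .
\end{align*}
```

The result rounds to the reported [-0.04, 0.15], so the mean, t statistic, and interval are mutually consistent under that assumption.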
Significance. If the null result is robust, the finding is significant as a direct test of whether high-accuracy brain-encoding models' predicted cortical signals carry information about behavioral engagement (re-watch behavior). It provides a falsifiable negative outcome with explicit statistical controls and data release, which are strengths. The result would constrain claims that neural prediction accuracy implies downstream behavioral utility, particularly for passively collected engagement proxies.
major comments (2)
- [Abstract] Abstract (paragraph on replay heatmaps as proxy): The central null claim that the TRIBE-derived signal 'does not predict behavioral engagement' is load-bearing on the assumption that YouTube replay heatmaps are a valid, content-sensitive proxy whose variance should be captured by predicted cortical responses after position and low-level controls. No external validation (e.g., correlation with eye-tracking dwell time, self-reported engagement, or physiological measures after the same controls) is reported; if heatmap variance is dominated by recommendation algorithms or thumbnail effects, the null does not test the model's engagement prediction capacity.
- [Methods] Methods (video selection, preprocessing, and readout definitions): Full details on video selection criteria, exact computation of the global field power from TRIBE outputs, fMRI prediction preprocessing steps, and how the six network readouts are extracted are absent. This prevents evaluation of whether the null could arise from post-hoc choices, inadequate baseline matching, or video sampling biases, directly affecting the soundness of the central statistical claim.
minor comments (1)
- [Abstract] Abstract: The statement that 'the moderate values reported for music videos reflect a genre-specific intro/onset-replay artifact' would benefit from a quantitative breakdown (e.g., separate correlations for music vs. non-music videos) to support the generalization claim.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major comment below. We agree that expanding methodological details will improve the manuscript and will do so in revision. We also acknowledge the proxy limitation and will add explicit discussion of it.
Referee: [Abstract] Abstract (paragraph on replay heatmaps as proxy): The central null claim that the TRIBE-derived signal 'does not predict behavioral engagement' is load-bearing on the assumption that YouTube replay heatmaps are a valid, content-sensitive proxy whose variance should be captured by predicted cortical responses after position and low-level controls. No external validation (e.g., correlation with eye-tracking dwell time, self-reported engagement, or physiological measures after the same controls) is reported; if heatmap variance is dominated by recommendation algorithms or thumbnail effects, the null does not test the model's engagement prediction capacity.
Authors: We agree that replay heatmaps constitute an indirect proxy and that the study does not include external validation against eye-tracking, self-report, or physiological measures. The manuscript frames them as a standard passively collected engagement signal. The reported null (after position and low-level controls) remains informative even under this limitation, as it indicates the predicted signal explains no additional variance in the proxy. We will revise the abstract and add a limitations paragraph noting potential confounds from recommendation algorithms and thumbnails, without claiming stronger validation than the data support.
revision: partial
Referee: [Methods] Methods (video selection, preprocessing, and readout definitions): Full details on video selection criteria, exact computation of the global field power from TRIBE outputs, fMRI prediction preprocessing steps, and how the six network readouts are extracted are absent. This prevents evaluation of whether the null could arise from post-hoc choices, inadequate baseline matching, or video sampling biases, directly affecting the soundness of the central statistical claim.
Authors: We accept that the submitted version lacked sufficient methodological detail. The 48 videos were selected for genre diversity from publicly available sources used in the Algonauts challenge; global field power is computed as the per-second mean absolute value across predicted voxels; preprocessing follows standard fMRI pipelines, with the six readouts taken from the challenge-defined cortical networks. We will expand the Methods section with explicit selection criteria, the precise GFP formula, preprocessing steps, and readout extraction procedures to enable full evaluation and replication.
revision: yes
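The GFP definition given in the rebuttal (per-second mean absolute value across predicted voxels) can be sketched as follows. This is an illustration under assumptions, not the authors' code: the input layout `pred[t][v]`, the premise that predictions are already sampled once per second, and both function names are hypothetical. The second function shows the spatial standard deviation, which is the more common definition of global field power in the EEG literature, for comparison only.

```python
from statistics import pstdev

def gfp_mean_abs(pred):
    """Per-second mean absolute predicted response across voxels.

    pred[t][v] is the predicted response of voxel v at second t
    (assumed layout; one row per second).
    """
    return [sum(abs(v) for v in frame) / len(frame) for frame in pred]

def gfp_spatial_sd(pred):
    """Per-second standard deviation across voxels (comparison variant)."""
    return [pstdev(frame) for frame in pred]
```

For example, `gfp_mean_abs([[1.0, -1.0], [0.0, 2.0]])` yields one value per row. The two variants can diverge whenever the mean across voxels is far from zero, which is one reason the precise formula matters for replication.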
Circularity Check
No circularity: null result from direct external validation
The paper derives a global field power curve from the pre-trained TRIBE model applied to 48 YouTube videos and performs a direct statistical correlation against independently collected replay heatmaps. No parameters are fitted to the target heatmaps, no equations reduce the reported partial correlation to a self-defined quantity, and the null result (including controls and permutation tests) stands as an external test rather than a re-expression of inputs. The analysis is self-contained against the external benchmark.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: YouTube 'most replayed' heatmaps constitute a valid proxy for content-driven behavioral engagement that cortical signals should predict.