An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexander Kolesnikov; Alexey Dosovitskiy; Dirk Weissenborn; Georg Heigold; Jakob Uszkoreit; Lucas Beyer; Matthias Minderer; Mostafa Dehghani; Neil Houlsby; Sylvain Gelly

arxiv: 2010.11929 · v2 · pith:XMH7WNHXnew · submitted 2020-10-22 · 💻 cs.CV · cs.AI · cs.LG

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby

Pith reviewed 2026-05-24 14:20 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG

keywords vision transformer · image classification · transformer architecture · image patches · pre-training · transfer learning · convolutional networks

The pith

A pure transformer applied directly to sequences of image patches performs very well on image classification tasks after large-scale pre-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether convolutional networks are required for strong vision performance or if a standard transformer can handle images on its own. It splits each image into a sequence of fixed-size patches, embeds them linearly, and feeds the sequence into a transformer encoder exactly as text is processed. When the resulting model is pre-trained on large data and transferred, it reaches or exceeds the accuracy of leading convolutional networks on benchmarks such as ImageNet while using less training compute. A sympathetic reader would therefore conclude that the convolutional inductive biases long assumed necessary in vision are dispensable once data and capacity are sufficient.

Core claim

The Vision Transformer processes an image by dividing it into a grid of fixed-size patches (typically 16x16 pixels each), linearly projecting each flattened patch into an embedding, adding learnable position embeddings, and passing the resulting sequence through a standard transformer encoder. After pre-training on large datasets the model is fine-tuned on target tasks and attains excellent accuracy on ImageNet, CIFAR-100, VTAB and similar benchmarks while requiring substantially fewer computational resources to train than state-of-the-art convolutional networks.
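The patch-to-token step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the 224x224 input size, the embedding width of 768, and the random matrices standing in for learned weights are all assumptions, and the class token used for classification is left out for brevity.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into N flattened patches of length patch*patch*C."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image sides must be multiples of the patch size"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C) -> (N, patch*patch*C)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

def embed_patches(image, patch, proj, pos):
    """Linearly project flattened patches and add position embeddings."""
    tokens = patchify(image, patch) @ proj  # (N, D)
    return tokens + pos                     # (N, D)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))     # assumed input resolution
patch, dim = 16, 768                           # assumed patch size and width
num_patches = (224 // patch) ** 2
proj = rng.standard_normal((patch * patch * 3, dim)) * 0.02  # stand-in for learned weights
pos = rng.standard_normal((num_patches, dim)) * 0.02         # stand-in for learned positions
tokens = embed_patches(image, patch, proj, pos)
print(tokens.shape)
```

The resulting `tokens` array is the sequence a transformer encoder would consume; row `i` corresponds to the `i`-th patch in row-major order over the patch grid.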

What carries the argument

Vision Transformer (ViT): a standard transformer encoder applied to a sequence of linearly embedded image patches rather than to convolutional feature maps.

If this is right

ViT reaches or exceeds the accuracy of leading convolutional networks on ImageNet, CIFAR-100 and VTAB after large-scale pre-training.
The model trains with substantially lower computational cost than state-of-the-art CNNs while achieving comparable or better transfer performance.
Convolutional inductive biases are shown to be unnecessary once pre-training scale is large enough.
The same patch-sequence architecture transfers successfully to multiple mid-sized and small recognition benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same patch-to-sequence reduction could be tested on dense prediction tasks such as segmentation or detection to check whether the performance pattern holds beyond classification.
If the scaling behavior observed in language models also appears here, further increases in data and model size would be expected to widen the efficiency advantage over CNNs.
Alternative patch sizes or hierarchical token merging could be explored to reduce the quadratic cost of self-attention on high-resolution inputs.
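The trade-off between patch size and attention cost mentioned in the last point is simple arithmetic and can be checked directly. The helper names below are hypothetical, and the count of token pairs is only a proxy for attention cost (it ignores heads, width, and the MLP blocks).

```python
def num_tokens(height, width, patch):
    """Number of patch tokens for an image whose sides are multiples of `patch`."""
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    return (height // patch) * (width // patch)

def attention_pairs(height, width, patch):
    """Entries in one self-attention matrix: quadratic in the token count."""
    n = num_tokens(height, width, patch)
    return n * n

for patch in (32, 16, 8):
    n = num_tokens(224, 224, patch)
    print(f"patch={patch:2d}  tokens={n:4d}  attention entries={attention_pairs(224, 224, patch)}")
```

Halving the patch side quadruples the token count and multiplies the attention matrix size by 16; doubling the image side at a fixed patch size has the same effect.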

Load-bearing premise

Large amounts of pre-training data and model capacity can fully compensate for the absence of convolutional inductive biases such as locality and translation equivariance.

What would settle it

A controlled experiment in which a Vision Transformer, trained and transferred under the same large-scale regime, consistently underperforms matched convolutional networks across the reported mid-sized and small image-classification benchmarks.

Figures

Figures reproduced from arXiv: 2010.11929 by Alexander Kolesnikov, Alexey Dosovitskiy, Dirk Weissenborn, Georg Heigold, Jakob Uszkoreit, Lucas Beyer, Matthias Minderer, Mostafa Dehghani, Neil Houlsby, Sylvain Gelly, Thomas Unterthiner, Xiaohua Zhai.

**Figure 2.** Breakdown of VTAB performance in Natural, Specialized, and Structured task groups.

**Figure 3.** Transfer to ImageNet. While large ViT models perform worse than BiT ResNets (shaded area) when pre-trained on small datasets, they shine when pre-trained on larger datasets. Similarly, larger ViT variants overtake smaller ones as the dataset grows.

**Figure 5.** Performance versus pre-training compute for different architectures: Vision Transformers, …

**Figure 6.** Representative examples of attention from the output token to the input space. See Appendix D.7 for details.

**Figure 7.** Left: Filters of the initial linear embedding of RGB values of ViT-L/32. Center: Similarity of position embeddings of ViT-L/32. Tiles show the cosine similarity between the position embedding of the patch with the indicated row and column and the position embeddings of all other patches. Right: Size of attended area by head and network depth. Each dot shows the mean attention distance across images for on…

**Figure 8.** Scaling different model dimensions of the Vision Transformer.

**Figure 9.** Comparison of class-token and global average pooling classifiers. Both work similarly …

**Figure 10.** Position embeddings of models trained with different hyperparameters.

**Figure 11.** Size of attended area by head and network depth. Attention distance was computed for …

**Figure 12.** (left) shows how many images one core can handle per second, across various input sizes. Every single point refers to the peak performance measured across a wide range of batch-sizes. As can be seen, the theoretical bi-quadratic scaling of ViT with image size only barely starts happening for the largest models at the largest resolutions. Another quantity of interest is the largest batch-size each model ca…

**Figure 13.** Performance of Axial-Attention based models, in terms of top-1 accuracy on ImageNet …

**Figure 14.** Further example attention maps as in Figure 6 (random selection).

Original abstract

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

ViT shows a pure transformer on image patches matches CNNs on classification after large pre-training, with the main evidence coming from transfer results on public benchmarks.

The one thing to take away is that this paper demonstrates a transformer applied directly to sequences of image patches can reach competitive accuracy on ImageNet and other benchmarks once pre-trained at scale on something like JFT-300M, without any convolutional layers in the model itself. That is the core empirical result, and the transfer experiments back it up across model sizes and datasets. The ablations on patch size and the direct comparisons to BiT and ResNet baselines are useful and make the setup easy to understand. The numbers are reported consistently, and the claim that CNNs are not strictly necessary is tested head-on rather than assumed. The work is reproducible in principle because the architecture is simple and the evaluation follows standard transfer protocols. The main limitation is that the strong performance depends on access to hundreds of millions of pre-training images; smaller-scale runs show bigger gaps, so the story is really about data and model scale making up for the missing inductive biases. The linear patch embedding is a minor point of structure that the paper treats as straightforward preprocessing. Readers working on vision backbones or trying to build unified sequence models will find the experiments directly relevant. The paper is worth bringing to a reading group because the result is clean and the setup is falsifiable. It deserves peer review; the central finding is new relative to the hybrid designs cited in the abstract and the evidence is presented without circular fitting or hidden selection.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces Vision Transformer (ViT), a pure transformer model that tokenizes images into fixed-size patches (typically 16x16), linearly embeds them, and processes the sequence with standard transformer layers. When pre-trained on large-scale datasets such as JFT-300M and fine-tuned on ImageNet, CIFAR-100, VTAB and other benchmarks, ViT variants (Base, Large, Huge) match or exceed the accuracy of state-of-the-art CNNs while using substantially less training compute.

Significance. If the reported transfer results hold, the work is significant because it provides direct empirical evidence that convolutional inductive biases are not required for competitive image classification once sufficient pre-training data and model capacity are available. The systematic scaling experiments across model sizes and the comparison against BiT/ResNet baselines on public benchmarks constitute a clear falsifiable demonstration that patch-based tokenization plus self-attention can substitute for CNNs at scale.

minor comments (3)

[§3.1] The linear patch embedding is described only in prose; an explicit matrix equation showing the projection from flattened patch to D-dimensional token would improve reproducibility.
[Figure 3, Table 2] The pre-training compute axis is reported in TPUv3-days; adding a second panel or column with FLOPs per image would make the efficiency claim easier to compare across hardware.
[§4.2] The statement that ViT requires 'substantially fewer computational resources' is supported by the JFT-300M numbers but would be strengthened by an explicit wall-clock or energy comparison on the same hardware as the BiT baselines.
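For reference, the projection discussed in the first comment can be written compactly as follows. This is a sketch in the notation usually used for ViT, with symbols that do not appear in the text above treated as assumptions: an image of size H×W with C channels is cut into N = HW/P² patches of side P, each flattened patch x_p^i has length P²C, D is the token width, and a learnable class token x_class is prepended to the sequence.

```latex
\mathbf{z}_0 = \left[\mathbf{x}_{\text{class}};\ \mathbf{x}_p^{1}\mathbf{E};\ \mathbf{x}_p^{2}\mathbf{E};\ \dots;\ \mathbf{x}_p^{N}\mathbf{E}\right] + \mathbf{E}_{\text{pos}},
\qquad
\mathbf{E} \in \mathbb{R}^{(P^{2} C) \times D},
\quad
\mathbf{E}_{\text{pos}} \in \mathbb{R}^{(N+1) \times D}
```

Here E is the single learned projection shared by all patches and E_pos holds one learnable position embedding per token, including the class token.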

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of the Vision Transformer manuscript and the recommendation to accept.

Circularity Check

0 steps flagged

No circularity: empirical results on public benchmarks

The paper's central claim is an empirical demonstration that a pure transformer on image patches, pre-trained at scale, matches CNN performance on standard classification tasks after transfer. This is validated directly via experiments (ViT variants pre-trained on JFT-300M, fine-tuned on ImageNet/CIFAR-100/VTAB) with ablations and baselines; no derivation chain, equations, or fitted parameters reduce to the evaluation data by construction. The premise that CNN inductive biases are unnecessary is tested rather than smuggled in via self-definition or self-citation. The work is self-contained against external benchmarks with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The claim rests on the standard transformer self-attention definition from prior literature plus the modeling choice of fixed-size patch tokenization; no new physical or mathematical axioms are introduced.

free parameters (2)

patch size
16x16 chosen as the tokenization granularity; affects sequence length and local information retention.
model scale (base/large/huge)
Number of layers, hidden size, and heads are selected hyperparameters that determine capacity.
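To make the "model scale" parameter concrete, the sketch below estimates encoder weight counts from depth and width. The layer counts, hidden sizes, MLP sizes, and head counts are the commonly cited ViT-Base/Large/Huge settings and are stated here as assumptions, since the text above does not list them; the estimate ignores biases, layer norms, embeddings, and the classification head, so it slightly undercounts the full model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ViTConfig:
    layers: int   # number of encoder blocks
    hidden: int   # token width D
    mlp: int      # hidden width of the feed-forward block
    heads: int    # attention heads (does not change the weight count)

    def encoder_weights(self):
        """Approximate weight count: attention projections plus MLP, per layer."""
        attn = 4 * self.hidden * self.hidden  # Q, K, V and output projections
        mlp = 2 * self.hidden * self.mlp      # two dense layers
        return self.layers * (attn + mlp)

# Assumed settings for the three scales named above.
CONFIGS = {
    "base": ViTConfig(layers=12, hidden=768, mlp=3072, heads=12),
    "large": ViTConfig(layers=24, hidden=1024, mlp=4096, heads=16),
    "huge": ViTConfig(layers=32, hidden=1280, mlp=5120, heads=16),
}

for name, cfg in CONFIGS.items():
    print(f"{name:5s} ~{cfg.encoder_weights() / 1e6:.0f}M encoder weights")
```

Under these assumptions the encoder alone accounts for roughly 85M, 302M, and 629M weights respectively, which shows how quickly capacity grows with depth and width.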

axioms (1)

standard math · Self-attention as defined in the original Transformer paper
The encoder is imported from Vaswani et al. without modification to the core attention mechanism; position information is supplied by learnable position embeddings.
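As a reminder of what that imported mechanism computes, here is a minimal single-head scaled dot-product self-attention in NumPy. It is an illustrative sketch only: the token count, widths, and random weight matrices are assumptions, and multi-head splitting, residual connections, and layer normalization are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (N, N) pairwise token scores
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d, dh = 196, 64, 16                       # assumed token count and widths
x = rng.standard_normal((n, d))
wq, wk, wv = (rng.standard_normal((d, dh)) * d ** -0.5 for _ in range(3))
out, weights = self_attention(x, wq, wk, wv)
print(out.shape, weights.shape)
```

Every token attends to every other token, so the `weights` matrix has one entry per token pair; this is the source of the quadratic cost in sequence length.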

invented entities (1)

Linear patch embedding · no independent evidence
purpose: Projects flattened image patches into the transformer token space
New input representation required to feed images into the sequence model; no independent evidence outside the empirical results is provided.

pith-pipeline@v0.9.0 · 5708 in / 1372 out tokens · 33974 ms · 2026-05-24T14:20:39.704214+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DyABD: The Abdominal Muscle Segmentation in Dynamic MRI Benchmark
cs.CV 2026-04 conditional novelty 9.0

DyABD is the first benchmark dataset for abdominal muscle segmentation in dynamic MRIs featuring exercise-induced anatomical changes and pre/post-surgery scans, where existing models achieve an average Dice score of 0.82.
Unlocking the Visual Record of Materials Science: A Large-Scale Multimodal Dataset from Scientific Literature
cs.CV 2026-06 accept novelty 8.0

MatMMExtract pipeline creates MatSciFig dataset of 391k annotated materials science figure panels and MaterialScope detection dataset with high accuracy.
DataComp-VLM: Improved Open Datasets for Vision-Language Models
cs.CV 2026-06 conditional novelty 8.0

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
StreamKL: Fast and Memory-Efficient KL Divergence for Boosting Attention Distillation
cs.LG 2026-06 unverdicted novelty 8.0

StreamKL is the first fused GPU primitive for attention KL divergence that reduces memory from O(N_Q N_K) to O(1) via an online one-pass formulation and tile-wise recomputation.
Exposing Functional Fusion: A New Class of Strategic Backdoor in Dynamic Prompt Architectures
cs.CR 2026-05 unverdicted novelty 8.0

VIPER exposes Functional Fusion in dynamic prompt architectures, enabling a backdoor that resists pruning by tightly integrating attack and utility parameters in the same high-magnitude core.
iMiGUE-3K: A Large-Scale Benchmark for Micro-Gesture Analysis with Self-Supervised Learning
cs.CV 2026-05 unverdicted novelty 8.0

iMiGUE-3K is the largest in-the-wild micro-gesture video dataset with 3.4K clips and 37M frames from real interviews, supporting self-supervised foundation models and benchmarks that show micro-gestures improve emotio...
Privacy Auditing with Zero (0) Training Run
cs.CR 2026-05 unverdicted novelty 8.0

Zero-Run auditing supplies valid lower bounds on differential privacy parameters from fixed member and non-member datasets by modeling and correcting distribution-shift confounding via causal-inference techniques.
CheXTemporal: A Dataset for Temporally-Grounded Reasoning in Chest Radiography
cs.CV 2026-05 accept novelty 8.0

CheXTemporal supplies paired chest X-rays with explicit temporal progression taxonomy and spatial grounding to benchmark and improve models on longitudinal reasoning tasks.
Dissecting Jet-Tagger Through Mechanistic Interpretability
hep-ph 2026-05 accept novelty 8.0

A Particle Transformer jet tagger contains a sparse six-head circuit whose source-relay-readout structure recovers most performance and whose residual stream preferentially encodes 2-prong energy correlators.
Gradient-Based Program Synthesis with Neurally Interpreted Languages
cs.LG 2026-04 unverdicted novelty 8.0

NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prio...
S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding
cs.CV 2026-01 unverdicted novelty 8.0

S1-MMAlign is a new large-scale dataset of 15.5 million semantically enhanced scientific image-text pairs created via an AI recaptioning pipeline to improve multimodal understanding.
A document is worth a structured record: Principled inductive bias design for document recognition
cs.CV 2025-07 unverdicted novelty 8.0

Introduces a method to design structure-specific relational inductive biases for a base transformer architecture, enabling end-to-end transcription of documents with intrinsic structures, demonstrated on sheet music, ...
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
cs.CL 2023-09 unverdicted novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
cs.RO 2023-03 accept novelty 8.0

Diffusion Policy models robot actions as a conditional diffusion process, outperforming prior state-of-the-art methods by 46.9% on average across 12 manipulation tasks from four benchmarks.
Efficiently Modeling Long Sequences with Structured State Spaces
cs.LG 2021-10 unverdicted novelty 8.0

S4 is an efficient state space sequence model that captures long-range dependencies via structured parameterization of the SSM, achieving state-of-the-art results on the Long Range Arena and other benchmarks while bei...
Decision Transformer: Reinforcement Learning via Sequence Modeling
cs.LG 2021-06 accept novelty 8.0

Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.
Emerging Properties in Self-Supervised Vision Transformers
cs.CV 2021-04 conditional novelty 8.0

Self-supervised ViTs show emergent semantic segmentation and 78.3% k-NN accuracy on ImageNet; DINO reaches 80.1% linear evaluation with ViT-Base.
LeVLJEPA: End-to-End Vision-Language Pretraining Without Negatives
cs.CV 2026-07 unverdicted novelty 7.0

LeVLJEPA is the first non-contrastive vision-language pretraining method that learns via cross-modal prediction without negatives, producing stronger dense features than contrastive baselines on VQA and segmentation tasks.
Language-Assisted Super-Resolution from Real-World Low-Resolution Patches
cs.CV 2026-06 unverdicted novelty 7.0

LA-SR redefines unpaired super-resolution in language space by projecting images into a semantically rich representation and applying vision-language model guided losses to handle real-world degradations extracted fro...
CORDEX-ML-Bench: A Benchmark for Data-Driven Regional Climate Downscaling -Experiment Design and Overview
physics.ao-ph 2026-06 unverdicted novelty 7.0

CORDEX-ML-Bench benchmarks 40 ML models for climate downscaling and finds generative models outperform deterministic ones on precipitation while historically trained models underestimate future climate signals.
The Speedup Paradox: Rethinking Inference Speed-Quality Trade-off in Embodied Tasks
cs.RO 2026-06 unverdicted novelty 7.0

TISED framework reveals paradoxical effects where inference optimizations can lengthen task completion time on static tasks or raise success rates on dynamic tasks in embodied AI.
Higher-Order Fourier Neural Operator: Explicit Mode Mixer for Nonlinear PDEs
cs.CE 2026-06 unverdicted novelty 7.0

HO-FNO extends standard FNO with n-linear spectral mixing and shows improved accuracy on nonlinear PDE benchmarks, sometimes with a single layer beating deeper FNO models.
A Unified Framework for Vision Transformers Equivariant to Discrete Subgroups of $\mathrm{O}(2)$
cs.CV 2026-06 unverdicted novelty 7.0

A unified family of vision transformers equivariant to arbitrary discrete subgroups of O(2), with embedding and expressivity theorems, a D6 construction using hexagonal patches, and experiments on aerial images in low...
A Unified Framework for Vision Transformers Equivariant to Discrete Subgroups of $\mathrm{O}(2)$
cs.CV 2026-06 unverdicted novelty 7.0

Constructs G-equivariant ViTs for arbitrary discrete G ≤ O(2), proves H ≤ G implies G-models embed into H-models and single-head equivariant attention realizes all ordinary G-equivariant maps, introduces D6 hexagonal ...
Layerwise Progressive Freezing: A Training Scaffold for Depth-Scalable Binary Networks
cs.LG 2026-06 unverdicted novelty 7.0

StoMPP progressively binarizes BNN layers layerwise from input to output via stochastic masks, delivering depth-scalable accuracy gains in a fully STE-free regime by controlling activation-induced gradient blockades.
Large Language Model Teaches Visual Students: Cross-Modality Transfer of Fine-Grained Conceptual Knowledge
cs.CV 2026-06 unverdicted novelty 7.0

LaViD distills LLM conceptual knowledge to vision models via LLM-generated MCQ soft labels, outperforming vision-language distillation baselines on fine-grained benchmarks while improving robustness on spurious correl...
Tessellating The Earth
cs.CV 2026-06 unverdicted novelty 7.0

TTE replaces fixed spherical bases with differentiable Voronoi partitions plus shared semantic tokens to create adaptive geolocation encoders that reach new SOTA on geospatial tasks and iNaturalist species classification.
TacVerse: A Multi-Sensor Dataset and Benchmark for Cross-Sensor Vision-Based Tactile Perception
cs.RO 2026-06 unverdicted novelty 7.0

TacVerse is a new multi-sensor tactile dataset with 106,800 images from seven VBTS designs that benchmarks within-sensor performance, zero-shot cross-sensor transfer, and few-shot adaptation on shape, grating, and for...
Communicability-Inspired Positional Encoding (CIPE)
cs.LG 2026-06 unverdicted novelty 7.0

CIPE constructs graph positional encodings from communicability so that self-attention similarities equal the sum of all-path contributions between nodes, yielding 35.5% average gains on seven benchmarks over structur...
FLAT: Feedforward Latent Triangle Splatting for Geometrically Accurate Scene Generation
cs.CV 2026-06 unverdicted novelty 7.0

FLAT maps compressed video diffusion latents to explicit triangle splats via ray-centered rotation parameterization and a product window function, reporting better geometric accuracy than 3D Gaussian baselines under i...
PatternGSL: A Structured Specification Language for Template-Free and Simulation-Ready 3D Garments
cs.CV 2026-06 unverdicted novelty 7.0

PatternGSL is a learnable template-free language for garment sewing patterns enabling direct VLM prediction of simulation-ready 3D garments from images, backed by a 300K image-to-specification dataset.
PatternGSL: A Structured Specification Language for Template-Free and Simulation-Ready 3D Garments
cs.CV 2026-06 unverdicted novelty 7.0

PatternGSL is a new template-free specification language for complete sewing patterns that enables direct single-image prediction of simulation-ready garments via a vision-language model, supported by a new 300K paire...
PatternGSL: A Structured Specification Language for Template-Free and Simulation-Ready 3D Garments
cs.CV 2026-06 unverdicted novelty 7.0

PatternGSL defines a compact, learnable specification language for sewing patterns that enables direct image-to-structured-garment prediction via VLM without templates or post-optimization, supported by a 300K dataset.
PatternGSL: A Structured Specification Language for Template-Free and Simulation-Ready 3D Garments
cs.CV 2026-06 unverdicted novelty 7.0

PatternGSL introduces a learnable specification language for sewing patterns that lets vision-language models reconstruct explicit, simulation-ready 3D garments from single images, backed by a new 300K paired dataset.
MotifGen: Spatiotemporal interpolation of misaligned satellite images via multi-source generative modeling, in an application to tropical cyclones
cs.CV 2026-06 unverdicted novelty 7.0

MotifGen is the first multi-source generative model for spatiotemporal interpolation of misaligned microwave cyclone images from heterogeneous instruments at irregular intervals, achieving lower CRPS via self-supervis...
Tapered Language Models
cs.LG 2026-06 unverdicted novelty 7.0

Tapered Language Models monotonically decrease MLP width across depth with a cosine schedule, yielding better perplexity and downstream performance than uniform-width baselines across multiple architectures and scales...
PanoVine: Whole-Body Visuomotor Control for Soft Growing Vine Robot
cs.RO 2026-06 unverdicted novelty 7.0

Introduces the first autonomous whole-body vision control system for soft vine robots via an end-to-end visuomotor policy trained on demonstrations.
HiMatch-AD: DINOv3-driven Hierarchical Matching for Training-free Medical Anomaly Detection
cs.CV 2026-06 unverdicted novelty 7.0

HiMatch-AD proposes DINOv3-driven hierarchical matching with uncertainty-based fusion for training-free medical anomaly detection and reports outperformance on the BMAD benchmark.
Human and AI collaboration for pulmonary nodule segmentation
cs.CV 2026-06 unverdicted novelty 7.0

Hi-Seg achieves a mean Dice score of nearly 85% for pulmonary nodule segmentation by having humans iteratively refine prompts for the Segment Anything Model, outperforming standalone deep learning and SAM models on a ...
Structured Hyperedge Adaptation for Parameter-Efficient Fine-Tuning of Vision Transformers
cs.CV 2026-06 unverdicted novelty 7.0

HyperAdapter performs PEFT of ViTs via soft hypergraph construction, hyperedge-level bottleneck adaptation, and incidence-based diffusion, claiming consistent gains over token-wise adapters on structured visual benchmarks.
MADField: Multi-fidelity Amortized Density Field for Adsorption in Nanoporous Materials
physics.comp-ph 2026-06 unverdicted novelty 7.0

MADField is a multi-fidelity amortized model for predicting density fields to improve accuracy and speed of adsorption calculations in nanoporous materials for high-throughput screening.
eCNNTO: A Highly Generalizable ConvNet for Accelerating Topology Optimization
cs.AI 2026-06 unverdicted novelty 7.0

eCNNTO applies an element-wise CNN with residual connections and final-stage training data to accelerate density-based topology optimization while generalizing across boundary conditions, loads, geometries, and mesh sizes.
EquiVLA: A General Framework for Rotationally Equivariant Vision-Language-Action Models
cs.RO 2026-06 unverdicted novelty 7.0

EquiVLA is the first general framework for end-to-end SO(2)-equivariant VLA models using EquiPerceptor and EquiActor modules, reporting improved success rates on LIBERO, CALVIN, and real-robot benchmarks.
Polarisation and Faraday rotation measure imaging at metre wavelengths with sub-arcsecond resolution: a foundational calibration strategy
astro-ph.IM 2026-06 unverdicted novelty 7.0

A calibration strategy using full-Jones corrections with an in-field unpolarised calibrator and visibility-based multi-epoch alignment enables sub-arcsecond polarimetric imaging with LOFAR at metre wavelengths.
When LLMs Analyze Scars: From Images to Clinically-Meaningful Features
cs.CV 2026-06 unverdicted novelty 7.0

LLMs generate deterministic code to convert scar images into low-dimensional clinical features for classification, claimed to outperform end-to-end deep learning when training data is scarce.
Heterogeneous SAR-optical fusion for near-real-time land use and land cover mapping under cloud contamination: A novel framework and global benchmark dataset
cs.CV 2026-06 conditional novelty 7.0

CloudLULC-Net is an end-to-end heterogeneous SAR-optical fusion network for LULC mapping under cloud contamination that achieves 86.60% OA, 83.29% F1, and 73.51% mIoU on a new global benchmark of 40,223 samples.
LLM Agents Can See Code Repositories
cs.SE 2026-06 unverdicted novelty 7.0

Visual graphs of repository structure added to text inputs for multimodal LLM agents reduce token consumption by up to 26% while maintaining or improving issue-resolution accuracy.
FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation
cs.RO 2026-06 unverdicted novelty 7.0

FTP-1 is the first foundation tactile policy pretrained on ~3000 hours of data from 26 sources across 21 sensors that improves performance on seen setups by 17.2% and transfers to unseen sensors with 31% success rate gain.
Multi-Label Test-Time Adaptation with Bayesian Conditional Priors
cs.CV 2026-06 unverdicted novelty 7.0

BCP refines zero-shot logits via anchor-conditioned Bayesian updates estimated online from test-stream co-occurrences to promote compatible labels and improve mAP in multi-label classification.
Improving Adversarial Transferability on Vision-Language Pre-training Models via Surrogate-Specific Bias Correction
cs.CV 2026-06 unverdicted novelty 7.0

DeBias-Attack corrects surrogate-specific bias in adversarial gradients for VLP models by subtracting the projection from a reference branch optimized on weak-semantic images.
Multi-channel Optical Vision Model
physics.optics 2026-06 unverdicted novelty 7.0

Spatial multiplexing in optical neural networks is repurposed as a trainable representational coordinate, demonstrated in multi-layer architectures for image classification, regression, and hybrid vision-language capt...
Spatiotemporal Graph Transformer for 3D Neighborhood Interaction and Quality Prediction in Metal Additive Manufacturing
cs.LG 2026-06 unverdicted novelty 7.0

A dual-attention graph transformer on a weighted network representation of fusing locations models cross-layer interactions to improve quality prediction in metal additive manufacturing over image, sequence, and graph...
EgoTactile: Learning Grasp Pressure for Everyday Objects from Egocentric Video
cs.CV 2026-06 unverdicted novelty 7.0

EgoTactile benchmark and EgoPressureDiff diffusion framework for estimating full-hand grasp pressure from egocentric video.
NutriMLLM: Multimodal Large Language Models for Dietary Micronutrient Analysis
cs.CV 2026-06 unverdicted novelty 7.0

NutriMLLM models fine-tuned on 1.1 million synthetic food image-nutrient triplets from population dietary recalls achieve near-complete coverage and competitive accuracy on real food images for comprehensive micronutr...
AdaTok: Self-Budgeting Image Tokenization with Quality-Preserving Dynamic Tokens
cs.CV 2026-06 unverdicted novelty 7.0

AdaTok learns content-dependent token budgets for discrete 1D image tokenization via prioritized representation learning and a GRPO allocation policy, achieving rFID 1.50 at ~118 tokens average versus fixed 256-token ...
Never Seen Before: Benchmarking Genuine Zero-Shot Composed Image Retrieval with Consistent Video-Sourced Datasets
cs.CV 2026-06 unverdicted novelty 7.0

ZeroSight supplies a video-derived dataset and evaluation protocol for genuine zero-shot composed image retrieval plus the SC4CIR consistency method, demonstrating that prior benchmarks inflate reported performance ac...
Anchored, Not Graded: Vision-Language Models Fail at Slant-from-Texture Perception
cs.CV 2026-06 unverdicted novelty 7.0

VLMs across families and scales show anchoring to discrete slant angles in zero-shot and prompted settings rather than human-like graded texture-based slant perception.
Anchored, Not Graded: Vision-Language Models Fail at Slant-from-Texture Perception
cs.CV 2026-06 unverdicted novelty 7.0

VLMs exhibit anchoring to discrete slant angles rather than graded responses across zero-shot, in-context, and fine-tuned settings, unlike human psychophysical patterns.
CIPER: A Unified Framework for Cross-view Image-retrieval and Pose-estimation
cs.CV 2026-06 unverdicted novelty 7.0

CIPER is a unified transformer that jointly performs cross-view image retrieval and 3-DoF pose estimation using shared encoder features, task-specific tokens, and bidirectional cross-attention.
Toward Calibrated, Fair, and accurate Deepfake Detection
cs.LG 2026-06 unverdicted novelty 7.0

Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · cited by 1114 Pith papers · 1 internal anchor

[1]

Adaptive input representations for neural language modeling

Published as a conference paper at ICLR 2021. Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. In ICLR,

work page 2021
[2]

Batch normalization: Accelerating deep network training by reducing internal covariate shift

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift

work page 2021
[3]

Polyak and Anatoli B

doi: 10.1137/0330046. URL https://doi.org/10.1137/0330046. Siyuan Qiao, Huiyu Wang, Chenxi Liu, Wei Shen, and Alan Yuille. Weight standardization. arXiv preprint arXiv:1903.10520,

work page doi:10.1137/0330046 1903
[4]

Fixing the train-test resolution discrepancy

Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Herve Jegou. Fixing the train-test resolution discrepancy. In NeurIPS

work page 2021
[5]

Fixing the train-test resolution discrepancy: FixEfficientNet

Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Herve Jegou. Fixing the train-test resolution discrepancy: FixEfficientNet. arXiv preprint arXiv:2003.08237,

work page arXiv 2003
[6]

Axial-deeplab: Stand-alone axial-attention for panoptic segmentation

Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. Axial-deeplab: Stand-alone axial-attention for panoptic segmentation. In ECCV, 2020a. Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. Axial-deeplab: Stand-alone axial-attention for panoptic segmentation. arXiv preprint arXiv:2003.0...

work page arXiv 2003
[7]

A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark

Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer. S4L: Self-Supervised Semi-Supervised Learning. In ICCV, 2019a. Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, et al. A large-scale study of representation learning w...

work page internal anchor Pith review Pith/arXiv arXiv 1910
[8]

All models are trained with a batch size of 4096 and learning rate warmup of 10k steps

Models | Dataset | Epochs | Base LR | LR decay | Weight decay | Dropout
ViT-B/{16,32} | JFT-300M | 7 | 8·10^-4 | linear | 0.1 | 0.0
ViT-L/32 | JFT-300M | 7 | 6·10^-4 | linear | 0.1 | 0.0
ViT-L/16 | JFT-300M | 7/14 | 4·10^-4 | linear | 0.1 | 0.0
ViT-H/14 | JFT-300M | 14 | 3·10^-4 | linear | 0.1 | 0.0
R50x{1,2} | JFT-300M | 7 | 10^-3 | linear | 0.1 | 0.0
R101x1 | JFT-300M | 7 | 8·10^-4 | l...

work page 2021
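The warmup and linear decay named in the hyperparameter excerpt above can be sketched as a step-to-learning-rate function. This is a minimal illustration, not the authors' code: the excerpt gives the 10k warmup steps, the linear decay, and the base learning rates, while the linear shape of the warmup, the decay endpoint of zero, and the total_steps value are assumptions made for the example.

```python
def learning_rate(step, base_lr=8e-4, warmup_steps=10_000, total_steps=100_000):
    """Linear warmup to base_lr, then linear decay to zero.

    base_lr=8e-4 and warmup_steps=10_000 come from the excerpt (ViT-B row);
    total_steps=100_000 is a hypothetical value chosen for illustration.
    """
    if step < warmup_steps:
        # Assumed linear ramp from 0 up to base_lr during warmup.
        return base_lr * (step / warmup_steps)
    # Linear decay from base_lr at the end of warmup to 0 at total_steps.
    remaining = max(total_steps - step, 0)
    return base_lr * (remaining / (total_steps - warmup_steps))


# Sample the schedule at a few points of the hypothetical run.
schedule = [learning_rate(s) for s in (0, 5_000, 10_000, 55_000, 100_000)]
```

The peak is reached exactly when warmup ends, and the rate halves at the midpoint of the decay phase (step 55,000 under the assumed total).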
[9]

(2017)) is a popular building block for neural architectures

Appendix A: Multihead Self-Attention. Standard qkv self-attention (SA, Vaswani et al. (2017)) is a popular building block for neural architectures. For each element in an input sequence z ∈ R^(N×D), we compute a weighted sum over all values v in the sequence. The attention weights A_ij are based on the pairwise similarity between two elements of the sequence...

work page 2017
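The self-attention excerpt above can be illustrated with a small single-head sketch. It uses the standard scaled dot-product form from Vaswani et al. (2017): each output row is a weighted sum over all values, with weights given by a softmax over query-key similarities. The projection matrices, the random inputs, and the toy sizes N, D, and D_h are assumptions for illustration only, and the code is pure Python rather than an optimized implementation.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(z, w_q, w_k, w_v):
    """Single-head self-attention over a sequence z of shape N x D."""
    q, k, v = matmul(z, w_q), matmul(z, w_k), matmul(z, w_v)
    d_h = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # pairwise query-key similarities, N x N
    attn = [softmax([s / math.sqrt(d_h) for s in row]) for row in scores]
    return matmul(attn, v), attn  # weighted sum over all values


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
N, D, D_h = 4, 8, 8  # toy sizes, chosen only for the example
z = random_matrix(N, D, rng)
out, attn = self_attention(
    z, random_matrix(D, D_h, rng), random_matrix(D, D_h, rng), random_matrix(D, D_h, rng)
)
```

Each row of attn sums to one, so every output element is a convex combination of the value vectors. Multihead attention runs several such heads in parallel and projects their concatenated outputs.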
[10]

For final results we train on the entire training set and evaluate on the respective test data

To do so, we use small sub-splits from the training set (10% for Pets and Flowers, 2% for CIFAR, 1% ImageNet) as development set and train on the remaining data. For final results we train on the entire training set and evaluate on the respective test data. For fine-tuning ResNets and hybrid models we use the exact same setup, with the only exception of Ima...

work page 2021
[11]

(2020) and select the best results across this run and our sweep

for ResNets we also run the setup of Kolesnikov et al. (2020) and select the best results across this run and our sweep. Finally, if not mentioned otherwise, all fine-tuning experiments run at 384 resolution (running fine-tuning at different resolution than training is common practice (Kolesnikov et al., 2020)). When transferring ViT models to another datas...

work page 2020
[12]

B.1.2 Self-Supervision: We employ the masked patch prediction objective for preliminary self-supervision experiments

for all tasks. B.1.2 Self-Supervision. We employ the masked patch prediction objective for preliminary self-supervision experiments. To do so we corrupt 50% of patch embeddings by either replacing their embeddings with a learnable [mask] embedding (80%), a random other patch embedding (10%) or just keeping them as is (10%). This setup is very similar to ...

work page 2019
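The corruption scheme in the self-supervision excerpt above can be sketched as follows. The 50% corruption rate and the 80/10/10 split come from the excerpt. The function name corrupt_patches, the list-of-lists patch representation, the fixed mask vector (learnable in the paper), and the choice to select exactly half of the patches instead of sampling each one independently are assumptions made for the example.

```python
import random


def corrupt_patches(patches, mask_embedding, rng, corruption_rate=0.5):
    """Return (corrupted sequence, sorted indices selected for corruption).

    Selected patches are replaced by the mask embedding (80%), by a random
    other patch embedding (10%), or kept as is (10%). Requires len(patches) >= 2.
    """
    n = len(patches)
    n_corrupt = int(round(corruption_rate * n))
    chosen = set(rng.sample(range(n), n_corrupt))
    corrupted = []
    for i, patch in enumerate(patches):
        if i not in chosen:
            corrupted.append(patch)
            continue
        u = rng.random()
        if u < 0.8:
            corrupted.append(mask_embedding)  # [mask] replacement
        elif u < 0.9:
            j = rng.choice([k for k in range(n) if k != i])
            corrupted.append(patches[j])      # random other patch
        else:
            corrupted.append(patch)           # selected but left unchanged
    return corrupted, sorted(chosen)


rng = random.Random(0)
patches = [[float(i)] * 4 for i in range(16)]  # 16 toy patch embeddings
mask = [-1.0] * 4                              # stand-in for the learnable [mask]
corrupted, targets = corrupt_patches(patches, mask, rng)
```

The returned indices mark the positions where a prediction loss would be computed, including those selected patches that were left unchanged.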
[13]

We also experimented with 15% corruption rate as used by Devlin et al

because it has shown best few-shot performance. We also experimented with 15% corruption rate as used by Devlin et al. (2019) but results were also slightly worse on our few-shot metrics. Lastly, we would like to remark that our instantiation of masked patch prediction doesn’t require such an enormous amount of pretraining nor a large dataset such as JFT ...

work page 2019
[14]

These correspond to Figure 5 in the main paper

name | Epochs | ImageNet | ImageNet ReaL | CIFAR-10 | CIFAR-100 | Pets | Flowers | exaFLOPs
ViT-B/32 | 7 | 80.73 | 86.27 | 98.61 | 90.49 | 93.40 | 99.27 | 55
ViT-B/16 | 7 | 84.15 | 88.85 | 99.00 | 91.87 | 95.80 | 99.56 | 224
ViT-L/32 | 7 | 84.37 | 88.28 | 99.19 | 92.52 | 95.83 | 99.45 | 196
ViT-L/16 | 7 | 86.30 | 89.43 | 99.38 | 93.46 | 96.81 | 99.66 | 783
ViT-L/16 | 14 | 87.12 | 89.99 | 99.38 | 94.04 | 97.11 | 99.56 | 1567
ViT-H/14 | 14 | 88.08 | 90.36 | 9...

work page 2021
[15]

This justifies the choice of Adam as the optimizer used to pre-train ResNets on JFT

Adam pre-training outperforms SGD pre-training on most datasets and on average. This justifies the choice of Adam as the optimizer used to pre-train ResNets on JFT. Note that the absolute numbers are lower than those reported by Kolesnikov et al. (2020), since we pre-train only for 7 epochs, not

work page 2020
[16]

Figure 8 shows 5-shot performance on ImageNet for different configurations

D.2 Transformer Shape. We ran ablations on scaling different dimensions of the Transformer architecture to find out which are best suited for scaling to very large models. Figure 8 shows 5-shot performance on ImageNet for different configurations. All configurations are based on a ViT model with 8 layers, D = 1024, D_MLP = 2048 and a patch size of 32, the inte...

work page 2048
[17]

We tried the following cases: • Providing no positional information: Considering the inputs as a bag of patches

D.4 Positional Embedding. We ran ablations on different ways of encoding spatial information using positional embedding. We tried the following cases:
• Providing no positional information: Considering the inputs as a bag of patches.
• 1-dimensional positional embedding: Considering the inputs as a sequence of patches in the raster order (default across a...

work page 2021
[18]

attention distance

is a simple, yet effective technique to run self-attention on large inputs that are organized as multidimensional tensors. The general idea of axial attention is to perform multiple attention operations, each along a single axis of the input tensor, instead of applying 1-dimensional attention to the flattened version of the input. In axial attention, each...

work page 2021
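The axial-attention idea in the excerpt above, attention applied along one axis at a time instead of over the flattened input, can be sketched on a small H x W grid of vectors. This is a simplified illustration under stated assumptions: queries, keys, and values are all the input vectors themselves (no learned projections, a single head), and the names attend and axial_attention are hypothetical.

```python
import math


def attend(seq):
    """Self-attention over a list of vectors with q = k = v = seq."""
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[c] for w, v in zip(weights, seq)) for c in range(d)])
    return out


def axial_attention(grid):
    """Attend along the width axis (within rows), then along the height axis."""
    height, width = len(grid), len(grid[0])
    # Pass 1: each row is treated as an independent sequence of length W.
    rows_mixed = [attend(row) for row in grid]
    # Pass 2: each column is treated as an independent sequence of length H.
    cols_mixed = [attend([rows_mixed[h][w] for h in range(height)]) for w in range(width)]
    # Transpose back to grid[h][w] layout.
    return [[cols_mixed[w][h] for w in range(width)] for h in range(height)]


grid = [[[float(h), float(w)] for w in range(4)] for h in range(3)]  # 3 x 4 x 2 toy input
mixed = axial_attention(grid)
constant = axial_attention([[[1.0, 2.0]] * 4 for _ in range(3)])
```

Each position compares against H + W positions across the two passes rather than H * W in flattened attention, which is why the technique scales to large multidimensional inputs.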
