An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Pith reviewed 2026-05-24 14:20 UTC · model grok-4.3
The pith
A pure transformer applied directly to sequences of image patches performs very well on image classification tasks after large-scale pre-training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Vision Transformer processes an image by dividing it into a grid of fixed-size patches (typically 16x16 pixels), linearly projecting each patch into an embedding, adding learnable position embeddings, and passing the resulting sequence through a standard transformer encoder. After pre-training on large datasets, the model is fine-tuned on target tasks and attains excellent accuracy on ImageNet, CIFAR-100, VTAB and similar benchmarks while requiring substantially fewer computational resources to train than state-of-the-art convolutional networks.
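The patch-to-sequence step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `patchify` and `embed_patches`, the 224x224 input, the embedding width of 64, and the random weights are all hypothetical, and any class token the model may prepend is left out.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into N flattened patches of length patch*patch*C."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c) -> (N, patch*patch*c)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

def embed_patches(image, proj, pos_emb, patch=16):
    """Linearly project each flattened patch, then add position embeddings."""
    tokens = patchify(image, patch) @ proj   # (N, D)
    return tokens + pos_emb                  # (N, D)

# Hypothetical sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
D = 64
proj = rng.standard_normal((16 * 16 * 3, D)) * 0.02   # stands in for the learned projection
pos_emb = rng.standard_normal((196, D)) * 0.02        # stands in for learned position embeddings
tokens = embed_patches(image, proj, pos_emb)
print(tokens.shape)  # (196, 64)
```

With a 224x224 input and 16x16 patches, the encoder would receive 14 x 14 = 196 tokens, each built from a flattened patch of 16 * 16 * 3 = 768 values.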
What carries the argument
Vision Transformer (ViT): a standard transformer encoder applied to a sequence of linearly embedded image patches rather than to convolutional feature maps.
If this is right
- ViT reaches or exceeds the accuracy of leading convolutional networks on ImageNet, CIFAR-100 and VTAB after the same pre-training.
- The model trains with substantially lower computational cost than state-of-the-art CNNs while achieving comparable or better transfer performance.
- Convolutional inductive biases are shown to be unnecessary once pre-training scale is large enough.
- The same patch-sequence architecture transfers successfully to multiple mid-sized and small recognition benchmarks.
Where Pith is reading between the lines
- The same patch-to-sequence reduction could be tested on dense prediction tasks such as segmentation or detection to check whether the performance pattern holds beyond classification.
- If the scaling behavior observed in language models also appears here, further increases in data and model size would be expected to widen the efficiency advantage over CNNs.
- Alternative patch sizes or hierarchical token merging could be explored to reduce the quadratic cost of self-attention on high-resolution inputs.
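The quadratic cost mentioned in the last point can be made concrete with a short calculation. The sketch below assumes square, non-overlapping patches and counts only patch tokens (no class token); the helper name `attention_cost` and the example image sizes are illustrative choices, not values taken from the paper.

```python
def attention_cost(height, width, patch):
    """Return (number of patch tokens, entries in one N x N attention matrix)."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    n = (height // patch) * (width // patch)
    return n, n * n

# Illustrative resolutions and patch sizes.
for size, patch in [(224, 16), (224, 32), (384, 16), (1024, 16)]:
    n, entries = attention_cost(size, size, patch)
    print(f"{size}x{size} image, patch {patch}: {n} tokens, {entries} attention entries per head")
```

Halving the patch side from 32 to 16 at 224x224 quadruples the token count (49 to 196) and multiplies the attention-matrix size by 16 (2401 to 38416), which is why larger patches or token merging are natural levers for high-resolution inputs.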
Load-bearing premise
Large amounts of pre-training data and model capacity can fully compensate for the absence of convolutional inductive biases such as locality and translation equivariance.
What would settle it
A controlled experiment in which a Vision Transformer, trained and transferred under the same large-scale regime, consistently underperforms matched convolutional networks across the reported mid-sized and small image-classification benchmarks.
Original abstract
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Vision Transformer (ViT), a pure transformer model that tokenizes images into fixed-size patches (typically 16x16), linearly embeds them, and processes the sequence with standard transformer layers. When pre-trained on large-scale datasets such as JFT-300M and fine-tuned on ImageNet, CIFAR-100, VTAB and other benchmarks, ViT variants (Base, Large, Huge) match or exceed the accuracy of state-of-the-art CNNs while using substantially less training compute.
Significance. If the reported transfer results hold, the work is significant because it provides direct empirical evidence that convolutional inductive biases are not required for competitive image classification once sufficient pre-training data and model capacity are available. The systematic scaling experiments across model sizes and the comparison against BiT/ResNet baselines on public benchmarks constitute a clear, falsifiable demonstration that patch-based tokenization plus self-attention can substitute for CNNs at scale.
minor comments (3)
- [§3.1] §3.1: the linear patch embedding is described only in prose; an explicit matrix equation showing the projection from flattened patch to D-dimensional token would improve reproducibility.
- [Figure 3, Table 2] Figure 3 and Table 2: the pre-training compute axis is reported in TPUv3-days; adding a second panel or column with FLOPs per image would make the efficiency claim easier to compare across hardware.
- [§4.2] §4.2: the statement that ViT requires 'substantially fewer computational resources' is supported by the JFT-300M numbers but would be strengthened by an explicit wall-clock or energy comparison on the same hardware as the BiT baselines.
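A projection of the kind the first comment asks for could be written as below. This is a sketch under assumptions: the symbols (flattened patch x_p^i, projection matrix E, position embeddings E_pos, patch side P, channel count C, token width D, patch count N) are introduced here for illustration, and any class token or bias term is omitted.

```latex
\mathbf{z}_0 = \left[\, \mathbf{x}_p^{1}\mathbf{E};\ \mathbf{x}_p^{2}\mathbf{E};\ \dots;\ \mathbf{x}_p^{N}\mathbf{E} \,\right] + \mathbf{E}_{pos},
\qquad
\mathbf{x}_p^{i} \in \mathbb{R}^{P^2 C},\quad
\mathbf{E} \in \mathbb{R}^{(P^2 C) \times D},\quad
\mathbf{E}_{pos} \in \mathbb{R}^{N \times D},\quad
N = \frac{HW}{P^2}
```

For an H x W = 224 x 224 RGB image with P = 16, this gives N = 196 tokens and flattened patches of length P^2 C = 768.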
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the Vision Transformer manuscript and the recommendation to accept.
Circularity Check
No circularity: empirical results on public benchmarks
full rationale
The paper's central claim is an empirical demonstration that a pure transformer on image patches, pre-trained at scale, matches CNN performance on standard classification tasks after transfer. This is validated directly via experiments (ViT variants pre-trained on JFT-300M, fine-tuned on ImageNet/CIFAR-100/VTAB) with ablations and baselines; no derivation chain, equations, or fitted parameters reduce to the evaluation data by construction. The premise that CNN inductive biases are unnecessary is tested rather than smuggled in via self-definition or self-citation. The work is self-contained against external benchmarks with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
free parameters (2)
- patch size
- model scale (base/large/huge)
axioms (1)
- standard math: Self-attention and positional encoding as defined in the original Transformer paper
invented entities (1)
- Linear patch embedding: no independent evidence
Forward citations
Cited by 60 Pith papers
-
DyABD: The Abdominal Muscle Segmentation in Dynamic MRI Benchmark
DyABD is the first benchmark dataset for abdominal muscle segmentation in dynamic MRIs featuring exercise-induced anatomical changes and pre/post-surgery scans, where existing models achieve an average Dice score of 0.82.
-
Unlocking the Visual Record of Materials Science: A Large-Scale Multimodal Dataset from Scientific Literature
MatMMExtract pipeline creates MatSciFig dataset of 391k annotated materials science figure panels and MaterialScope detection dataset with high accuracy.
-
DataComp-VLM: Improved Open Datasets for Vision-Language Models
DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
-
StreamKL: Fast and Memory-Efficient KL Divergence for Boosting Attention Distillation
StreamKL is the first fused GPU primitive for attention KL divergence that reduces memory from O(N_Q N_K) to O(1) via an online one-pass formulation and tile-wise recomputation.
-
Exposing Functional Fusion: A New Class of Strategic Backdoor in Dynamic Prompt Architectures
VIPER exposes Functional Fusion in dynamic prompt architectures, enabling a backdoor that resists pruning by tightly integrating attack and utility parameters in the same high-magnitude core.
-
iMiGUE-3K: A Large-Scale Benchmark for Micro-Gesture Analysis with Self-Supervised Learning
iMiGUE-3K is the largest in-the-wild micro-gesture video dataset with 3.4K clips and 37M frames from real interviews, supporting self-supervised foundation models and benchmarks that show micro-gestures improve emotio...
-
Privacy Auditing with Zero (0) Training Run
Zero-Run auditing supplies valid lower bounds on differential privacy parameters from fixed member and non-member datasets by modeling and correcting distribution-shift confounding via causal-inference techniques.
-
CheXTemporal: A Dataset for Temporally-Grounded Reasoning in Chest Radiography
CheXTemporal supplies paired chest X-rays with explicit temporal progression taxonomy and spatial grounding to benchmark and improve models on longitudinal reasoning tasks.
-
Dissecting Jet-Tagger Through Mechanistic Interpretability
A Particle Transformer jet tagger contains a sparse six-head circuit whose source-relay-readout structure recovers most performance and whose residual stream preferentially encodes 2-prong energy correlators.
-
Gradient-Based Program Synthesis with Neurally Interpreted Languages
NLI autonomously discovers a vocabulary of primitive operations and interprets variable-length programs via a neural executor, allowing end-to-end training and gradient-based test-time adaptation that outperforms prio...
-
S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding
S1-MMAlign is a new large-scale dataset of 15.5 million semantically enhanced scientific image-text pairs created via an AI recaptioning pipeline to improve multimodal understanding.
-
A document is worth a structured record: Principled inductive bias design for document recognition
Introduces a method to design structure-specific relational inductive biases for a base transformer architecture, enabling end-to-end transcription of documents with intrinsic structures, demonstrated on sheet music, ...
-
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
-
Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
Diffusion Policy models robot actions as a conditional diffusion process, outperforming prior state-of-the-art methods by 46.9% on average across 12 manipulation tasks from four benchmarks.
-
Efficiently Modeling Long Sequences with Structured State Spaces
S4 is an efficient state space sequence model that captures long-range dependencies via structured parameterization of the SSM, achieving state-of-the-art results on the Long Range Arena and other benchmarks while bei...
-
Decision Transformer: Reinforcement Learning via Sequence Modeling
Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.
-
Emerging Properties in Self-Supervised Vision Transformers
Self-supervised ViTs show emergent semantic segmentation and 78.3% k-NN accuracy on ImageNet; DINO reaches 80.1% linear evaluation with ViT-Base.
-
LeVLJEPA: End-to-End Vision-Language Pretraining Without Negatives
LeVLJEPA is the first non-contrastive vision-language pretraining method that learns via cross-modal prediction without negatives, producing stronger dense features than contrastive baselines on VQA and segmentation tasks.
-
Language-Assisted Super-Resolution from Real-World Low-Resolution Patches
LA-SR redefines unpaired super-resolution in language space by projecting images into a semantically rich representation and applying vision-language model guided losses to handle real-world degradations extracted fro...
-
CORDEX-ML-Bench: A Benchmark for Data-Driven Regional Climate Downscaling -Experiment Design and Overview
CORDEX-ML-Bench benchmarks 40 ML models for climate downscaling and finds generative models outperform deterministic ones on precipitation while historically trained models underestimate future climate signals.
-
The Speedup Paradox: Rethinking Inference Speed-Quality Trade-off in Embodied Tasks
TISED framework reveals paradoxical effects where inference optimizations can lengthen task completion time on static tasks or raise success rates on dynamic tasks in embodied AI.
-
Higher-Order Fourier Neural Operator: Explicit Mode Mixer for Nonlinear PDEs
HO-FNO extends standard FNO with n-linear spectral mixing and shows improved accuracy on nonlinear PDE benchmarks, sometimes with a single layer beating deeper FNO models.
-
A Unified Framework for Vision Transformers Equivariant to Discrete Subgroups of $\mathrm{O}(2)$
A unified family of vision transformers equivariant to arbitrary discrete subgroups of O(2), with embedding and expressivity theorems, a D6 construction using hexagonal patches, and experiments on aerial images in low...
-
A Unified Framework for Vision Transformers Equivariant to Discrete Subgroups of $\mathrm{O}(2)$
Constructs G-equivariant ViTs for arbitrary discrete G ≤ O(2), proves H ≤ G implies G-models embed into H-models and single-head equivariant attention realizes all ordinary G-equivariant maps, introduces D6 hexagonal ...
-
Layerwise Progressive Freezing: A Training Scaffold for Depth-Scalable Binary Networks
StoMPP progressively binarizes BNN layers layerwise from input to output via stochastic masks, delivering depth-scalable accuracy gains in a fully STE-free regime by controlling activation-induced gradient blockades.
-
Large Language Model Teaches Visual Students: Cross-Modality Transfer of Fine-Grained Conceptual Knowledge
LaViD distills LLM conceptual knowledge to vision models via LLM-generated MCQ soft labels, outperforming vision-language distillation baselines on fine-grained benchmarks while improving robustness on spurious correl...
-
Tessellating The Earth
TTE replaces fixed spherical bases with differentiable Voronoi partitions plus shared semantic tokens to create adaptive geolocation encoders that reach new SOTA on geospatial tasks and iNaturalist species classification.
-
TacVerse: A Multi-Sensor Dataset and Benchmark for Cross-Sensor Vision-Based Tactile Perception
TacVerse is a new multi-sensor tactile dataset with 106,800 images from seven VBTS designs that benchmarks within-sensor performance, zero-shot cross-sensor transfer, and few-shot adaptation on shape, grating, and for...
-
Communicability-Inspired Positional Encoding (CIPE)
CIPE constructs graph positional encodings from communicability so that self-attention similarities equal the sum of all-path contributions between nodes, yielding 35.5% average gains on seven benchmarks over structur...
-
FLAT: Feedforward Latent Triangle Splatting for Geometrically Accurate Scene Generation
FLAT maps compressed video diffusion latents to explicit triangle splats via ray-centered rotation parameterization and a product window function, reporting better geometric accuracy than 3D Gaussian baselines under i...
-
PatternGSL: A Structured Specification Language for Template-Free and Simulation-Ready 3D Garments
PatternGSL is a learnable template-free language for garment sewing patterns enabling direct VLM prediction of simulation-ready 3D garments from images, backed by a 300K image-to-specification dataset.
-
PatternGSL: A Structured Specification Language for Template-Free and Simulation-Ready 3D Garments
PatternGSL is a new template-free specification language for complete sewing patterns that enables direct single-image prediction of simulation-ready garments via a vision-language model, supported by a new 300K paire...
-
PatternGSL: A Structured Specification Language for Template-Free and Simulation-Ready 3D Garments
PatternGSL defines a compact, learnable specification language for sewing patterns that enables direct image-to-structured-garment prediction via VLM without templates or post-optimization, supported by a 300K dataset.
-
PatternGSL: A Structured Specification Language for Template-Free and Simulation-Ready 3D Garments
PatternGSL introduces a learnable specification language for sewing patterns that lets vision-language models reconstruct explicit, simulation-ready 3D garments from single images, backed by a new 300K paired dataset.
-
MotifGen: Spatiotemporal interpolation of misaligned satellite images via multi-source generative modeling, in an application to tropical cyclones
MotifGen is the first multi-source generative model for spatiotemporal interpolation of misaligned microwave cyclone images from heterogeneous instruments at irregular intervals, achieving lower CRPS via self-supervis...
-
Tapered Language Models
Tapered Language Models monotonically decrease MLP width across depth with a cosine schedule, yielding better perplexity and downstream performance than uniform-width baselines across multiple architectures and scales...
-
PanoVine: Whole-Body Visuomotor Control for Soft Growing Vine Robot
Introduces the first autonomous whole-body vision control system for soft vine robots via an end-to-end visuomotor policy trained on demonstrations.
-
HiMatch-AD: DINOv3-driven Hierarchical Matching for Training-free Medical Anomaly Detection
HiMatch-AD proposes DINOv3-driven hierarchical matching with uncertainty-based fusion for training-free medical anomaly detection and reports outperformance on the BMAD benchmark.
-
Human and AI collaboration for pulmonary nodule segmentation
Hi-Seg achieves a mean Dice score of nearly 85% for pulmonary nodule segmentation by having humans iteratively refine prompts for the Segment Anything Model, outperforming standalone deep learning and SAM models on a ...
-
Structured Hyperedge Adaptation for Parameter-Efficient Fine-Tuning of Vision Transformers
HyperAdapter performs PEFT of ViTs via soft hypergraph construction, hyperedge-level bottleneck adaptation, and incidence-based diffusion, claiming consistent gains over token-wise adapters on structured visual benchmarks.
-
MADField: Multi-fidelity Amortized Density Field for Adsorption in Nanoporous Materials
MADField is a multi-fidelity amortized model for predicting density fields to improve accuracy and speed of adsorption calculations in nanoporous materials for high-throughput screening.
-
eCNNTO: A Highly Generalizable ConvNet for Accelerating Topology Optimization
eCNNTO applies an element-wise CNN with residual connections and final-stage training data to accelerate density-based topology optimization while generalizing across boundary conditions, loads, geometries, and mesh sizes.
-
EquiVLA: A General Framework for Rotationally Equivariant Vision-Language-Action Models
EquiVLA is the first general framework for end-to-end SO(2)-equivariant VLA models using EquiPerceptor and EquiActor modules, reporting improved success rates on LIBERO, CALVIN, and real-robot benchmarks.
-
Polarisation and Faraday rotation measure imaging at metre wavelengths with sub-arcsecond resolution: a foundational calibration strategy
A calibration strategy using full-Jones corrections with an in-field unpolarised calibrator and visibility-based multi-epoch alignment enables sub-arcsecond polarimetric imaging with LOFAR at metre wavelengths.
-
When LLMs Analyze Scars: From Images to Clinically-Meaningful Features
LLMs generate deterministic code to convert scar images into low-dimensional clinical features for classification, claimed to outperform end-to-end deep learning when training data is scarce.
-
Heterogeneous SAR-optical fusion for near-real-time land use and land cover mapping under cloud contamination: A novel framework and global benchmark dataset
CloudLULC-Net is an end-to-end heterogeneous SAR-optical fusion network for LULC mapping under cloud contamination that achieves 86.60% OA, 83.29% F1, and 73.51% mIoU on a new global benchmark of 40,223 samples.
-
LLM Agents Can See Code Repositories
Visual graphs of repository structure added to text inputs for multimodal LLM agents reduce token consumption by up to 26% while maintaining or improving issue-resolution accuracy.
-
FTP-1: A Generalist Foundation Tactile Policy Across Tactile Sensors for Contact-Rich Manipulation
FTP-1 is the first foundation tactile policy pretrained on ~3000 hours of data from 26 sources across 21 sensors that improves performance on seen setups by 17.2% and transfers to unseen sensors with 31% success rate gain.
-
Multi-Label Test-Time Adaptation with Bayesian Conditional Priors
BCP refines zero-shot logits via anchor-conditioned Bayesian updates estimated online from test-stream co-occurrences to promote compatible labels and improve mAP in multi-label classification.
-
Improving Adversarial Transferability on Vision-Language Pre-training Models via Surrogate-Specific Bias Correction
DeBias-Attack corrects surrogate-specific bias in adversarial gradients for VLP models by subtracting the projection from a reference branch optimized on weak-semantic images.
-
Multi-channel Optical Vision Model
Spatial multiplexing in optical neural networks is repurposed as a trainable representational coordinate, demonstrated in multi-layer architectures for image classification, regression, and hybrid vision-language capt...
-
Spatiotemporal Graph Transformer for 3D Neighborhood Interaction and Quality Prediction in Metal Additive Manufacturing
A dual-attention graph transformer on a weighted network representation of fusing locations models cross-layer interactions to improve quality prediction in metal additive manufacturing over image, sequence, and graph...
-
EgoTactile: Learning Grasp Pressure for Everyday Objects from Egocentric Video
EgoTactile benchmark and EgoPressureDiff diffusion framework for estimating full-hand grasp pressure from egocentric video.
-
NutriMLLM: Multimodal Large Language Models for Dietary Micronutrient Analysis
NutriMLLM models fine-tuned on 1.1 million synthetic food image-nutrient triplets from population dietary recalls achieve near-complete coverage and competitive accuracy on real food images for comprehensive micronutr...
-
AdaTok: Self-Budgeting Image Tokenization with Quality-Preserving Dynamic Tokens
AdaTok learns content-dependent token budgets for discrete 1D image tokenization via prioritized representation learning and a GRPO allocation policy, achieving rFID 1.50 at ~118 tokens average versus fixed 256-token ...
-
Never Seen Before: Benchmarking Genuine Zero-Shot Composed Image Retrieval with Consistent Video-Sourced Datasets
ZeroSight supplies a video-derived dataset and evaluation protocol for genuine zero-shot composed image retrieval plus the SC4CIR consistency method, demonstrating that prior benchmarks inflate reported performance ac...
-
Anchored, Not Graded: Vision-Language Models Fail at Slant-from-Texture Perception
VLMs across families and scales show anchoring to discrete slant angles in zero-shot and prompted settings rather than human-like graded texture-based slant perception.
-
Anchored, Not Graded: Vision-Language Models Fail at Slant-from-Texture Perception
VLMs exhibit anchoring to discrete slant angles rather than graded responses across zero-shot, in-context, and fine-tuned settings, unlike human psychophysical patterns.
-
CIPER: A Unified Framework for Cross-view Image-retrieval and Pose-estimation
CIPER is a unified transformer that jointly performs cross-view image retrieval and 3-DoF pose estimation using shared encoder features, task-specific tokens, and bidirectional cross-attention.
-
Toward Calibrated, Fair, and accurate Deepfake Detection
Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.
Reference graph
Works this paper leans on
-
[1]
Adaptive input representations for neural language modeling
Published as a conference paper at ICLR 2021. Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. In ICLR,
-
[2]
Batch normalization: Accelerating deep network training by reducing internal covariate shift
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift
-
[3]
doi: 10.1137/0330046. URL https://doi.org/10.1137/0330046. Siyuan Qiao, Huiyu Wang, Chenxi Liu, Wei Shen, and Alan Yuille. Weight standardization. arXiv preprint arXiv:1903.10520,
-
[4]
Fixing the train-test resolution discrepancy
Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Herve Jegou. Fixing the train-test resolution discrepancy. In NeurIPS
-
[5]
Fixing the train-test resolution discrepancy: Fixefficientnet
Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Herve Jegou. Fixing the train-test resolution discrepancy: Fixefficientnet. arXiv preprint arXiv:2003.08237,
-
[6]
Axial-deeplab: Stand-alone axial-attention for panoptic segmentation
Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. Axial-deeplab: Stand-alone axial-attention for panoptic segmentation. In ECCV, 2020a. Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. Axial-deeplab: Stand-alone axial-attention for panoptic segmentation. arXiv preprint arXiv:2003.0...
-
[7]
A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark
Xiaohua Zhai, Avital Oliver, Alexander Kolesnikov, and Lucas Beyer. S4L: Self-Supervised Semi-Supervised Learning. In ICCV, 2019a. Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, et al. A large-scale study of representation learning w...
-
[8]
All models are trained with a batch size of 4096 and learning rate warmup of 10k steps
Models         Dataset   Epochs  Base LR  LR decay  Weight decay  Dropout
ViT-B/{16,32}  JFT-300M  7       8·10^-4  linear    0.1           0.0
ViT-L/32       JFT-300M  7       6·10^-4  linear    0.1           0.0
ViT-L/16       JFT-300M  7/14    4·10^-4  linear    0.1           0.0
ViT-H/14       JFT-300M  14      3·10^-4  linear    0.1           0.0
R50x{1,2}      JFT-300M  7       10^-3    linear    0.1           0.0
R101x1         JFT-300M  7       8·10^-4  l...
-
[9]
(2017)) is a popular building block for neural architectures
Appendix A: Multihead Self-Attention. Standard qkv self-attention (SA, Vaswani et al. (2017)) is a popular building block for neural architectures. For each element in an input sequence z ∈ R^{N×D}, we compute a weighted sum over all values v in the sequence. The attention weights A_ij are based on the pairwise similarity between two elements of the sequence...
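The excerpt above is cut off before the formula, so the following single-head sketch fills in the standard scaled dot-product form from Vaswani et al. (2017) rather than quoting this paper. The names `self_attention` and `w_qkv`, the toy shapes, and the random weights are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(z, w_qkv):
    """Single-head qkv self-attention.

    z:     (N, D) input sequence
    w_qkv: (D, 3 * D_h) joint projection to queries, keys and values
    Returns the (N, D_h) output and the (N, N) attention weights.
    """
    q, k, v = np.split(z @ w_qkv, 3, axis=-1)
    d_h = q.shape[-1]
    # A_ij: similarity of query i and key j, normalised over j.
    attn = softmax(q @ k.T / np.sqrt(d_h), axis=-1)
    # Each output row is a weighted sum over all values in the sequence.
    return attn @ v, attn

# Toy shapes for illustration only.
rng = np.random.default_rng(0)
z = rng.standard_normal((5, 8))
w_qkv = rng.standard_normal((8, 3 * 4))
out, attn = self_attention(z, w_qkv)
print(out.shape, attn.sum(axis=-1))
```

Each row of `attn` sums to 1, so every output element is a convex combination of the value vectors. A multihead version would run several such heads in parallel and project their concatenated outputs.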
-
[10]
For final results we train on the entire training set and evaluate on the respective test data
To do so, we use small sub-splits from the training set (10% for Pets and Flowers, 2% for CIFAR, 1% ImageNet) as development set and train on the remaining data. For final results we train on the entire training set and evaluate on the respective test data. For fine-tuning ResNets and hybrid models we use the exact same setup, with the only exception of Ima...
-
[11]
(2020) and select the best results across this run and our sweep
for ResNets we also run the setup of Kolesnikov et al. (2020) and select the best results across this run and our sweep. Finally, if not mentioned otherwise, all fine-tuning experiments run at 384 resolution (running fine-tuning at different resolution than training is common practice (Kolesnikov et al., 2020)). When transferring ViT models to another datas...
-
[12]
for all tasks. B.1.2 Self-Supervision. We employ the masked patch prediction objective for preliminary self-supervision experiments. To do so we corrupt 50% of patch embeddings by either replacing their embeddings with a learnable [mask] embedding (80%), a random other patch embedding (10%) or just keeping them as is (10%). This setup is very similar to ...
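The corruption scheme in the excerpt above (50% of patches selected, then an 80/10/10 split) can be sketched as follows. This is an illustration under assumptions: the function name `corrupt_patches` is hypothetical, exactly half of the patches are selected per image (the excerpt does not say whether selection is exact or per-patch random), and the mask embedding is a fixed vector here rather than a learned one.

```python
import numpy as np

def corrupt_patches(emb, mask_emb, rng, rate=0.5):
    """Corrupt a fraction `rate` of patch embeddings.

    Of the selected patches: 80% are replaced by mask_emb, 10% by the
    embedding of a different patch, and 10% are kept unchanged.
    Returns the corrupted (N, D) array and a boolean mask of selected patches.
    """
    n = emb.shape[0]
    out = emb.copy()
    selected = np.zeros(n, dtype=bool)
    # Assumption: select exactly round(rate * n) patches per image.
    selected[rng.permutation(n)[: int(round(rate * n))]] = True
    action = rng.random(n)
    to_mask = selected & (action < 0.8)
    to_swap = selected & (action >= 0.8) & (action < 0.9)
    out[to_mask] = mask_emb
    for i in np.flatnonzero(to_swap):
        # Draw an index from the n - 1 patches other than i.
        j = int(rng.integers(n - 1))
        out[i] = emb[j + 1 if j >= i else j]
    return out, selected

# Toy data for illustration only.
rng = np.random.default_rng(0)
emb = rng.standard_normal((196, 32))
mask_emb = np.zeros(32)  # stands in for the learnable [mask] embedding
out, selected = corrupt_patches(emb, mask_emb, rng)
print(int(selected.sum()), "of", emb.shape[0], "patches selected for corruption")
```

The returned `selected` mask marks the positions where a prediction loss would be applied; unselected patches pass through untouched.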
-
[13]
We also experimented with 15% corruption rate as used by Devlin et al
because it has shown best few-shot performance. We also experimented with 15% corruption rate as used by Devlin et al. (2019) but results were also slightly worse on our few-shot metrics. Lastly, we would like to remark that our instantiation of masked patch prediction doesn’t require such an enormous amount of pretraining nor a large dataset such as JFT ...
-
[14]
These correspond to Figure 5 in the main paper
name      Epochs  ImageNet  ImageNet ReaL  CIFAR-10  CIFAR-100  Pets   Flowers  exaFLOPs
ViT-B/32  7       80.73     86.27          98.61     90.49      93.40  99.27    55
ViT-B/16  7       84.15     88.85          99.00     91.87      95.80  99.56    224
ViT-L/32  7       84.37     88.28          99.19     92.52      95.83  99.45    196
ViT-L/16  7       86.30     89.43          99.38     93.46      96.81  99.66    783
ViT-L/16  14      87.12     89.99          99.38     94.04      97.11  99.56    1567
ViT-H/14  14      88.08     90.36          9...
-
[15]
This justifies the choice of Adam as the optimizer used to pre-train ResNets on JFT
Adam pre-training outperforms SGD pre-training on most datasets and on average. This justifies the choice of Adam as the optimizer used to pre-train ResNets on JFT. Note that the absolute numbers are lower than those reported by Kolesnikov et al. (2020), since we pre-train only for 7 epochs, not
-
[16]
Figure 8 shows 5-shot performance on ImageNet for different configurations
D.2 Transformer Shape. We ran ablations on scaling different dimensions of the Transformer architecture to find out which are best suited for scaling to very large models. Figure 8 shows 5-shot performance on ImageNet for different configurations. All configurations are based on a ViT model with 8 layers, D = 1024, D_MLP = 2048 and a patch size of 32, the inte...
-
[17]
D.4 Positional Embedding. We ran ablations on different ways of encoding spatial information using positional embedding. We tried the following cases:
- Providing no positional information: Considering the inputs as a bag of patches.
- 1-dimensional positional embedding: Considering the inputs as a sequence of patches in the raster order (default across a...
-
[18]
is a simple, yet effective technique to run self-attention on large inputs that are organized as multidimensional tensors. The general idea of axial attention is to perform multiple attention operations, each along a single axis of the input tensor, instead of applying 1-dimensional attention to the flattened version of the input. In axial attention, each...
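The idea of attending along one axis at a time can be sketched for a 2D grid of tokens. This is a simplified illustration, not the cited method: it uses the input itself as queries, keys and values (no learned projections, heads, or position terms), and the function names `attend_along_rows` and `axial_attention` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend_along_rows(x):
    """Attention within each row of x, shape (H, W, D).

    Every row is treated as an independent sequence of W tokens, so the
    score tensor has shape (H, W, W) instead of (H*W, H*W).
    """
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores, axis=-1) @ x

def axial_attention(x):
    """Row-wise attention followed by column-wise attention on (H, W, D)."""
    x = attend_along_rows(x)
    # Swap the spatial axes so columns become rows, attend, then swap back.
    return np.swapaxes(attend_along_rows(np.swapaxes(x, 0, 1)), 0, 1)

# Toy grid for illustration only.
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4, 8))
y = axial_attention(x)
print(y.shape)  # (6, 4, 8)
```

For an H x W grid the two passes compute H*W*W + W*H*H attention scores, compared with (H*W)^2 for attention over the flattened grid.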