Image Transformer

Alexander Ku; Ashish Vaswani; Dustin Tran; Jakob Uszkoreit; Łukasz Kaiser; Niki Parmar; Noam Shazeer

arXiv: 1802.05751 · v3 · pith:U3GUI3TCnew · submitted 2018-02-15 · cs.CV

Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Łukasz Kaiser, Noam Shazeer, Alexander Ku, Dustin Tran

classification: cs.CV

keywords: image, generation, model, self-attention, significantly, architecture, human, imagenet

Abstract

Image generation has been successfully cast as an autoregressive sequence generation or transformation problem. Recent work has shown that self-attention is an effective way of modeling textual sequences. In this work, we generalize a recently proposed model architecture based on self-attention, the Transformer, to a sequence modeling formulation of image generation with a tractable likelihood. By restricting the self-attention mechanism to attend to local neighborhoods we significantly increase the size of images the model can process in practice, despite maintaining significantly larger receptive fields per layer than typical convolutional neural networks. While conceptually simple, our generative models significantly outperform the current state of the art in image generation on ImageNet, improving the best published negative log-likelihood on ImageNet from 3.83 to 3.77. We also present results on image super-resolution with a large magnification ratio, applying an encoder-decoder configuration of our architecture. In a human evaluation study, we find that images generated by our super-resolution model fool human observers three times more often than the previous state of the art.
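The idea of restricting self-attention to a local neighborhood can be illustrated with a small sketch. The code below is a simplified, single-head version written for illustration only: it assumes the image has already been flattened into a sequence, uses a plain sliding causal window of `window` positions, and omits learned projections and multiple heads. The function name `local_causal_attention` and the windowing scheme are assumptions made here, not necessarily the blocking scheme used in the paper.

```python
import math


def local_causal_attention(queries, keys, values, window):
    """Single-head scaled dot-product attention over a local causal window.

    Position i attends only to positions max(0, i - window + 1) .. i,
    so the cost per position depends on `window`, not on sequence length.
    Inputs are lists of equal-length float vectors (hypothetical layout).
    """
    depth = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        start = max(0, i - window + 1)
        # Scaled dot-product scores against the local neighborhood only.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(depth)
            for j in range(start, i + 1)
        ]
        # Numerically stable softmax over the window.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors inside the window.
        out = [
            sum(w * values[start + k][c] for k, w in enumerate(weights))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    # Toy "flattened image": six positions with two channels each.
    x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [-1.0, 0.25], [0.3, 0.7]]
    for row in local_causal_attention(x, x, x, window=3):
        print(row)
```

With a fixed window, the work per position stays constant as the sequence grows, which is why limiting attention to a neighborhood makes longer sequences, and therefore larger images, practical to process.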

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A document is worth a structured record: Principled inductive bias design for document recognition
cs.CV 2025-07 unverdicted novelty 8.0

Introduces a method to design structure-specific relational inductive biases for a base transformer architecture, enabling end-to-end transcription of documents with intrinsic structures, demonstrated on sheet music, ...

Improved Denoising Diffusion Probabilistic Models
cs.LG 2021-02 accept novelty 7.0

Targeted tweaks to DDPMs produce competitive likelihoods and high-quality samples, with learned reverse variances enabling 10x faster sampling and predictable scaling with compute.

Generating Long Sequences with Sparse Transformers
cs.LG 2019-04 unverdicted novelty 7.0

Sparse Transformers factorize attention to handle sequences tens of thousands long, achieving new SOTA density modeling on Enwik8, CIFAR-10, and ImageNet-64.

VideoGPT: Video Generation using VQ-VAE and Transformers
cs.CV 2021-04 accept novelty 6.0

VideoGPT generates competitive natural videos by learning discrete latents with VQ-VAE and modeling them autoregressively with a transformer.

Root Mean Square Layer Normalization
cs.LG 2019-10 conditional novelty 5.0

RMSNorm delivers re-scaling invariance and comparable accuracy to LayerNorm while cutting computation by skipping mean subtraction, yielding 7-64% runtime reductions across tested models.