SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Andreas Blattmann; Dustin Podell; Joe Penna; Jonas Müller; Kyle Lacey; Robin Rombach; Tim Dockhorn; Zion English

arxiv: 2307.01952 · v1 · pith:B3L3LJBTnew · submitted 2023-07-04 · 💻 cs.CV · cs.AI

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, Robin Rombach

Pith reviewed 2026-05-10 15:16 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords SDXL · latent diffusion · text-to-image synthesis · UNet scaling · refinement model · Stable Diffusion · conditioning schemes · high-resolution generation

The pith

SDXL scales up the UNet and adds conditioning plus refinement to make latent diffusion competitive with closed image generators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SDXL as an upgraded latent diffusion model for text-to-image generation that uses a UNet backbone three times larger than prior Stable Diffusion versions, achieved mainly by adding attention blocks and a second text encoder. It introduces new conditioning methods, trains across multiple aspect ratios, and pairs the base model with a separate refinement network that runs post-hoc image-to-image processing to raise visual quality. These changes produce outputs that exceed earlier open Stable Diffusion models and reach levels comparable to proprietary black-box systems. The work releases code and weights to support open research. Readers would care because it offers a transparent path to high-fidelity image synthesis without depending on closed commercial services.

Core claim

We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared to the previous versions of Stable Diffusion and achieves results that

What carries the argument

The three-times-larger UNet backbone with added attention blocks and a second text encoder for expanded cross-attention context, together with novel conditioning schemes, multi-aspect-ratio training, and a post-hoc refinement model that performs image-to-image enhancement.
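
To make the "expanded cross-attention context" concrete, the following is a minimal sketch of how per-token embeddings from two text encoders could be joined along the channel axis before they reach the UNet's cross-attention layers. The function name is hypothetical, and the token count (77) and encoder widths (768 and 1280) are illustrative assumptions rather than values stated in this summary.

```python
from typing import List

Matrix = List[List[float]]  # shape: [tokens][channels]


def concat_text_context(enc_a: Matrix, enc_b: Matrix) -> Matrix:
    """Join two per-token embedding matrices along the channel axis.

    Hypothetical helper: each output row is the row from the first
    encoder followed by the row from the second encoder.
    """
    if len(enc_a) != len(enc_b):
        raise ValueError("both encoders must emit the same number of tokens")
    return [row_a + row_b for row_a, row_b in zip(enc_a, enc_b)]


# Illustrative (assumed) sizes, not taken from the summary above.
TOKENS, WIDTH_A, WIDTH_B = 77, 768, 1280
enc_a = [[0.0] * WIDTH_A for _ in range(TOKENS)]  # stand-in for encoder 1 output
enc_b = [[1.0] * WIDTH_B for _ in range(TOKENS)]  # stand-in for encoder 2 output

context = concat_text_context(enc_a, enc_b)
print(len(context), len(context[0]))  # 77 2048
```

The point of the sketch is only that the context width seen by cross-attention grows to the sum of the two encoder widths, which is one source of the added parameters the paper describes.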

If this is right

Image synthesis quality improves markedly over earlier open Stable Diffusion releases.
The model handles variable aspect ratios without retraining.
A lightweight post-processing step further raises fidelity of base outputs.
Open release of weights and code enables community inspection and extension.
Performance reaches parity with certain closed commercial generators on visual metrics.
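
The variable-aspect-ratio point above is commonly realized by grouping training images into resolution "buckets". The sketch below shows one way this could work, assuming buckets whose sides are multiples of 64 and whose area stays near 1024 x 1024 pixels; those numbers, the side limits, and both function names are assumptions for illustration, not details given in this summary.

```python
import math
from typing import List, Tuple

Bucket = Tuple[int, int]  # (height, width)


def make_buckets(target_area: int = 1024 * 1024, step: int = 64,
                 min_side: int = 512, max_side: int = 2048) -> List[Bucket]:
    """Enumerate (height, width) pairs with sides on a `step` grid and
    area close to `target_area` (all defaults are assumed values)."""
    buckets = []
    for h in range(min_side, max_side + 1, step):
        # Pick the width on the grid that brings h * w closest to the target.
        w = round(target_area / h / step) * step
        if min_side <= w <= max_side:
            buckets.append((h, w))
    return buckets


def assign_bucket(height: int, width: int, buckets: List[Bucket]) -> Bucket:
    """Choose the bucket whose aspect ratio is nearest in log space."""
    log_ratio = math.log(height / width)
    return min(buckets, key=lambda hw: abs(math.log(hw[0] / hw[1]) - log_ratio))


buckets = make_buckets()
square = assign_bucket(3000, 3000, buckets)
landscape = assign_bucket(1080, 1920, buckets)
print(square, landscape)
```

Comparing ratios in log space treats a 2:1 and a 1:2 image symmetrically, which is a design choice of this sketch rather than something the paper is known to prescribe.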

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Open models may narrow the gap with proprietary systems through targeted architectural scaling rather than data secrecy alone.
The refinement stage could be adapted as a modular add-on for other diffusion pipelines.
Wider availability might shift user workflows away from paid API calls toward local or fine-tuned open alternatives.
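
To illustrate why the refinement stage could plausibly be reused as a modular add-on, here is a minimal control-flow sketch of a generic image-to-image pass over latents: partially re-noise the base output, then run only the remaining denoising steps with a second model. The noise schedule values, the `strength` mapping, and the `denoise_step` callback (a stand-in for a refinement network) are all hypothetical; this is not the authors' implementation.

```python
import math
import random
from typing import Callable, List, Sequence

Latent = List[float]


def add_noise(z0: Latent, alpha_bar: float, rng: random.Random) -> Latent:
    """Forward-diffuse a clean latent to the noise level given by alpha_bar."""
    signal, noise = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in z0]


def refine(z_base: Latent, alpha_bars: Sequence[float], strength: float,
           denoise_step: Callable[[Latent, int], Latent],
           rng: random.Random) -> Latent:
    """Image-to-image pass: re-noise, then denoise from an intermediate step.

    `alpha_bars` is ordered from least noisy (index 0) to most noisy.
    `denoise_step(z, t)` is a hypothetical callback for one reverse step.
    """
    if not 0.0 < strength <= 1.0:
        raise ValueError("strength must be in (0, 1]")
    # Higher strength starts from a noisier level and runs more steps.
    start = max(0, round(strength * len(alpha_bars)) - 1)
    z = add_noise(z_base, alpha_bars[start], rng)
    for t in range(start, -1, -1):
        z = denoise_step(z, t)
    return z


calls: List[int] = []


def placeholder_step(z: Latent, t: int) -> Latent:
    """Identity placeholder that only records which steps were run."""
    calls.append(t)
    return z


alpha_bars = [0.99, 0.9, 0.7, 0.4, 0.1]  # assumed toy schedule
refined = refine([0.5, -0.2, 0.1, 0.0], alpha_bars, 0.4,
                 placeholder_step, random.Random(0))
print(calls)  # [1, 0]
```

Because the pass only needs a latent and a step function, any pipeline that exposes both could in principle slot in a refinement model this way.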

Load-bearing premise

The reported gains in image quality come chiefly from the architectural scaling, conditioning additions, and refinement step rather than from undisclosed increases in training data volume, curation quality, or total compute.

What would settle it

A controlled re-training of SDXL and a prior Stable Diffusion baseline on identical data and hardware, followed by direct side-by-side evaluation on the same prompts, would show whether architecture alone explains the quality jump.
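
One way the side-by-side evaluation could be scored is a paired preference test: for each prompt, a rater picks the output of model A, model B, or a tie, and an exact sign test checks whether the split differs from 50/50. The sketch below assumes that setup; the vote labels and function names are hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided exact binomial (sign) test against a 50/50 null.

    Ties must be dropped before calling this function.
    """
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def summarize(votes: Sequence[str]) -> Tuple[float, float]:
    """Return (win rate of A among decided votes, sign-test p-value)."""
    wins, losses = votes.count("A"), votes.count("B")
    decided = wins + losses
    win_rate = wins / decided if decided else 0.5
    return win_rate, sign_test_p_value(wins, losses)


# Hypothetical votes: A preferred on all ten decided prompts, two ties.
votes = ["A"] * 10 + ["tie"] * 2
win_rate, p_value = summarize(votes)
print(win_rate, p_value)  # 1.0 0.001953125
```

A test like this only says whether raters prefer one model; attributing the preference to architecture still requires the matched data and compute described above.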

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SDXL scales the UNet, adds a second text encoder and refinement stage, and ships open weights so the gains can be checked directly. The authors describe a three-times larger UNet backbone, achieved mostly through extra attention blocks and expanded cross-attention context from the second encoder. They also bring in new conditioning methods, train across multiple aspect ratios, and use a separate refinement model for post-processing. The results show clear visual gains over earlier Stable Diffusion releases and competitive quality with closed systems.

What the paper does well is lay out these changes in a straightforward way and back them with the release of code and model weights. That openness turns the empirical claims into something the community can test right away on their own prompts and metrics. The multi-aspect training and refinement step address real issues in high-res generation.

The soft spots are around attribution. Without detailed breakdowns of training data size, curation, or compute budgets, it's difficult to know how much the architectural changes drive the improvement versus just scaling up resources. Side-by-side comparisons with proprietary models can be tricky to interpret fairly. These are common in this area, though, and not fatal here because the model is public.

This work is for researchers and engineers focused on open diffusion models. Anyone looking to improve or deploy high-quality text-to-image systems will get practical value from the released artifacts. It is not pushing new theory, but the concrete implementation details and verifiability make it a useful addition. I would bring it to the next reading group to go over the conditioning schemes and see how they perform in practice. It deserves a serious referee because the core claims are testable and the open release supports proper evaluation.

Referee Report

1 major / 3 minor

Summary. The paper introduces SDXL, a latent diffusion model for text-to-image synthesis. It employs a UNet backbone three times larger than prior Stable Diffusion versions, achieved mainly through additional attention blocks and a second text encoder enabling larger cross-attention context. Novel conditioning schemes are proposed, the model is trained across multiple aspect ratios, and a refinement model is added for post-hoc image-to-image fidelity improvement. The central claim is that SDXL achieves drastically improved performance over previous Stable Diffusion versions while remaining competitive with black-box state-of-the-art generators; code and model weights are released.

Significance. If the empirical performance claims hold under independent verification, this constitutes a meaningful open contribution to high-resolution text-to-image synthesis by providing a transparent, reproducible baseline that can accelerate community research. The explicit release of code and weights is a clear strength that directly supports falsifiability of the reported gains.

major comments (1)

[Abstract and results sections] The central empirical claim (drastically improved performance and competitiveness with closed SOTA models) rests on comparisons whose quantitative details, ablation controls, and dataset statistics are not fully elaborated in the provided text. Without explicit tables reporting metrics such as FID or CLIP scores on fixed benchmarks, together with controls for training data volume and curation, it remains difficult to isolate the contribution of the architectural changes (larger UNet, second text encoder, conditioning schemes) from possible differences in compute or data.
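
For reference, the FID metric the comment asks for is conventionally computed by fitting Gaussians to Inception features of the real and generated image sets. The symbols below are the standard ones and are not taken from the paper under review: $\mu_r, \Sigma_r$ are the feature mean and covariance of real images, and $\mu_g, \Sigma_g$ those of generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower is better, and the value depends on the reference set and sample count, which is why the comment asks for fixed benchmarks.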

minor comments (3)

[Methods] Clarify the exact training data composition and aspect-ratio sampling strategy in the methods section to allow readers to assess potential data-related confounds.
[Figures] Add captions to all qualitative figures that explicitly state the prompt, sampling parameters, and which model variant is shown in each panel.
[Abstract] Verify that the released GitHub repository contains the exact model weights, inference code, and evaluation scripts referenced in the paper.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and recommendation of minor revision. We address the major comment below and will incorporate the suggested improvements in the revised manuscript.

Referee: [Abstract and results sections] The central empirical claim (drastically improved performance and competitiveness with closed SOTA models) rests on comparisons whose quantitative details, ablation controls, and dataset statistics are not fully elaborated in the provided text. Without explicit tables reporting metrics such as FID or CLIP scores on fixed benchmarks, together with controls for training data volume and curation, it remains difficult to isolate the contribution of the architectural changes (larger UNet, second text encoder, conditioning schemes) from possible differences in compute or data.

Authors: We agree that more explicit quantitative details would help clarify the contributions of the architectural and conditioning changes. The manuscript presents extensive qualitative results and some supporting metrics demonstrating the performance gains, but we acknowledge that additional tables with FID and CLIP scores on fixed benchmarks (e.g., MS-COCO), together with more detailed ablation controls, would strengthen the isolation of effects from the larger UNet, second text encoder, and novel conditioning. In the revised version we will add these tables and expand the description of training data aspects (including multi-aspect-ratio sampling) to the extent feasible. We note that the release of code and model weights directly enables independent verification, further ablations, and community evaluation on any desired benchmarks, which addresses the core concern of reproducibility.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical claims

The paper describes an empirical engineering effort: a larger UNet with additional attention blocks, a second text encoder, novel conditioning schemes, multi-aspect-ratio training, and a post-hoc refinement model. The central claim of improved performance over prior Stable Diffusion versions and competitiveness with closed SOTA models rests on released weights, code, and external visual/qualitative comparisons rather than any internal derivation, fitted parameter renamed as prediction, or self-citation chain. No equations, uniqueness theorems, or ansatzes are invoked that reduce to the paper's own inputs by construction; the work is self-contained against verifiable external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The performance claims rest on standard latent diffusion assumptions plus empirical training; no new physical entities or ad-hoc constants are introduced beyond typical model hyperparameters.

free parameters (1)

UNet backbone scale
Model size increased by factor of three via more attention blocks; chosen through design and training to improve capacity.

axioms (1)

domain assumption: Latent diffusion models generate images by iteratively denoising in a compressed latent space conditioned on text embeddings
Core modeling assumption underlying the entire SDXL architecture and training procedure.
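
As a reminder of what this assumption looks like formally, the noise-prediction objective usually written for latent diffusion models is sketched below. The notation follows the general latent diffusion literature and is an assumption here, since this summary does not state SDXL's exact loss: $\mathcal{E}$ is the image encoder, $z_t$ the noised latent at step $t$, $c$ the text prompt, $\tau_\theta$ the text encoder, and $\epsilon_\theta$ the denoising UNet.

```latex
L_{\mathrm{LDM}} = \mathbb{E}_{\mathcal{E}(x),\, c,\, \epsilon \sim \mathcal{N}(0, I),\, t}
  \left[ \left\lVert \epsilon - \epsilon_\theta\!\left(z_t, t, \tau_\theta(c)\right) \right\rVert_2^2 \right]
```

Sampling then runs the learned denoiser in reverse over $t$, and the final latent is decoded back to pixel space.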

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SARLO-80: Worldwide Slant SAR Language Optic Dataset 80cm
cs.CV 2026-06 accept novelty 8.0

SARLO-80 is a new public dataset of 119566 complex SAR-optical-text triplets standardized to 80cm slant-range resolution from 257 locations across 72 countries.
On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models
cs.CR 2026-05 conditional novelty 8.0

Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-...
When Do Diffusion Models learn to Generate Multiple Objects?
cs.CV 2026-04 unverdicted novelty 8.0

Diffusion models' multi-object generation is limited primarily by scene complexity and held-out combinations rather than imbalance, with counting difficult in low data and compositional generalization collapsing as mo...
Flow-GRPO: Training Flow Matching Models via Online RL
cs.CV 2025-05 unverdicted novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
Lie Group Diffusion Models for Hardware-Aware Quantum Circuit Synthesis
quant-ph 2026-06 unverdicted novelty 7.0

Lie group diffusion models combine a discrete circuit skeleton selector with continuous diffusion on SU(2) ≃ S³ to synthesize hardware-aware quantum circuits, outperforming baselines on three-qubit Hamiltonian simulat...
Diffusion Model Attribution via Spectral Coupling of Denoiser Responses
cs.CV 2026-06 unverdicted novelty 7.0

SDS extracts stable spectral signatures from diffusion model denoisers via frequency-controlled perturbations, achieving 99.9% attribution accuracy across eight models and 96.2% under prompt shift.
From Celebrities to Anyone: Characterizing AI Nudification Content, Technology, and Community Dynamics on 4chan
cs.CY 2026-06 unverdicted novelty 7.0

Large-scale study of AI nudification on 4chan identifies 24,105 items showing a shift to 55.8% non-celebrity targets and dominance of open-source models like Stable Diffusion.
Focusing on What Matters: Saliency-Harnessing Accurate Routing for Diffusion MoE
cs.CV 2026-06 unverdicted novelty 7.0

SharpMoE is a plug-and-play post-training method that uses clean latent features and a trajectory routing loss to enable accurate saliency-based routing in diffusion MoE models for improved visual generation.
Do Image Editing Models Understand Lighting?
cs.CV 2026-06 unverdicted novelty 7.0

New 3DLP benchmark with real-world 1K HDR pairs shows state-of-the-art image editing models vary in physical lighting consistency, with best models close to reality but error-prone in low-light regions.
Trustworthy Image Authentication using Forensic Knowledge Graphs
cs.CV 2026-06 unverdicted novelty 7.0

Forensic Knowledge Graphs integrate forensic traces, causal dependencies, and scene links via a new authentication network and Iterative Context Refinement to outperform standard detectors and VLMs on detection, local...
DiT-Reward: Generative Representations for Text-to-Image Reward Modeling
cs.LG 2026-06 unverdicted novelty 7.0

DiT-Reward converts pretrained DiT models into reward predictors that outperform HPSv3 on four benchmarks while providing 1.65x inference speedup.
SARLO-80: Worldwide Slant SAR Language Optic Dataset 80cm
cs.CV 2026-06 conditional novelty 7.0

SARLO-80 provides 119,566 complex SAR-optical-text triplets at 80 cm slant-range resolution with fixed splits and preprocessing code.
Forged Calamity: Benchmark for Cross-Domain Synthetic Disaster Detection in the Age of Diffusion
cs.CV 2026-06 unverdicted novelty 7.0

Introduces Forged Calamity benchmark and shows that fine-tuned and zero-shot synthetic image detectors lose substantial accuracy on unseen generators and disaster types.
InterleaveThinker: Reinforcing Agentic Interleaved Generation
cs.CV 2026-06 unverdicted novelty 7.0

InterleaveThinker is the first multi-agent pipeline enabling interleaved generation in any image generator through planner-critic agents, SFT on custom datasets, and GRPO RL with accuracy and step-wise rewards.
Adv-TGD: Adversarial Text-Guided Diffusion for Face Recognition Impersonation Attacks
cs.CV 2026-06 unverdicted novelty 7.0

Adv-TGD is a text-guided diffusion attack that achieves 85.9% black-box ASR on four face recognition models while preserving PSNR 28.18 dB and SSIM 0.981.
IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation
cs.AI 2026-06 unverdicted novelty 7.0

IMUG-Bench is a new multi-turn interleaved image-text benchmark that exposes exposure bias in unified multimodal model generation and shows test-time scaling can mitigate it.
HACK++: Towards More Effective Head-Aware Key-Value Compression for Efficient Visual Autoregressive Modeling
cs.CV 2026-06 unverdicted novelty 7.0

HACK++ is a head-aware KV cache compression framework for VAR models that decouples current-scale attention from historical cache under adaptive per-head budgets to achieve near-lossless generation at 30% attention an...
DRIFT: From Robustness Gaps to Invariance Manifolds for AI-Generated Image Detection
cs.CV 2026-06 unverdicted novelty 7.0

DRIFT learns a structured invariance manifold from real images via one-class supervision on decomposed robust and fragile subspaces of a frozen VFM to detect AI-generated images through margin violations.
Text-to-Image Models Need Less from Text Encoders Than You Think
cs.CV 2026-06 unverdicted novelty 7.0

A bag-of-position-tagged-words embedding guides text-to-image diffusion models as effectively as full contextual text embeddings from standard encoders.
ImageAuditor: Membership Inference Attack against Image-based Retrieval-Augmented Generation
cs.CR 2026-06 unverdicted novelty 7.0

ImageAuditor is the first MIA for IRAG that achieves over 80% AUROC with four queries by using reward-guided policy optimization for cross-modal retrieval and task-specific prompting for signal extraction.
OctoT2I: A Self-Evolving Agentic Text-to-Image Router
cs.AI 2026-06 unverdicted novelty 7.0

OctoT2I uses a no-supervision PSEL loop to discover model capability frontiers and route T2I tasks, reaching 0.96 GenEval score with 90.3% speedup over Flow-GRPO.
Orthogonal Negative Guidance in Attention Feature Space for Text-to-Image Generation
cs.CV 2026-05 unverdicted novelty 7.0

Orthogonal Negative Guidance subtracts only the orthogonal component of negative-prompt attention features from positive ones in FLUX models to suppress concepts while preserving semantics and quality.
Towards Anatomically Plausible Human Image Generation via Synthetic Localized Preferences
cs.CV 2026-05 unverdicted novelty 7.0

ASAP generates over 10K synthetic anatomical preference pairs via targeted degradation of high-fidelity images and applies a localized margin-bounded DPO to reduce anatomical errors in text-to-image human generation, ...
DRM: Diffusion-based Reward Model With Step-wise Guidance
cs.CV 2026-05 unverdicted novelty 7.0

DRM turns a pre-trained diffusion model into a step-wise reward model and uses it for dense RL training (Step-wise GRPO) and guided sampling to improve final image quality.
Inference-Time Alignment of Diffusion Models via Trust-Region Iterative Twisted Sequential Monte Carlo
cs.LG 2026-05 conditional novelty 7.0

TRI-TSMC is a trust-region framework for learning twisting functions in SMC-based inference-time alignment of diffusion models that yields zero-variance samplers in theory and better alignment on text and image tasks ...
Point Tracking Improves World Action Models
cs.RO 2026-05 unverdicted novelty 7.0

JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.
Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning
cs.CV 2026-05 unverdicted novelty 7.0

Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.
Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models
cs.CV 2026-05 unverdicted novelty 7.0

Linear-DPO replaces sigmoid utility with linear utility and adds EMA reference to improve preference alignment in diffusion and flow-matching text-to-image models.
GeoDiff-SAR II: 3D-Driven Foundation Diffusion Models for SAR Generation via Decoupled Control
eess.IV 2026-05 unverdicted novelty 7.0

GeoDiff-SAR II proposes a 3D-driven decoupled diffusion framework using GECM and ControlNet on a FLUX backbone for controllable SAR image generation across large viewpoint gaps.
AnyAct: Towards Human Reenactment of Character Motion From Video
cs.CV 2026-05 unverdicted novelty 7.0

AnyAct generates plausible human reenactments from non-human character videos via conditional motion generation from transferable sparse local 2D articulated cues, using human-only supervision, progressive training, a...
From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing
cs.CV 2026-05 unverdicted novelty 7.0

A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.
OP4KSR: One-Step Patch-Free 4K Super-Resolution with Periodic Artifact Suppression
cs.CV 2026-05 unverdicted novelty 7.0

OP4KSR enables efficient one-step 4K super-resolution without patches by adapting Flux with RoPE rescaling and periodicity loss to suppress artifacts.
Context Matters: Auditing Gender Bias in T2I Generation through Risk-Tiered Use-Case Profiles
cs.CY 2026-05 unverdicted novelty 7.0

A new framework called THUMB cards organizes gender bias metrics for T2I models by risk-tiered use cases, measurement categories, and harm typologies aligned with the EU AI Act.
ImageAttributionBench: How Far Are We from Generalizable Attribution?
cs.CV 2026-05 unverdicted novelty 7.0

ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
cs.CV 2026-05 unverdicted novelty 7.0

INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
cs.CV 2026-05 unverdicted novelty 7.0

Proposes V2V-Zero, a training-free framework replacing text conditioning with VLM final-layer hidden states from visual pages, achieving 0.85 on GenEval and 32.7/100 on new Simple-V2V Bench across models including vid...
Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models
cs.CV 2026-05 unverdicted novelty 7.0

Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming ...
LatentHDR: Decoupling Exposure from Diffusion via Conditional Latent-to-Latent Mapping for Text/Image-to-Panoramic HDR
cs.CV 2026-05 unverdicted novelty 7.0

LatentHDR generates structurally consistent panoramic HDR images by producing one scene latent with a diffusion backbone then deterministically mapping it to multiple exposure latents via a lightweight conditional head.
ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models
cs.CV 2026-05 unverdicted novelty 7.0

ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.
Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs
cs.CV 2026-05 unverdicted novelty 7.0

PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.
Dependency-Aware Discrete Diffusion for Scene Graph Generation
cs.CV 2026-05 unverdicted novelty 7.0

A new discrete diffusion model for scene graph generation from text captures object-relation dependencies via hierarchical constraints and training-free conditioning, yielding better graph metrics and downstream image...
Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models
cs.CV 2026-05 unverdicted novelty 7.0

ArenaPO infers Gaussian capability distributions from pairwise preferences and applies truncated-normal latent inference to derive fine-grained offline rewards for preference optimization of text-to-image diffusion models.
Fusion in Your Way: Aligning Image Fusion with Heterogeneous Demands via Direct Preference Optimization
cs.CV 2026-05 unverdicted novelty 7.0

DPOFusion uses direct preference optimization on property-aligned and preference-controllable latent diffusion models to produce adaptive infrared-visible image fusions aligned with heterogeneous human and machine vis...
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
cs.CV 2026-05 unverdicted novelty 7.0

D-OPSD formulates supervised fine-tuning of step-distilled diffusion models as on-policy self-distillation by minimizing distribution differences between a text-only student and a multimodal teacher on the student's o...
From Diffusion to Rectified Flow: Rethinking Text-Based Segmentation
cs.CV 2026-05 unverdicted novelty 7.0

RLFSeg repurposes pretrained generative models via Rectified Flow for direct latent-space image-to-mask mapping in text-based segmentation, outperforming diffusion-based methods especially in zero-shot cases.
D-Rex : Diffusion Rendering for Relightable Expressive Avatars
cs.GR 2026-04 conditional novelty 7.0

D-Rex applies a LoRA-fine-tuned video diffusion model as an image-space post-process to add consistent relighting to any expressive full-body avatar pipeline while preserving motion and facial detail.
GeoEdit: Local Frames for Fast, Training-Free On-Manifold Editing in Diffusion Models
cs.LG 2026-04 unverdicted novelty 7.0

GeoEdit constructs local tangent frames from small perturbations to initial noise, enabling Jacobian-free on-manifold edits in diffusion models via alternating tangent steps and diffusion projections.
Bridging Restoration and Generation Manifolds in One-Step Diffusion for Real-World Super-Resolution
cs.CV 2026-04 unverdicted novelty 7.0

IDaS-SR achieves one-step real-world super-resolution by bridging restoration and generation manifolds via adaptive inversion noise estimation and continuous trajectory steering.
Geometry-Conditioned Diffusion for Occlusion-Robust In-Bed Pose Estimation
cs.CV 2026-04 unverdicted novelty 7.0

Pose-LDM generates occluded in-bed images from keypoints to augment training data, achieving top accuracy under severe occlusion compared to other augmentation methods.
Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization
cs.CV 2026-04 unverdicted novelty 7.0

Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.
$Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models
cs.CV 2026-04 unverdicted novelty 7.0

Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a di...
DCMorph: Face Morphing via Dual-Stream Cross-Attention Diffusion
cs.CV 2026-04 unverdicted novelty 7.0

DCMorph generates face morphs via decoupled cross-attention in identity-conditioned diffusion and DDIM spherical interpolation, achieving higher attack success rates on four face recognition systems than prior methods...
ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis
cs.CV 2026-04 unverdicted novelty 7.0

ReImagine decouples human appearance from temporal consistency via pretrained image backbones, SMPL-X motion guidance, and training-free video diffusion refinement to generate high-quality controllable videos.
Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic
cs.AI 2026-04 unverdicted novelty 7.0

SAri-RFT applies GRPO-based reinforcement fine-tuning to LVLMs on novel two-term and three-term visual semantic arithmetic tasks, reaching SOTA on the new IRPD dataset and Visual7W-Telling.
Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.
Long-Text-to-Image Generation via Compositional Prompt Decomposition
cs.CV 2026-04 unverdicted novelty 7.0

PRISM lets pre-trained text-to-image models handle long prompts by breaking them into compositional parts, predicting noise separately, and merging outputs via energy-based conjunction, matching fine-tuned models whil...
Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion
cs.CV 2026-04 unverdicted novelty 7.0

3D-ARD+ unifies autoregressive token prediction with diffusion-based 3D latent generation to co-produce indoor scene layouts and object geometries that follow complex text-specified spatial and semantic constraints.
LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories
cs.CV 2026-04 unverdicted novelty 7.0

LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.
Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation
cs.CV 2026-04 unverdicted novelty 7.0

Flow of Truth is the first proactive temporal forensics framework for image-to-video generation that uses a learnable forensic template following pixel motion and a template-guided flow module to decouple motion from content.
Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation
cs.CV 2026-04 unverdicted novelty 7.0

Flow of Truth introduces a learnable forensic template and template-guided flow module that follows pixel motion to enable temporal tracing in image-to-video generation.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · cited by 372 Pith papers · 21 internal anchors

[1]

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, and Ming-Yu Liu. eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers. arXiv:2211.01324, 2022

work page internal anchor Pith review arXiv 2022
[2]

arXiv preprint arXiv:2303.04248 , year=

David Berthelot, Arnaud Autef, Jierui Lin, Dian Ang Yap, Shuangfei Zhai, Siyuan Hu, Daniel Zheng, Walter Talbot, and Eric Gu. TRACT: Denoising Diffusion Models with Transitive Closure Time-Distillation. arXiv:2303.04248, 2023

work page arXiv 2023
[3]

Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models

Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models. arXiv:2304.08818, 2023

work page arXiv 2023
[4]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

work page 2009
[5]

Diffusion Models Beat GANs on Image Synthesis

Prafulla Dhariwal and Alex Nichol. Diffusion Models Beat GANs on Image Synthesis. arXiv:2105.05233, 2021

work page internal anchor Pith review arXiv 2021
[6]

Distilling the Knowledge in Diffusion Models

Tim Dockhorn, Robin Rombach, Andreas Blattmann, and Yaoliang Yu. Distilling the Knowledge in Diffusion Models. CVPR Workshop on Generative Models for Computer Vision, 2023

work page 2023
[7]

Structure and content-guided video synthesis with diffusion models

Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models, 2023

work page 2023
[8]

Training-free structured diffusion guidance for compositional text-to-image synthesis

Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-free structured diffusion guidance for compositional text-to-image synthesis. arXiv:2212.05032, 2023

work page arXiv 2023
[9]

Riffusion - Stable diffusion for real-time music generation

Seth Forsgren and Hayk Martiros. Riffusion - Stable diffusion for real-time music generation, 2022. URL https://riffusion.com/about

work page 2022
[10]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv:2208.01618, 2022

work page internal anchor Pith review arXiv 2022
[11]

Diffusion with offset noise

Nicholas Guttenberg and CrossLabs. Diffusion with offset noise, 2023. URL https://www.crosslabs.org/blog/diffusion-with-offset-noise

work page 2023
[12]

GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. arXiv:1706.08500, 2017

work page Pith review arXiv 2017
[13]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-Free Diffusion Guidance. arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[14]

Denoising Diffusion Probabilistic Models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. arXiv preprint arXiv:2006.11239, 2020

work page internal anchor Pith review arXiv 2020
[15]

Imagen Video: High Definition Video Generation with Diffusion Models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, and Tim Salimans. Imagen Video: High Definition Video Generation with Diffusion Models. arXiv:2210.02303, 2022

work page internal anchor Pith review arXiv 2022
[16]

simple diffusion: End-to-end diffusion for high resolution images

Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. arXiv preprint arXiv:2301.11093, 2023

work page arXiv 2023
[17]

Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models

Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, and Zhou Zhao. Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models. arXiv:2301.12661, 2023

work page arXiv 2023
[18]

Estimation of Non-Normalized Statistical Models by Score Matching

Aapo Hyvärinen and Peter Dayan. Estimation of Non-Normalized Statistical Models by Score Matching. Journal of Machine Learning Research, 6(4), 2005

work page 2005
[19]

OpenCLIP

Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP, July 2021. URL https://doi.org/10.5281/zenodo.5143773

work page doi:10.5281/zenodo.5143773 2021
[20]

Distribution Augmentation for Generative Modeling

Heewoo Jun, Rewon Child, Mark Chen, John Schulman, Aditya Ramesh, Alec Radford, and Ilya Sutskever. Distribution Augmentation for Generative Modeling. In International Conference on Machine Learning, pages 5006–5019. PMLR, 2020

work page 2020
[21]

Elucidating the Design Space of Diffusion-Based Generative Models

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the Design Space of Diffusion-Based Generative Models. arXiv:2206.00364, 2022

work page internal anchor Pith review arXiv 2022
[22]

On architectural compression of text-to-image diffusion models

Bo-Kyeong Kim, Hyoung-Kyu Song, Thibault Castells, and Shinkook Choi. On Architectural Compression of Text-to-Image Diffusion Models. arXiv:2305.15798, 2023

work page arXiv 2023
[23]

Pick-a-pic: An open dataset of user preferences for text-to-image generation

Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. arXiv:2305.01569, 2023

work page arXiv 2023
[24]

SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

Yanyu Li, Huan Wang, Qing Jin, Ju Hu, Pavlo Chemerys, Yun Fu, Yanzhi Wang, Sergey Tulyakov, and Jian Ren. SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds. arXiv:2306.00980, 2023

work page arXiv 2023
[25]

Common Diffusion Noise Schedules and Sample Steps are Flawed

Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Common Diffusion Noise Schedules and Sample Steps are Flawed. arXiv:2305.08891, 2023

work page arXiv 2023
[26]

Microsoft COCO: Common Objects in Context

Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2015

work page 2015
[27]

Character-aware models improve visual text rendering

Rosanne Liu, Dan Garrette, Chitwan Saharia, William Chan, Adam Roberts, Sharan Narang, Irina Blok, RJ Mical, Mohammad Norouzi, and Noah Constant. Character-aware models improve visual text rendering, 2023

work page 2023
[28]

SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations. arXiv:2108.01073, 2021

work page internal anchor Pith review arXiv 2021
[29]

On distillation of guided diffusion models

Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik P. Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models, 2023

work page 2023
[30]

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. arXiv:2112.10741, 2021

work page internal anchor Pith review arXiv 2021
[31]

NovelAI improvements on Stable Diffusion

NovelAI. NovelAI improvements on Stable Diffusion, 2023. URL https://blog.novelai.net/novelai-improvements-on-stable-diffusion-e10d38db82ac

work page 2023
[32]

PyTorch: An imperative style, high-performance deep learning library

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performa...

work page 2019
[33]

Scalable Diffusion Models with Transformers

William Peebles and Saining Xie. Scalable Diffusion Models with Transformers. arXiv:2212.09748, 2022

work page internal anchor Pith review arXiv 2022
[34]

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[35]

How DALL·E 2 works

Aditya Ramesh. How DALL·E 2 works, 2022. URL http://adityaramesh.com/posts/dalle2/dalle2.html

work page 2022
[36]

Zero-shot text-to-image generation

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation, 2021

work page 2021
[37]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv:2204.06125, 2022

work page internal anchor Pith review arXiv 2022
[38]

High-Resolution Image Synthesis with Latent Diffusion Models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv preprint arXiv:2112.10752, 2021

work page Pith review arXiv 2021
[39]

U-Net: Convolutional Networks for Biomedical Image Segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv:1505.04597, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[40]

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. arXiv:2205.11487, 2022

work page internal anchor Pith review arXiv 2022
[41]

Progressive Distillation for Fast Sampling of Diffusion Models

Tim Salimans and Jonathan Ho. Progressive Distillation for Fast Sampling of Diffusion Models. arXiv preprint arXiv:2202.00512, 2022

work page internal anchor Pith review arXiv 2022
[42]

Improved Techniques for Training GANs

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved Techniques for Training GANs. arXiv:1606.03498, 2016

work page Pith review arXiv 2016
[43]

DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification

Sitian Shen, Zilin Zhu, Linqian Fan, Harry Zhang, and Xinxiao Wu. DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification. arXiv:2305.15957, 2023

work page arXiv 2023
[44]

Make-A-Video: Text-to-Video Generation without Text-Video Data

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-A-Video: Text-to-Video Generation without Text-Video Data. arXiv:2209.14792, 2022

work page internal anchor Pith review arXiv 2022
[45]

Deep Unsupervised Learning using Nonequilibrium Thermodynamics

Jascha Sohl-Dickstein, Eric A Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. arXiv:1503.03585, 2015

work page internal anchor Pith review arXiv 2015
[46]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv:2010.02502, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2020
[47]

Score-Based Generative Modeling through Stochastic Differential Equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-Based Generative Modeling through Stochastic Differential Equations. arXiv:2011.13456, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2020
[48]

Evaluating a synthetic image dataset generated with stable diffusion

Andreas Stöckl. Evaluating a synthetic image dataset generated with stable diffusion. arXiv:2211.01777, 2022

work page arXiv 2022
[49]

High-Resolution Image Reconstruction With Latent Diffusion Models From Human Brain Activity

Yu Takagi and Shinji Nishimoto. High-Resolution Image Reconstruction With Latent Diffusion Models From Human Brain Activity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14453–14463, 2023

work page 2023
[50]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[51]

Boosting gui prototyping with diffusion models

Jialiang Wei, Anne-Lise Courbis, Thomas Lambolais, Binbin Xu, Pierre Louis Bernard, and Gérard Dray. Boosting gui prototyping with diffusion models. arXiv preprint arXiv:2306.06233, 2023

work page arXiv 2023
[52]

ByT5: Towards a token-free future with pre-trained byte-to-byte models

Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. Byt5: Towards a token-free future with pre-trained byte-to-byte models, 2022

work page 2022
[53]

Scaling autoregressive models for content-rich text-to-image generation

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation, 2022

work page 2022
[54]

Adding Conditional Control to Text-to-Image Diffusion Models

Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv:2302.05543, 2023

work page internal anchor Pith review arXiv 2023
[55]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric, 2018

work page 2018
