
arxiv: 2307.01952 · v1 · pith:B3L3LJBTnew · submitted 2023-07-04 · 💻 cs.CV · cs.AI

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Pith reviewed 2026-05-10 15:16 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords SDXL · latent diffusion · text-to-image synthesis · UNet scaling · refinement model · Stable Diffusion · conditioning schemes · high-resolution generation

The pith

SDXL scales up the UNet and adds conditioning plus refinement to make latent diffusion competitive with closed image generators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SDXL as an upgraded latent diffusion model for text-to-image generation that uses a UNet backbone three times larger than prior Stable Diffusion versions, achieved mainly by adding attention blocks and a second text encoder. It introduces new conditioning methods, trains across multiple aspect ratios, and pairs the base model with a separate refinement network that runs post-hoc image-to-image processing to raise visual quality. These changes produce outputs that exceed earlier open Stable Diffusion models and reach levels comparable to proprietary black-box systems. The work releases code and weights to support open research. Readers would care because it offers a transparent path to high-fidelity image synthesis without depending on closed commercial services.

Core claim

We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared to the previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators.

What carries the argument

The three-times-larger UNet backbone with added attention blocks and a second text encoder for expanded cross-attention context, together with novel conditioning schemes, multi-aspect-ratio training, and a post-hoc refinement model that performs image-to-image enhancement.
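One way to picture the "expanded cross-attention context" is to concatenate the per-token outputs of two text encoders along the channel axis, so each prompt token carries a wider feature vector into cross-attention. The sketch below is a toy illustration only: the token count (77) and the encoder widths (768 and 1280) are assumptions made for this example, and plain lists stand in for real encoder outputs.

```python
# Toy illustration: widen the cross-attention context by concatenating
# per-token embeddings from two text encoders along the channel axis.
# NUM_TOKENS, WIDTH_A and WIDTH_B are assumed values, not taken from this page.

def concat_text_context(tokens_a, tokens_b):
    """Concatenate two [num_tokens][width] embedding lists channel-wise."""
    if len(tokens_a) != len(tokens_b):
        raise ValueError("both encoders must emit the same number of tokens")
    return [a + b for a, b in zip(tokens_a, tokens_b)]

NUM_TOKENS = 77               # assumed padded prompt length
WIDTH_A, WIDTH_B = 768, 1280  # assumed output widths of encoder 1 and encoder 2

enc_a = [[0.0] * WIDTH_A for _ in range(NUM_TOKENS)]  # stand-in for encoder 1 output
enc_b = [[1.0] * WIDTH_B for _ in range(NUM_TOKENS)]  # stand-in for encoder 2 output

context = concat_text_context(enc_a, enc_b)
print(len(context), len(context[0]))  # prints: 77 2048
```

The number of tokens is unchanged; only the per-token width grows, which is why the cross-attention projection layers (and hence the parameter count) grow with it.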

If this is right

  • Image synthesis quality improves markedly over earlier open Stable Diffusion releases.
  • The model handles variable aspect ratios without retraining.
  • A lightweight post-processing step further raises fidelity of base outputs.
  • Open release of weights and code enables community inspection and extension.
  • Performance reaches parity with certain closed commercial generators on visual metrics.
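
A common way to train and sample across several aspect ratios is bucketing: each image is assigned to the predefined resolution whose aspect ratio is closest, and batches are drawn from a single bucket. The sketch below illustrates that idea under stated assumptions; the bucket list is hypothetical and is not the set of resolutions used by the paper.

```python
import math

# Hypothetical (height, width) buckets, each with roughly 1024 * 1024 pixels.
BUCKETS = [(512, 2048), (768, 1344), (1024, 1024), (1344, 768), (2048, 512)]

def nearest_bucket(height, width, buckets=BUCKETS):
    """Return the bucket whose aspect ratio is closest in log space."""
    target = math.log(height / width)
    return min(buckets, key=lambda hw: abs(math.log(hw[0] / hw[1]) - target))

def group_by_bucket(image_sizes):
    """Group (height, width) pairs so that each batch shares one resolution."""
    batches = {}
    for size in image_sizes:
        batches.setdefault(nearest_bucket(*size), []).append(size)
    return batches

groups = group_by_bucket([(1000, 1000), (600, 1100), (1080, 1920), (1900, 500)])
for bucket, members in groups.items():
    print(bucket, members)
```

Comparing ratios in log space treats a 2:1 and a 1:2 image symmetrically, which a plain difference of ratios would not.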

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Open models may narrow the gap with proprietary systems through targeted architectural scaling rather than data secrecy alone.
  • The refinement stage could be adapted as a modular add-on for other diffusion pipelines.
  • Wider availability might shift user workflows away from paid API calls toward local or fine-tuned open alternatives.

Load-bearing premise

The reported gains in image quality come chiefly from the architectural scaling, conditioning additions, and refinement step rather than from undisclosed increases in training data volume, curation quality, or total compute.

What would settle it

A controlled re-training of SDXL and a prior Stable Diffusion baseline on identical data and hardware, followed by direct side-by-side evaluation on the same prompts, would show whether architecture alone explains the quality jump.
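
For the side-by-side evaluation itself, one simple analysis is a paired preference count with an exact sign test. The sketch below assumes one judgment per prompt, recorded as 'a', 'b', or 'tie'; the function name and the label scheme are illustrative choices, not something the paper specifies.

```python
from math import comb

def paired_win_rate(prefs):
    """Summarize per-prompt judgments ('a', 'b', or 'tie') for two models.

    Returns (win rate of model a among decided prompts, two-sided
    exact sign-test p-value under the null of equal preference).
    """
    wins_a = prefs.count("a")
    wins_b = prefs.count("b")
    decided = wins_a + wins_b
    if decided == 0:
        return 0.5, 1.0  # nothing decided: no evidence either way
    k = max(wins_a, wins_b)
    # Probability of k or more wins out of `decided` fair coin flips.
    tail = sum(comb(decided, i) for i in range(k, decided + 1)) / 2 ** decided
    return wins_a / decided, min(1.0, 2 * tail)

rate, p_value = paired_win_rate(["a"] * 9 + ["b"] + ["tie"] * 2)
print(rate, p_value)
```

Ties are dropped before testing, which is the usual convention for a sign test; a study with many ties would need to report them separately.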

Original abstract

We present SDXL, a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone: The increase of model parameters is mainly due to more attention blocks and a larger cross-attention context as SDXL uses a second text encoder. We design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. We also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL shows drastically improved performance compared to the previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators. In the spirit of promoting open research and fostering transparency in large model training and evaluation, we provide access to code and model weights at https://github.com/Stability-AI/generative-models

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper introduces SDXL, a latent diffusion model for text-to-image synthesis. It employs a UNet backbone three times larger than prior Stable Diffusion versions, achieved mainly through additional attention blocks and a second text encoder enabling larger cross-attention context. Novel conditioning schemes are proposed, the model is trained across multiple aspect ratios, and a refinement model is added for post-hoc image-to-image fidelity improvement. The central claim is that SDXL achieves drastically improved performance over previous Stable Diffusion versions while remaining competitive with black-box state-of-the-art generators; code and model weights are released.

Significance. If the empirical performance claims hold under independent verification, this constitutes a meaningful open contribution to high-resolution text-to-image synthesis by providing a transparent, reproducible baseline that can accelerate community research. The explicit release of code and weights is a clear strength that directly supports falsifiability of the reported gains.

major comments (1)
  1. [Abstract and results sections] The central empirical claim (drastically improved performance and competitiveness with closed SOTA models) rests on comparisons whose quantitative details, ablation controls, and dataset statistics are not fully elaborated in the provided text. Without explicit tables reporting metrics such as FID or CLIP scores on fixed benchmarks, together with controls for training data volume and curation, it remains difficult to isolate the contribution of the architectural changes (larger UNet, second text encoder, conditioning schemes) from possible differences in compute or data.
minor comments (3)
  1. [Methods] Clarify the exact training data composition and aspect-ratio sampling strategy in the methods section to allow readers to assess potential data-related confounds.
  2. [Figures] Add captions to all qualitative figures that explicitly state the prompt, sampling parameters, and which model variant is shown in each panel.
  3. [Abstract] Verify that the released GitHub repository contains the exact model weights, inference code, and evaluation scripts referenced in the paper.
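
The major comment asks for FID on fixed benchmarks. For reference, FID is conventionally the Fréchet distance between two Gaussians fitted to Inception features of real and generated images. The symbols below are introduced here for illustration and do not come from the paper: $\mu_r, \Sigma_r$ are the feature mean and covariance of the real set, and $\mu_g, \Sigma_g$ those of the generated set.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Lower is better, and the value is zero when both feature distributions have identical mean and covariance. Because the statistic depends on the reference set and sample count, scores are only comparable when both are held fixed.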

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and recommendation of minor revision. We address the major comment below and will incorporate the suggested improvements in the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract and results sections] The central empirical claim (drastically improved performance and competitiveness with closed SOTA models) rests on comparisons whose quantitative details, ablation controls, and dataset statistics are not fully elaborated in the provided text. Without explicit tables reporting metrics such as FID or CLIP scores on fixed benchmarks, together with controls for training data volume and curation, it remains difficult to isolate the contribution of the architectural changes (larger UNet, second text encoder, conditioning schemes) from possible differences in compute or data.

    Authors: We agree that more explicit quantitative details would help clarify the contributions of the architectural and conditioning changes. The manuscript presents extensive qualitative results and some supporting metrics demonstrating the performance gains, but we acknowledge that additional tables with FID and CLIP scores on fixed benchmarks (e.g., MS-COCO), together with more detailed ablation controls, would strengthen the isolation of effects from the larger UNet, second text encoder, and novel conditioning. In the revised version we will add these tables and expand the description of training data aspects (including multi-aspect-ratio sampling) to the extent feasible. We note that the release of code and model weights directly enables independent verification, further ablations, and community evaluation on any desired benchmarks, which addresses the core concern of reproducibility.
    revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical claims

Full rationale

The paper describes an empirical engineering effort: a larger UNet with additional attention blocks, a second text encoder, novel conditioning schemes, multi-aspect-ratio training, and a post-hoc refinement model. The central claim of improved performance over prior Stable Diffusion versions and competitiveness with closed SOTA models rests on released weights, code, and external visual/qualitative comparisons rather than any internal derivation, fitted parameter renamed as prediction, or self-citation chain. No equations, uniqueness theorems, or ansatzes are invoked that reduce to the paper's own inputs by construction; the work is self-contained against verifiable external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The performance claims rest on standard latent diffusion assumptions plus empirical training; no new physical entities or ad-hoc constants are introduced beyond typical model hyperparameters.

free parameters (1)
  • UNet backbone scale
    Model size increased by factor of three via more attention blocks; chosen through design and training to improve capacity.
axioms (1)
  • domain assumption: Latent diffusion models generate images by iteratively denoising in a compressed latent space conditioned on text embeddings.
    Core modeling assumption underlying the entire SDXL architecture and training procedure.
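
The iterative-denoising assumption can be made concrete with a toy deterministic (DDIM-style) reverse loop. This is a minimal sketch under stated assumptions, not SDXL's sampler: the noise schedule is made up, the latent is a three-element list, and `oracle_noise` is a hypothetical stand-in for the text-conditioned UNet that simply knows the target latent. Decoding the latent back to pixels is omitted.

```python
import math

def ddim_sample(x_T, alphas_bar, predict_noise):
    """Deterministic (DDIM-style) reverse process on a list of floats.

    alphas_bar[t] is the cumulative signal level at step t, with
    alphas_bar[0] == 1.0 (clean) and smaller values meaning more noise.
    """
    x = list(x_T)
    for t in range(len(alphas_bar) - 1, 0, -1):
        a_t, a_prev = alphas_bar[t], alphas_bar[t - 1]
        eps = predict_noise(x, t)
        # Estimate the clean latent from the current sample and predicted noise.
        x0_hat = [(xi - math.sqrt(1 - a_t) * ei) / math.sqrt(a_t)
                  for xi, ei in zip(x, eps)]
        # Re-noise the estimate to the (lower) noise level of the previous step.
        x = [math.sqrt(a_prev) * x0 + math.sqrt(1 - a_prev) * ei
             for x0, ei in zip(x0_hat, eps)]
    return x

TARGET = [0.5, -1.0, 2.0]               # hypothetical clean latent
ALPHAS_BAR = [1.0, 0.8, 0.5, 0.2, 0.05]  # made-up schedule

def oracle_noise(x, t):
    """Hypothetical perfect denoiser: returns the exact noise in x at step t."""
    a_t = ALPHAS_BAR[t]
    return [(xi - math.sqrt(a_t) * x0) / math.sqrt(1 - a_t)
            for xi, x0 in zip(x, TARGET)]

sample = ddim_sample([0.3, 0.3, 0.3], ALPHAS_BAR, oracle_noise)
print(sample)
```

With a perfect noise predictor the loop lands on the target latent regardless of the starting point; a learned model only approximates that prediction, which is where model capacity and conditioning quality enter.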

pith-pipeline@v0.9.0 · 5469 in / 1277 out tokens · 78062 ms · 2026-05-10T15:16:09.227947+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SARLO-80: Worldwide Slant SAR Language Optic Dataset 80cm

    cs.CV 2026-06 accept novelty 8.0

    SARLO-80 is a new public dataset of 119566 complex SAR-optical-text triplets standardized to 80cm slant-range resolution from 257 locations across 72 countries.

  2. On the Generation and Mitigation of Harmful Geometry in Image-to-3D Models

    cs.CR 2026-05 conditional novelty 8.0

    Image-to-3D models successfully generate harmful geometries in most cases with under 0.3% caught by commercial filters; existing safeguards are weak but a stacked defense cuts harmful outputs to under 1% at 11% false-...

  3. When Do Diffusion Models learn to Generate Multiple Objects?

    cs.CV 2026-04 unverdicted novelty 8.0

    Diffusion models' multi-object generation is limited primarily by scene complexity and held-out combinations rather than imbalance, with counting difficult in low data and compositional generalization collapsing as mo...

  4. Flow-GRPO: Training Flow Matching Models via Online RL

    cs.CV 2025-05 unverdicted novelty 8.0

    Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

  5. Lie Group Diffusion Models for Hardware-Aware Quantum Circuit Synthesis

    quant-ph 2026-06 unverdicted novelty 7.0

    Lie group diffusion models combine a discrete circuit skeleton selector with continuous diffusion on SU(2) ≃ S³ to synthesize hardware-aware quantum circuits, outperforming baselines on three-qubit Hamiltonian simulat...

  6. Diffusion Model Attribution via Spectral Coupling of Denoiser Responses

    cs.CV 2026-06 unverdicted novelty 7.0

    SDS extracts stable spectral signatures from diffusion model denoisers via frequency-controlled perturbations, achieving 99.9% attribution accuracy across eight models and 96.2% under prompt shift.

  7. From Celebrities to Anyone: Characterizing AI Nudification Content, Technology, and Community Dynamics on 4chan

    cs.CY 2026-06 unverdicted novelty 7.0

    Large-scale study of AI nudification on 4chan identifies 24,105 items showing a shift to 55.8% non-celebrity targets and dominance of open-source models like Stable Diffusion.

  8. Focusing on What Matters: Saliency-Harnessing Accurate Routing for Diffusion MoE

    cs.CV 2026-06 unverdicted novelty 7.0

    SharpMoE is a plug-and-play post-training method that uses clean latent features and a trajectory routing loss to enable accurate saliency-based routing in diffusion MoE models for improved visual generation.

  9. Do Image Editing Models Understand Lighting?

    cs.CV 2026-06 unverdicted novelty 7.0

    New 3DLP benchmark with real-world 1K HDR pairs shows state-of-the-art image editing models vary in physical lighting consistency, with best models close to reality but error-prone in low-light regions.

  10. Trustworthy Image Authentication using Forensic Knowledge Graphs

    cs.CV 2026-06 unverdicted novelty 7.0

    Forensic Knowledge Graphs integrate forensic traces, causal dependencies, and scene links via a new authentication network and Iterative Context Refinement to outperform standard detectors and VLMs on detection, local...

  11. DiT-Reward: Generative Representations for Text-to-Image Reward Modeling

    cs.LG 2026-06 unverdicted novelty 7.0

    DiT-Reward converts pretrained DiT models into reward predictors that outperform HPSv3 on four benchmarks while providing 1.65x inference speedup.

  12. SARLO-80: Worldwide Slant SAR Language Optic Dataset 80cm

    cs.CV 2026-06 conditional novelty 7.0

    SARLO-80 provides 119,566 complex SAR-optical-text triplets at 80 cm slant-range resolution with fixed splits and preprocessing code.

  13. Forged Calamity: Benchmark for Cross-Domain Synthetic Disaster Detection in the Age of Diffusion

    cs.CV 2026-06 unverdicted novelty 7.0

    Introduces Forged Calamity benchmark and shows that fine-tuned and zero-shot synthetic image detectors lose substantial accuracy on unseen generators and disaster types.

  14. InterleaveThinker: Reinforcing Agentic Interleaved Generation

    cs.CV 2026-06 unverdicted novelty 7.0

    InterleaveThinker is the first multi-agent pipeline enabling interleaved generation in any image generator through planner-critic agents, SFT on custom datasets, and GRPO RL with accuracy and step-wise rewards.

  15. Adv-TGD: Adversarial Text-Guided Diffusion for Face Recognition Impersonation Attacks

    cs.CV 2026-06 unverdicted novelty 7.0

    Adv-TGD is a text-guided diffusion attack that achieves 85.9% black-box ASR on four face recognition models while preserving PSNR 28.18 dB and SSIM 0.981.

  16. IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation

    cs.AI 2026-06 unverdicted novelty 7.0

    IMUG-Bench is a new multi-turn interleaved image-text benchmark that exposes exposure bias in unified multimodal model generation and shows test-time scaling can mitigate it.

  17. HACK++: Towards More Effective Head-Aware Key-Value Compression for Efficient Visual Autoregressive Modeling

    cs.CV 2026-06 unverdicted novelty 7.0

    HACK++ is a head-aware KV cache compression framework for VAR models that decouples current-scale attention from historical cache under adaptive per-head budgets to achieve near-lossless generation at 30% attention an...

  18. DRIFT: From Robustness Gaps to Invariance Manifolds for AI-Generated Image Detection

    cs.CV 2026-06 unverdicted novelty 7.0

    DRIFT learns a structured invariance manifold from real images via one-class supervision on decomposed robust and fragile subspaces of a frozen VFM to detect AI-generated images through margin violations.

  19. Text-to-Image Models Need Less from Text Encoders Than You Think

    cs.CV 2026-06 unverdicted novelty 7.0

    A bag-of-position-tagged-words embedding guides text-to-image diffusion models as effectively as full contextual text embeddings from standard encoders.

  20. ImageAuditor: Membership Inference Attack against Image-based Retrieval-Augmented Generation

    cs.CR 2026-06 unverdicted novelty 7.0

    ImageAuditor is the first MIA for IRAG that achieves over 80% AUROC with four queries by using reward-guided policy optimization for cross-modal retrieval and task-specific prompting for signal extraction.

  21. OctoT2I: A Self-Evolving Agentic Text-to-Image Router

    cs.AI 2026-06 unverdicted novelty 7.0

    OctoT2I uses a no-supervision PSEL loop to discover model capability frontiers and route T2I tasks, reaching 0.96 GenEval score with 90.3% speedup over Flow-GRPO.

  22. Orthogonal Negative Guidance in Attention Feature Space for Text-to-Image Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    Orthogonal Negative Guidance subtracts only the orthogonal component of negative-prompt attention features from positive ones in FLUX models to suppress concepts while preserving semantics and quality.

  23. Towards Anatomically Plausible Human Image Generation via Synthetic Localized Preferences

    cs.CV 2026-05 unverdicted novelty 7.0

    ASAP generates over 10K synthetic anatomical preference pairs via targeted degradation of high-fidelity images and applies a localized margin-bounded DPO to reduce anatomical errors in text-to-image human generation, ...

  24. DRM: Diffusion-based Reward Model With Step-wise Guidance

    cs.CV 2026-05 unverdicted novelty 7.0

    DRM turns a pre-trained diffusion model into a step-wise reward model and uses it for dense RL training (Step-wise GRPO) and guided sampling to improve final image quality.

  25. Inference-Time Alignment of Diffusion Models via Trust-Region Iterative Twisted Sequential Monte Carlo

    cs.LG 2026-05 conditional novelty 7.0

    TRI-TSMC is a trust-region framework for learning twisting functions in SMC-based inference-time alignment of diffusion models that yields zero-variance samplers in theory and better alignment on text and image tasks ...

  26. Point Tracking Improves World Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.

  27. Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

    cs.CV 2026-05 unverdicted novelty 7.0

    Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.

  28. Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Linear-DPO replaces sigmoid utility with linear utility and adds EMA reference to improve preference alignment in diffusion and flow-matching text-to-image models.

  29. GeoDiff-SAR II: 3D-Driven Foundation Diffusion Models for SAR Generation via Decoupled Control

    eess.IV 2026-05 unverdicted novelty 7.0

    GeoDiff-SAR II proposes a 3D-driven decoupled diffusion framework using GECM and ControlNet on a FLUX backbone for controllable SAR image generation across large viewpoint gaps.

  30. AnyAct: Towards Human Reenactment of Character Motion From Video

    cs.CV 2026-05 unverdicted novelty 7.0

    AnyAct generates plausible human reenactments from non-human character videos via conditional motion generation from transferable sparse local 2D articulated cues, using human-only supervision, progressive training, a...

  31. From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

    cs.CV 2026-05 unverdicted novelty 7.0

    A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.

  32. OP4KSR: One-Step Patch-Free 4K Super-Resolution with Periodic Artifact Suppression

    cs.CV 2026-05 unverdicted novelty 7.0

    OP4KSR enables efficient one-step 4K super-resolution without patches by adapting Flux with RoPE rescaling and periodicity loss to suppress artifacts.

  33. Context Matters: Auditing Gender Bias in T2I Generation through Risk-Tiered Use-Case Profiles

    cs.CY 2026-05 unverdicted novelty 7.0

    A new framework called THUMB cards organizes gender bias metrics for T2I models by risk-tiered use cases, measurement categories, and harm typologies aligned with the EU AI Act.

  34. ImageAttributionBench: How Far Are We from Generalizable Attribution?

    cs.CV 2026-05 unverdicted novelty 7.0

    ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.

  35. Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.

  36. Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm

    cs.CV 2026-05 unverdicted novelty 7.0

    Proposes V2V-Zero, a training-free framework replacing text conditioning with VLM final-layer hidden states from visual pages, achieving 0.85 on GenEval and 32.7/100 on new Simple-V2V Bench across models including vid...

  37. Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models

    cs.CV 2026-05 unverdicted novelty 7.0

    Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming ...

  38. LatentHDR: Decoupling Exposure from Diffusion via Conditional Latent-to-Latent Mapping for Text/Image-to-Panoramic HDR

    cs.CV 2026-05 unverdicted novelty 7.0

    LatentHDR generates structurally consistent panoramic HDR images by producing one scene latent with a diffusion backbone then deterministically mapping it to multiple exposure latents via a lightweight conditional head.

  39. ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models

    cs.CV 2026-05 unverdicted novelty 7.0

    ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.

  40. Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs

    cs.CV 2026-05 unverdicted novelty 7.0

    PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.

  41. Dependency-Aware Discrete Diffusion for Scene Graph Generation

    cs.CV 2026-05 unverdicted novelty 7.0

    A new discrete diffusion model for scene graph generation from text captures object-relation dependencies via hierarchical constraints and training-free conditioning, yielding better graph metrics and downstream image...

  42. Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models

    cs.CV 2026-05 unverdicted novelty 7.0

    ArenaPO infers Gaussian capability distributions from pairwise preferences and applies truncated-normal latent inference to derive fine-grained offline rewards for preference optimization of text-to-image diffusion models.

  43. Fusion in Your Way: Aligning Image Fusion with Heterogeneous Demands via Direct Preference Optimization

    cs.CV 2026-05 unverdicted novelty 7.0

    DPOFusion uses direct preference optimization on property-aligned and preference-controllable latent diffusion models to produce adaptive infrared-visible image fusions aligned with heterogeneous human and machine vis...

  44. D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

    cs.CV 2026-05 unverdicted novelty 7.0

    D-OPSD formulates supervised fine-tuning of step-distilled diffusion models as on-policy self-distillation by minimizing distribution differences between a text-only student and a multimodal teacher on the student's o...

  45. From Diffusion to Rectified Flow: Rethinking Text-Based Segmentation

    cs.CV 2026-05 unverdicted novelty 7.0

    RLFSeg repurposes pretrained generative models via Rectified Flow for direct latent-space image-to-mask mapping in text-based segmentation, outperforming diffusion-based methods especially in zero-shot cases.

  46. D-Rex : Diffusion Rendering for Relightable Expressive Avatars

    cs.GR 2026-04 conditional novelty 7.0

    D-Rex applies a LoRA-fine-tuned video diffusion model as an image-space post-process to add consistent relighting to any expressive full-body avatar pipeline while preserving motion and facial detail.

  47. GeoEdit: Local Frames for Fast, Training-Free On-Manifold Editing in Diffusion Models

    cs.LG 2026-04 unverdicted novelty 7.0

    GeoEdit constructs local tangent frames from small perturbations to initial noise, enabling Jacobian-free on-manifold edits in diffusion models via alternating tangent steps and diffusion projections.

  48. Bridging Restoration and Generation Manifolds in One-Step Diffusion for Real-World Super-Resolution

    cs.CV 2026-04 unverdicted novelty 7.0

    IDaS-SR achieves one-step real-world super-resolution by bridging restoration and generation manifolds via adaptive inversion noise estimation and continuous trajectory steering.

  49. Geometry-Conditioned Diffusion for Occlusion-Robust In-Bed Pose Estimation

    cs.CV 2026-04 unverdicted novelty 7.0

    Pose-LDM generates occluded in-bed images from keypoints to augment training data, achieving top accuracy under severe occlusion compared to other augmentation methods.

  50. Oracle Noise: Faster Semantic Spherical Alignment for Interpretable Latent Optimization

    cs.CV 2026-04 unverdicted novelty 7.0

    Oracle Noise optimizes diffusion model noise on a Riemannian hypersphere guided by key prompt words to preserve the Gaussian prior, eliminate norm inflation, and achieve faster semantic alignment than Euclidean methods.

  51. $Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a di...

  52. DCMorph: Face Morphing via Dual-Stream Cross-Attention Diffusion

    cs.CV 2026-04 unverdicted novelty 7.0

    DCMorph generates face morphs via decoupled cross-attention in identity-conditioned diffusion and DDIM spherical interpolation, achieving higher attack success rates on four face recognition systems than prior methods...

  53. ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis

    cs.CV 2026-04 unverdicted novelty 7.0

    ReImagine decouples human appearance from temporal consistency via pretrained image backbones, SMPL-X motion guidance, and training-free video diffusion refinement to generate high-quality controllable videos.

  54. Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic

    cs.AI 2026-04 unverdicted novelty 7.0

    SAri-RFT applies GRPO-based reinforcement fine-tuning to LVLMs on novel two-term and three-term visual semantic arithmetic tasks, reaching SOTA on the new IRPD dataset and Visual7W-Telling.

  55. Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    GDMD replaces raw-sample rewards with distillation-gradient rewards in RL-guided diffusion distillation, yielding 4-step models that surpass their multi-step teachers on GenEval and human preference metrics.

  56. Long-Text-to-Image Generation via Compositional Prompt Decomposition

    cs.CV 2026-04 unverdicted novelty 7.0

    PRISM lets pre-trained text-to-image models handle long prompts by breaking them into compositional parts, predicting noise separately, and merging outputs via energy-based conjunction, matching fine-tuned models whil...

  57. Co-generation of Layout and Shape from Text via Autoregressive 3D Diffusion

    cs.CV 2026-04 unverdicted novelty 7.0

    3D-ARD+ unifies autoregressive token prediction with diffusion-based 3D latent generation to co-produce indoor scene layouts and object geometries that follow complex text-specified spatial and semantic constraints.

  58. LeapAlign: Post-Training Flow Matching Models at Any Generation Step by Building Two-Step Trajectories

    cs.CV 2026-04 unverdicted novelty 7.0

    LeapAlign fine-tunes flow matching models by constructing two consecutive leaps that skip multiple ODE steps with randomized timesteps and consistency weighting, enabling stable updates at any generation step.

  59. Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Flow of Truth is the first proactive temporal forensics framework for image-to-video generation that uses a learnable forensic template following pixel motion and a template-guided flow module to decouple motion from content.

  60. Flow of Truth: Proactive Temporal Forensics for Image-to-Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Flow of Truth introduces a learnable forensic template and template-guided flow module that follows pixel motion to enable temporal tracing in image-to-video generation.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · cited by 372 Pith papers · 21 internal anchors

  1. [1]

    eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

    Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, and Ming-Yu Liu. eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers. arXiv:2211.01324, 2022

  2. [2]

    TRACT: Denoising Diffusion Models with Transitive Closure Time-Distillation

    David Berthelot, Arnaud Autef, Jierui Lin, Dian Ang Yap, Shuangfei Zhai, Siyuan Hu, Daniel Zheng, Walter Talbot, and Eric Gu. TRACT: Denoising Diffusion Models with Transitive Closure Time-Distillation. arXiv:2303.04248, 2023

  3. [3]

    Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models. arXiv:2304.08818, 2023

  4. [4]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009

  5. [5]

    Diffusion Models Beat GANs on Image Synthesis

    Prafulla Dhariwal and Alex Nichol. Diffusion Models Beat GANs on Image Synthesis. arXiv:2105.05233, 2021

  6. [6]

    Distilling the Knowledge in Diffusion Models

    Tim Dockhorn, Robin Rombach, Andreas Blattmann, and Yaoliang Yu. Distilling the Knowledge in Diffusion Models. CVPR Workshop on Generative Models for Computer Vision, 2023

  7. [7]

    Structure and content-guided video synthesis with diffusion models

    Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models, 2023

  8. [8]

    Training-free structured diffusion guidance for compositional text-to-image synthesis

    Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-free structured diffusion guidance for compositional text-to-image synthesis. arXiv:2212.05032, 2023

  9. [9]

    Riffusion - Stable diffusion for real-time music generation

    Seth Forsgren and Hayk Martiros. Riffusion - Stable diffusion for real-time music generation, 2022. URL https://riffusion.com/about

  10. [10]

    An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv:2208.01618, 2022

  11. [11]

    Diffusion with offset noise

    Nicholas Guttenberg and CrossLabs. Diffusion with offset noise, 2023. URL https://www.crosslabs.org/blog/diffusion-with-offset-noise

  12. [12]

    GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. arXiv:1706.08500, 2017

  13. [13]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-Free Diffusion Guidance. arXiv:2207.12598, 2022

  14. [14]

    Denoising Diffusion Probabilistic Models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. arXiv:2006.11239, 2020

  15. [15]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, and Tim Salimans. Imagen Video: High Definition Video Generation with Diffusion Models. arXiv:2210.02303, 2022

  16. [16]

    simple diffusion: End-to-end diffusion for high resolution images

    Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. arXiv:2301.11093, 2023

  17. [17]

    Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models

    Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, and Zhou Zhao. Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models. arXiv:2301.12661, 2023

  18. [18]

    Estimation of Non-Normalized Statistical Models by Score Matching

    Aapo Hyvärinen and Peter Dayan. Estimation of Non-Normalized Statistical Models by Score Matching. Journal of Machine Learning Research, 6(4), 2005

  19. [19]

    OpenCLIP

    Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP, July 2021. URL https://doi.org/10.5281/zenodo.5143773

  20. [20]

    Distribution Augmentation for Generative Modeling

    Heewoo Jun, Rewon Child, Mark Chen, John Schulman, Aditya Ramesh, Alec Radford, and Ilya Sutskever. Distribution Augmentation for Generative Modeling. In International Conference on Machine Learning, pages 5006–5019. PMLR, 2020

  21. [21]

    Elucidating the Design Space of Diffusion-Based Generative Models

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the Design Space of Diffusion-Based Generative Models. arXiv:2206.00364, 2022

  22. [22]

    On architectural compression of text-to-image diffusion models

    Bo-Kyeong Kim, Hyoung-Kyu Song, Thibault Castells, and Shinkook Choi. On Architectural Compression of Text-to-Image Diffusion Models. arXiv:2305.15798, 2023

  23. [23]

    Pick-a-pic: An open dataset of user preferences for text-to-image generation

    Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-pic: An open dataset of user preferences for text-to-image generation. arXiv:2305.01569, 2023

  24. [24]

    SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

    Yanyu Li, Huan Wang, Qing Jin, Ju Hu, Pavlo Chemerys, Yun Fu, Yanzhi Wang, Sergey Tulyakov, and Jian Ren. SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds. arXiv:2306.00980, 2023

  25. [25]

    Common Diffusion Noise Schedules and Sample Steps are Flawed

    Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Common Diffusion Noise Schedules and Sample Steps are Flawed. arXiv:2305.08891, 2023

  26. [26]

    Microsoft COCO: Common Objects in Context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2015

  27. [27]

    Character-aware models improve visual text rendering

    Rosanne Liu, Dan Garrette, Chitwan Saharia, William Chan, Adam Roberts, Sharan Narang, Irina Blok, RJ Mical, Mohammad Norouzi, and Noah Constant. Character-aware models improve visual text rendering, 2023

  28. [28]

    SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

    Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations. arXiv:2108.01073, 2021

  29. [29]

    On distillation of guided diffusion models

    Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik P. Kingma, Stefano Ermon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models, 2023

  30. [30]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. arXiv:2112.10741, 2021

  31. [31]

    Novelai improvements on stable diffusion

    NovelAI. Novelai improvements on stable diffusion, 2023. URL https://blog.novelai.net/novelai-improvements-on-stable-diffusion-e10d38db82ac

  32. [32]

    Pytorch: An imperative style, high-performance deep learning library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performa...

  33. [33]

    Scalable Diffusion Models with Transformers

    William Peebles and Saining Xie. Scalable Diffusion Models with Transformers. arXiv:2212.09748, 2022

  34. [34]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020, 2021

  35. [35]

    How dall·e 2 works

    Aditya Ramesh. How dall·e 2 works, 2022. URL http://adityaramesh.com/posts/dalle2/dalle2.html

  36. [36]

    Zero-shot text-to-image generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation, 2021

  37. [37]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv:2204.06125, 2022

  38. [38]

    High-Resolution Image Synthesis with Latent Diffusion Models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752, 2021

  39. [39]

    U-Net: Convolutional Networks for Biomedical Image Segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv:1505.04597, 2015

  40. [40]

    Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. arXiv:2205.11487, 2022

  41. [41]

    Progressive Distillation for Fast Sampling of Diffusion Models

    Tim Salimans and Jonathan Ho. Progressive Distillation for Fast Sampling of Diffusion Models. arXiv:2202.00512, 2022

  42. [42]

    Improved Techniques for Training GANs

    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved Techniques for Training GANs. arXiv:1606.03498, 2016

  43. [43]

    DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification

    Sitian Shen, Zilin Zhu, Linqian Fan, Harry Zhang, and Xinxiao Wu. DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification. arXiv:2305.15957, 2023

  44. [44]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-A-Video: Text-to-Video Generation without Text-Video Data. arXiv:2209.14792, 2022

  45. [45]

    Deep Unsupervised Learning using Nonequilibrium Thermodynamics

    Jascha Sohl-Dickstein, Eric A Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. arXiv:1503.03585, 2015

  46. [46]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv:2010.02502, 2020

  47. [47]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-Based Generative Modeling through Stochastic Differential Equations. arXiv:2011.13456, 2020

  48. [48]

    Evaluating a synthetic image dataset generated with stable diffusion

    Andreas Stöckl. Evaluating a synthetic image dataset generated with stable diffusion. arXiv:2211.01777, 2022

  49. [49]

    High-Resolution Image Reconstruction With Latent Diffusion Models From Human Brain Activity

    Yu Takagi and Shinji Nishimoto. High-Resolution Image Reconstruction With Latent Diffusion Models From Human Brain Activity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14453–14463, 2023

  50. [50]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971, 2023

  51. [51]

    Boosting gui prototyping with diffusion models

    Jialiang Wei, Anne-Lise Courbis, Thomas Lambolais, Binbin Xu, Pierre Louis Bernard, and Gérard Dray. Boosting gui prototyping with diffusion models. arXiv:2306.06233, 2023

  52. [52]

    Byt5: Towards a token-free future with pre-trained byte-to-byte models

    Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. Byt5: Towards a token-free future with pre-trained byte-to-byte models, 2022

  53. [53]

    Scaling autoregressive models for content-rich text-to-image generation

    Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation, 2022

  54. [54]

    Adding Conditional Control to Text-to-Image Diffusion Models

    Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv:2302.05543, 2023

  55. [55]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric, 2018