GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Aditya Ramesh, Alex Nichol, Bob McGrew, Pamela Mishkin, Prafulla Dhariwal, Pranav Shyam · 2021 · cs.CV · arXiv 2112.10741

Canonical reference. 81% of citing Pith papers cite this work as background.

134 Pith papers citing it

Background 81% of classified citations

abstract

Diffusion models have recently been shown to generate high-quality synthetic images, especially when paired with a guidance technique to trade off diversity for fidelity. We explore diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies: CLIP guidance and classifier-free guidance. We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples. Samples from a 3.5 billion parameter text-conditional diffusion model using classifier-free guidance are favored by human evaluators to those from DALL-E, even when the latter uses expensive CLIP reranking. Additionally, we find that our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing. We train a smaller model on a filtered dataset and release the code and weights at https://github.com/openai/glide-text2im.
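The classifier-free guidance strategy named in the abstract can be illustrated with a minimal sketch. It assumes the commonly used formulation in which a conditional and an unconditional noise prediction are combined as `eps_uncond + scale * (eps_cond - eps_uncond)`. The function name, the flat-list representation of predictions, and the example values are hypothetical and chosen only for illustration; a real model would operate on image-shaped tensors.

```python
def classifier_free_guidance(eps_cond, eps_uncond, scale):
    """Combine two noise predictions element-wise.

    eps_cond   -- prediction given the text caption (hypothetical flat list)
    eps_uncond -- prediction given an empty caption (same length)
    scale      -- guidance scale; 1.0 returns eps_cond unchanged, larger
                  values push further along the (eps_cond - eps_uncond) direction
    """
    if len(eps_cond) != len(eps_uncond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Illustrative values only, not taken from any model.
eps_cond = [0.5, -1.0, 2.0]
eps_uncond = [0.0, -0.5, 1.0]

unguided = classifier_free_guidance(eps_cond, eps_uncond, 1.0)
guided = classifier_free_guidance(eps_cond, eps_uncond, 3.0)
print(unguided)
print(guided)
```

Raising the scale above 1 is what trades diversity for fidelity in this formulation: the sample is pushed harder toward what the caption implies, at the cost of less varied outputs.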

citation-role summary

background 23 · baseline 2 · method 2

citation-polarity summary

background 22 · baseline 2 · use method 2 · support 1

authors

Aditya Ramesh, Alex Nichol, Bob McGrew, Pamela Mishkin, Prafulla Dhariwal, Pranav Shyam

representative citing papers

Consistency Models

cs.LG · 2023-03-02 · conditional · novelty 8.0

Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

cs.LG · 2022-09-07 · unverdicted · novelty 8.0

Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.

Prompt-to-Prompt Image Editing with Cross Attention Control

cs.CV · 2022-08-02 · unverdicted · novelty 8.0

Cross-attention control in text-conditioned models enables localized and global image edits by editing only the input text prompt.

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

cs.CV · 2022-08-02 · unverdicted · novelty 8.0

Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.

Adversarial Domain Prompt Tuning and Generation for Single Domain Generalization

cs.CV · 2026-06-19 · unverdicted · novelty 7.0

PAPT uses adversarial prompt tuning on diffusion models to generate domain-style images while preserving category features, claiming superior single-domain generalization performance.

FlowBender: Feedback-Aware Training for Self-Correcting Conditional Flows

cs.CV · 2026-06-18 · unverdicted · novelty 7.0

FlowBender introduces closed-loop training that lets conditional flow models learn correction policies from their own task-specific alignment errors, outperforming supervised and guidance baselines on fidelity and plausibility.

Towards More General Control of Diffusion Models Using Jeffrey Guidance

cs.LG · 2026-06-11 · unverdicted · novelty 7.0

Jeffrey guidance applies Jeffrey's rule of conditioning to diffusion models to target prescribed marginal distributions while preserving conditional structure, demonstrated via embedding matching and fairness enforcement.

Adv-TGD: Adversarial Text-Guided Diffusion for Face Recognition Impersonation Attacks

cs.CV · 2026-06-10 · unverdicted · novelty 7.0

Adv-TGD is a text-guided diffusion attack that achieves 85.9% black-box ASR on four face recognition models while preserving PSNR 28.18 dB and SSIM 0.981.

MaskAlign: Token-Subset Representation Alignment for Efficient Diffusion Training

cs.CV · 2026-06-07 · unverdicted · novelty 7.0

MaskAlign uses random token-subset alignment and pre-mask mixing to reduce diffusion models' reliance on complete clean-image token sets during representation alignment.

AsyncPatch Diffusion: spatially-flexible image generation

cs.CV · 2026-06-05 · unverdicted · novelty 7.0

AsyncPatch Diffusion introduces asynchronous per-region noise levels in diffusion models, proves a valid ELBO, and uses a controlled sampler to support spatially adaptive generation and native inpainting.

DRIFT: From Robustness Gaps to Invariance Manifolds for AI-Generated Image Detection

cs.CV · 2026-06-05 · unverdicted · novelty 7.0

DRIFT learns a structured invariance manifold from real images via one-class supervision on decomposed robust and fragile subspaces of a frozen VFM to detect AI-generated images through margin violations.

Reflection Separation from a Single Image via Joint Latent Diffusion

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

A joint latent diffusion model with cross-layer self-attention and disjoint sampling separates reflection and transmission layers from single images more effectively than prior methods on real-world benchmarks.

GLENS: Global Search via Learning from Solver Iterates with Diffusion Models

cs.LG · 2026-05-29 · unverdicted · novelty 7.0

GLENS uses diffusion models on solver iterates to generate high-quality and diverse initial guesses for multimodal non-convex optimization, leading to faster solver convergence.

DRM: Diffusion-based Reward Model With Step-wise Guidance

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

DRM turns a pre-trained diffusion model into a step-wise reward model and uses it for dense RL training (Step-wise GRPO) and guided sampling to improve final image quality.

VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

VDE accelerates rectified flow models like Flux by 3.22x with LPIPS of 0.069 via velocity decomposition into parallel/orthogonal components plus periodic full-pass anchoring.

Probability-Conserving Flow Guidance

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

AdaMaG is a guidance rule for generative models derived from decomposing continuity-equation effects into divergence and score-parallel terms, with a proof that divergence diverges near the manifold and a time-dependent bound that improves realism at no extra cost.

Generating HDR Video from SDR Video

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

A multi-exposure video model predicts bracketed linear SDR sequences from single nonlinear SDR input, which a merging model combines into HDR video preserving shadow and highlight detail.

ImageAttributionBench: How Far Are We from Generalizable Attribution?

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

ImageAttributionBench is a benchmark dataset demonstrating that state-of-the-art image attribution methods lack robustness to image degradation and fail to generalize to semantically disjoint domains.

From Diffusion to Rectified Flow: Rethinking Text-Based Segmentation

cs.CV · 2026-05-06 · unverdicted · novelty 7.0

RLFSeg repurposes pretrained generative models via Rectified Flow for direct latent-space image-to-mask mapping in text-based segmentation, outperforming diffusion-based methods especially in zero-shot cases.

Learning to Theorize the World from Observation

cs.LG · 2026-05-05 · unverdicted · novelty 7.0

NEO is a probabilistic neural model that induces compositional programs as a learned Language of Thought from non-textual observations and executes them via a shared transition model to enable explanation-driven generalization.

Benchmarking Layout-Guided Diffusion Models through Unified Semantic-Spatial Evaluation in Closed and Open Settings

cs.CV · 2026-04-28 · conditional · novelty 7.0

Introduces closed-set C-Bench and open-set O-Bench for layout-guided diffusion models, a unified semantic-spatial scoring protocol, and ranks six models after generating and evaluating 319,086 images.

GENFIG1: Visual Summaries of Scholarly Work as a Challenge for Vision-Language Models

cs.CV · 2026-04-05 · unverdicted · novelty 7.0

GENFIG1 is a new benchmark that tests whether vision-language models can create effective Figure 1 visuals capturing the central scientific idea from paper text.

FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation

cs.CV · 2026-03-10 · unverdicted · novelty 7.0

FrameDiT proposes Matrix Attention for DiTs to achieve SOTA video generation with improved temporal coherence and efficiency comparable to local factorized attention.

SVG360: Editable Multiview Vector Graphics from a Single SVG

cs.CV · 2025-11-20 · unverdicted · novelty 7.0

SVG360 lifts a single SVG to a view-conditioned representation, uses spatial memory to propagate consistent parts across views, and applies structure-aware vectorization to produce editable multiview SVGs.

citing papers explorer

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

cs.CV · 2022-05-23 · accept · none · ref 41 · internal anchor

Imagen achieves state-of-the-art photorealistic text-to-image generation by scaling a text-only pretrained T5 language model within a diffusion framework, reaching FID 7.27 on COCO without training on it.

High-Resolution Image Synthesis with Latent Diffusion Models

cs.CV · 2021-12-20 · conditional · none · ref 59 · internal anchor

Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrained autoencoders with cross-attention conditioning, while cutting computational and
