DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation

2025 · cs.CV · arXiv 2511.19365

16 Pith papers cite this work. Polarity classification is still indexing.

abstract

Pixel diffusion aims to generate images directly in pixel space in an end-to-end fashion. This approach avoids the limitations of the VAE in two-stage latent diffusion, offering higher model capacity. Existing pixel diffusion models suffer from slow training and inference, as they usually model both high-frequency signals and low-frequency semantics within a single diffusion transformer (DiT). To pursue a more efficient pixel diffusion paradigm, we propose the frequency-DeCoupled pixel diffusion framework (DeCo). Guided by the intuition that the generation of high- and low-frequency components should be decoupled, we use a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance from the DiT. This frees the DiT to specialize in modeling low-frequency semantics. In addition, we introduce a frequency-aware flow-matching loss that emphasizes visually salient frequencies while suppressing insignificant ones. Extensive experiments show that DeCo achieves superior performance among pixel diffusion models, attaining an FID of 1.62 (256x256) and 2.22 (512x512) on ImageNet, closing the gap with latent diffusion methods. Furthermore, our pretrained text-to-image model achieves a leading overall score of 0.86 on GenEval in system-level comparison. Code is publicly available at https://github.com/Zehong-Ma/DeCo.

citation-role summary

background: 1 · baseline: 1

citation-polarity summary

background: 1 · baseline: 1

representative citing papers

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

eess.AS · 2026-06-02 · unverdicted · novelty 8.0

WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.

Coevolving Representations in Joint Image-Feature Diffusion

cs.CV · 2026-04-19 · unverdicted · novelty 7.0

CoReDi coevolves semantic representations with the diffusion model via a jointly learned linear projection stabilized by stop-gradient, normalization, and regularization, yielding faster convergence and higher sample quality than fixed-representation baselines.

PixelU: A U-Shaped Transformer for Efficient End-to-End Pixel Diffusion

cs.CV · 2026-06-26 · unverdicted · novelty 6.0

PixelU is a minimalist U-shaped Diffusion Transformer for pixel-space diffusion that decouples frequencies with zero-cost skip connections and constant-channel downsampling, outperforming baselines like JiT-G at 1/3 the compute cost with FID 1.63 on ImageNet 256x256.

BareWave: Waveform-Native Flow-Matching Text-to-Speech

eess.AS · 2026-06-08 · unverdicted · novelty 6.0

BareWave develops a waveform-native flow-matching framework for direct text-to-waveform TTS using representation alignment, staged noise scheduling, and velocity-aware perceptual alignment to achieve strong zero-shot voice cloning results.

Diffusion Image Generation with Explicit Modeling of Data Manifold Geometry

cs.CV · 2026-05-25 · unverdicted · novelty 6.0

MIND integrates discrete patch tokenization into diffusion score functions via soft top-k and dual-branch layers, achieving FID 22.73 (no guidance) and 2.06 (with guidance) on ImageNet-256 after 80 epochs, outperforming DiT and larger LlamaGen models.

L2P: Unlocking Latent Potential for Pixel Generation

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

L2P repurposes pre-trained LDMs for direct pixel generation via large-patch tokenization and shallow-layer training on synthetic data, matching source performance with 8-GPU training and enabling native 4K output.

FREPix: Frequency-Heterogeneous Flow Matching for Pixel-Space Image Generation

cs.CV · 2026-05-07 · unverdicted · novelty 6.0

FREPix achieves competitive FID scores on ImageNet by decomposing image generation into separate low- and high-frequency paths within a flow matching framework.

CoD-Lite: Real-Time Diffusion-Based Generative Image Compression

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

CoD-Lite delivers real-time generative image compression via a lightweight convolution-based diffusion codec with compression-oriented pre-training and distillation, achieving substantial bitrate savings.

PixelGen: Improving Pixel Diffusion with Perceptual Supervision

cs.CV · 2026-02-02 · accept · novelty 6.0

PixelGen augments pixel diffusion with gated perceptual supervision to reach FID 5.11 on ImageNet-256 and GenEval 0.79 in text-to-image, narrowing the gap to latent methods without VAEs.

Trajectory Forcing: Structure-First Generation with Controllable Semantic Trajectories

cs.CV · 2026-06-21 · unverdicted · novelty 5.0

Trajectory Forcing makes generative image synthesis trajectory-centric by organizing it into decodable semantic stages derived from clustered visual representations and trained with one-step flow-matching models.

SpectralDiT: Timestep-Conditioned Spectral Residual Correction for Flow-Matching DiTs

cs.CV · 2026-06-17 · unverdicted · novelty 5.0

SpectralDiT adds timestep-conditioned spectral residual correction to flow-matching DiTs and reports FID reductions of 5% on CIFAR-10 and 8.7% relative on ImageNet-100 latent diffusion with under 2% added parameters.

PixIE: Prompted Pixel-Space Low-Light Image Enhancement

cs.CV · 2026-05-22 · unverdicted · novelty 5.0

PixIE proposes a feed-forward pixel-space low-light image enhancement network using DINO-prompted pixel blocks, spatial-channel compaction, and multi-receptive-field embeddings, claiming 1.9-15.0% PSNR gains and 8.5-44.4% LPIPS reductions on benchmarks.

FrequencyBooster: Full-Frequency Modeling for High-Fidelity Pixel Diffusion

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

FrequencyBooster reports state-of-the-art FID scores of 1.60 at 256x256 and 1.69 at 512x512 for pixel diffusion by using a specialized decoder for full-frequency modeling.

HDRFace: Rethinking Face Restoration with High-Dimensional Representation

cs.CV · 2026-05-14 · unverdicted · novelty 5.0

HDRFace injects high-dimensional facial features from low-quality and intermediate images into diffusion models via SDFM fusion, reporting gains on SD V2.1-base and Qwen-Image.

HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion

cs.CV · 2026-05-15 · unverdicted · novelty 4.0 · 2 refs

HyperDiT reports FID 1.56 on ImageNet 256x256 using hyper-connected cross-scale attention, SA-RoPE, and VFM registers in pixel space.

Why Do DiT Editors Drift? Plug-and-Play Low Frequency Alignment in VAE Latent Space

cs.CV · 2026-05-07 · unverdicted · novelty 4.0

VAE-LFA suppresses semantic drift in multi-turn DiT image editing by low-pass filtering latent discrepancies and aligning low-frequency components to an EMA of previous rounds in VAE space.

citing papers explorer

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling
eess.AS · 2026-06-02 · unverdicted · none · ref 54 · internal anchor
Coevolving Representations in Joint Image-Feature Diffusion
cs.CV · 2026-04-19 · unverdicted · none · ref 29 · internal anchor
PixelU: A U-Shaped Transformer for Efficient End-to-End Pixel Diffusion
cs.CV · 2026-06-26 · unverdicted · none · ref 29 · internal anchor
BareWave: Waveform-Native Flow-Matching Text-to-Speech
eess.AS · 2026-06-08 · unverdicted · none · ref 22 · internal anchor
Diffusion Image Generation with Explicit Modeling of Data Manifold Geometry
cs.CV · 2026-05-25 · unverdicted · none · ref 46 · internal anchor
L2P: Unlocking Latent Potential for Pixel Generation
cs.CV · 2026-05-12 · unverdicted · none · ref 15 · internal anchor
FREPix: Frequency-Heterogeneous Flow Matching for Pixel-Space Image Generation
cs.CV · 2026-05-07 · unverdicted · none · ref 22 · internal anchor
CoD-Lite: Real-Time Diffusion-Based Generative Image Compression
cs.CV · 2026-04-14 · unverdicted · none · ref 11 · internal anchor
PixelGen: Improving Pixel Diffusion with Perceptual Supervision
cs.CV · 2026-02-02 · accept · none · ref 14 · internal anchor
Trajectory Forcing: Structure-First Generation with Controllable Semantic Trajectories
cs.CV · 2026-06-21 · unverdicted · none · ref 33 · internal anchor
SpectralDiT: Timestep-Conditioned Spectral Residual Correction for Flow-Matching DiTs
cs.CV · 2026-06-17 · unverdicted · none · ref 13 · internal anchor
PixIE: Prompted Pixel-Space Low-Light Image Enhancement
cs.CV · 2026-05-22 · unverdicted · none · ref 33 · internal anchor
FrequencyBooster: Full-Frequency Modeling for High-Fidelity Pixel Diffusion
cs.CV · 2026-05-18 · unverdicted · none · ref 15 · internal anchor
HDRFace: Rethinking Face Restoration with High-Dimensional Representation
cs.CV · 2026-05-14 · unverdicted · none · ref 14 · internal anchor
HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion
cs.CV · 2026-05-15 · unverdicted · none · ref 9 · 2 links · internal anchor
Why Do DiT Editors Drift? Plug-and-Play Low Frequency Alignment in VAE Latent Space
cs.CV · 2026-05-07 · unverdicted · none · ref 20 · internal anchor
