Abstract
This technical report introduces PIXART-δ, a text-to-image synthesis framework that integrates the Latent Consistency Model (LCM) and ControlNet into the advanced PIXART-α model. PIXART-α is recognized for its ability to generate high-quality images of 1024px resolution through a remarkably efficient training process. The integration of LCM in PIXART-δ significantly accelerates inference, enabling the production of high-quality images in just 2-4 steps. Notably, PIXART-δ achieves a breakthrough 0.5 seconds for generating 1024x1024 pixel images, marking a 7x improvement over PIXART-α. Additionally, PIXART-δ is designed to be efficiently trainable on 32GB V100 GPUs within a single day. With its 8-bit inference capability (von Platen et al., 2023), PIXART-δ can synthesize 1024px images within 8GB GPU memory constraints, greatly enhancing its usability and accessibility. Furthermore, incorporating a ControlNet-like module enables fine-grained control over text-to-image diffusion models. We introduce a novel ControlNet-Transformer architecture, specifically tailored for Transformers, achieving explicit controllability alongside high-quality image generation. As a state-of-the-art, open-source image generation model, PIXART-δ offers a promising alternative to the Stable Diffusion family of models, contributing significantly to text-to-image synthesis.
Paper: https://arxiv.org/abs/2401.05252
Code: https://github.com/PixArt-alpha/PixArt-alpha
Demo: https://huggingface.co/spaces/PixArt-alpha/PixArt-LCM
Project Page: https://pixart-alpha.github.io/
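For anyone who wants to try the 2-4 step sampling the abstract mentions, here is a rough sketch. It is not taken from the paper. It assumes the Hugging Face diffusers library with its `PixArtAlphaPipeline` class, a CUDA GPU, and that `PixArt-alpha/PixArt-LCM-XL-2-1024-MS` is the right LCM checkpoint id; the helper names `lcm_call_kwargs` and `generate` are made up for illustration. Setting the guidance scale to 0 is also an assumption, based on how LCM-distilled checkpoints are usually sampled.

```python
"""Hedged sketch: few-step sampling with an LCM-distilled PixArt checkpoint."""

# Assumption: this is the LCM checkpoint id on the Hugging Face Hub.
ASSUMED_MODEL_ID = "PixArt-alpha/PixArt-LCM-XL-2-1024-MS"


def lcm_call_kwargs(steps: int = 4) -> dict:
    """Build sampling arguments for the 2-4 step regime the abstract describes."""
    if not 2 <= steps <= 4:
        raise ValueError("the abstract reports high-quality images at 2-4 steps")
    # Assumption: classifier-free guidance is switched off for LCM sampling.
    return {"num_inference_steps": steps, "guidance_scale": 0.0}


def generate(prompt: str, steps: int = 4):
    """Load the pipeline and return one PIL image (needs torch, diffusers, a GPU)."""
    # Heavy imports stay local so lcm_call_kwargs works without a GPU stack.
    import torch
    from diffusers import PixArtAlphaPipeline

    pipe = PixArtAlphaPipeline.from_pretrained(
        ASSUMED_MODEL_ID, torch_dtype=torch.float16
    )
    pipe = pipe.to("cuda")
    return pipe(prompt, **lcm_call_kwargs(steps)).images[0]
```

Calling `generate("a small cactus wearing a straw hat").save("out.png")` should write one image if the assumptions above hold. Timings will depend on your GPU, so don't expect to match the 0.5 second figure exactly.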
I wonder if this model would see an increase in quality if it were trained on the same quantity of data as something like Stable Diffusion. It is impressive, but I’ve found that it lacks knowledge of some things that Stable Diffusion understands.
Yeah, some fine-tunes of this would be great. It’d be nice to have options for a base model other than Stable Diffusion.
New Lemmy Post: PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models (https://lemmy.dbzer0.com/post/12140089)