Abstract

As cutting-edge Text-to-Image (T2I) generation models already excel at producing remarkable single images, an even more challenging task, i.e., multi-turn interactive image generation, has begun to attract the attention of the research community. This task requires models to interact with users over multiple turns to generate a coherent sequence of images. However, since users may switch subjects frequently, current efforts struggle to maintain subject consistency while generating diverse images. To address this issue, we introduce a training-free multi-agent framework called AutoStudio. AutoStudio employs three agents based on large language models (LLMs) to handle interactions, along with a Stable Diffusion (SD)-based agent for generating high-quality images. Specifically, AutoStudio consists of (i) a subject manager to interpret interaction dialogues and manage the context of each subject, (ii) a layout generator to generate fine-grained bounding boxes to control subject locations, (iii) a supervisor to provide suggestions for layout refinements, and (iv) a drawer to complete image generation. Furthermore, we introduce a Parallel-UNet to replace the original UNet in the drawer, which employs two parallel cross-attention modules for exploiting subject-aware features. We also introduce a subject-initialized generation method to better preserve small subjects. AutoStudio can thereby generate a sequence of multi-subject images interactively and consistently. Extensive experiments on the public CMIGBench benchmark and human evaluations show that AutoStudio maintains multi-subject consistency well across multiple turns, and that it improves on the state of the art by 13.65% in average Fréchet Inception Distance and 2.83% in average character-character similarity.
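To make the four-agent data flow in the abstract easier to picture, here is a toy sketch. It is not the AutoStudio code or API: every class and function name is hypothetical, the LLM agents are replaced by trivial rules, and the drawer returns a text description instead of running a diffusion model. It only illustrates how a subject manager, layout generator, supervisor, and drawer might hand data to one another across turns.

```python
from dataclasses import dataclass, field

# All names below are hypothetical stand-ins, not the AutoStudio API.
Box = tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


@dataclass
class SubjectManager:
    """Stands in for the LLM agent that tracks each subject's context."""
    known: dict[str, int] = field(default_factory=dict)  # name -> stable subject id

    def update(self, subjects: list[str]) -> dict[str, int]:
        for name in subjects:
            # A subject seen in an earlier turn keeps the id it already has.
            self.known.setdefault(name, len(self.known))
        return {name: self.known[name] for name in subjects}


def generate_layout(subjects: list[str], size: int = 512) -> dict[str, Box]:
    """Stands in for the layout generator: one equal-width column per subject."""
    width = size // max(len(subjects), 1)
    # Deliberately overshoot the bottom edge so the supervisor has something to fix.
    return {s: (i * width, size // 4, (i + 1) * width, size + 32)
            for i, s in enumerate(subjects)}


def supervise(layout: dict[str, Box], size: int = 512) -> dict[str, Box]:
    """Stands in for the supervisor: clamp every box to the canvas."""
    def clamp(v: int) -> int:
        return min(max(v, 0), size)
    return {s: tuple(clamp(v) for v in box) for s, box in layout.items()}


def draw(ids: dict[str, int], layout: dict[str, Box]) -> str:
    """Stands in for the diffusion-based drawer: returns a text description."""
    return "; ".join(f"subject#{ids[s]} {s} at {layout[s]}" for s in layout)


def run_turn(manager: SubjectManager, subjects: list[str]) -> str:
    ids = manager.update(subjects)
    layout = supervise(generate_layout(subjects))
    return draw(ids, layout)


manager = SubjectManager()
turn1 = run_turn(manager, ["girl", "dog"])
turn2 = run_turn(manager, ["dog", "cat"])
print(turn1)
print(turn2)
```

In this sketch the dog keeps the same subject id in both turns even though the other subject changes, which is the kind of cross-turn bookkeeping the abstract attributes to the subject manager.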

Paper: https://arxiv.org/abs/2406.01388

Code: https://github.com/donahowe/AutoStudio (coming soon)

Project Page: https://howe183.github.io/AutoStudio.io/

    • felsiq@lemmy.zip
      5 months ago

      I hadn’t, and it was definitely worth reading so thanks for the link. I’m still not sure exactly where I stand on the big AI companies relentlessly scraping everything for training data, but that was very convincing that copyright laws aren’t the solution (and I already believed better labour laws were needed for artists, though the details of exactly how music artists are getting shafted were new to me). Thanks for the link!

      • Even_Adder@lemmy.dbzer0.com (OP)
        5 months ago

        Big companies own large swathes of internet content already. The worst thing that could happen to us is new laws allowing them to band together into a data cartel that would keep us from benefitting from our shared world culture in the same way they exploit it for profit. We have everything to lose, because the second it becomes profitable, they’ll burn it all down, like they do with countless live service games, entire social networks, and movies.