Image & Video

JoyAI-Image-Edit released

The 24B-parameter model combines an 8B MLLM with a 16B diffusion transformer for unified image understanding and editing.

Deep Dive

JD.com's research division has open-sourced JoyAI-Image-Edit, a significant new multimodal foundation model that unifies image understanding, generation, and editing into a single 24-billion-parameter system. The architecture cleverly pairs an 8-billion-parameter Multimodal Large Language Model (MLLM) for comprehension with a 16-billion-parameter Multimodal Diffusion Transformer (MMDiT) for creation. This closed-loop design is the model's core innovation: the understanding module parses scenes and decomposes complex user instructions, while the generation module executes the edits, with each process continuously informing and improving the other.
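The closed-loop flow described above — the MLLM decomposing an instruction into grounded steps, the MMDiT executing each one, and the result feeding back into understanding — can be sketched in miniature. This is a toy illustration, not the model's actual interface: `mllm_decompose`, `mmdit_execute`, and the keyword-based parsing are all hypothetical stand-ins for the 8B and 16B modules.

```python
from dataclasses import dataclass

@dataclass
class EditStep:
    operation: str  # what to do (e.g. "enlarge")
    region: str     # where to do it (e.g. "the dog on the left")

def mllm_decompose(instruction: str) -> list[EditStep]:
    """Stand-in for the 8B MLLM: split a compound instruction into
    grounded single-region edit steps (toy verb-first parser)."""
    steps = []
    for clause in instruction.split(" and "):
        verb, region = clause.strip().split(" ", 1)
        steps.append(EditStep(operation=verb, region=region))
    return steps

def mmdit_execute(image: dict, step: EditStep) -> dict:
    """Stand-in for the 16B MMDiT: apply one edit and return the
    updated image state for the MLLM to re-inspect."""
    edited = dict(image)
    edited[step.region] = step.operation
    return edited

def closed_loop_edit(image: dict, instruction: str) -> dict:
    # Understanding module plans; generation module executes; in the
    # real system each intermediate result would be re-parsed by the
    # MLLM to refine the remaining steps (closed-loop collaboration).
    for step in mllm_decompose(instruction):
        image = mmdit_execute(image, step)
    return image

result = closed_loop_edit({}, "enlarge the dog on the left and recolor the chair")
```

The point of the sketch is the control flow, not the parsing: the generator never sees the raw compound instruction, only the decomposed, spatially grounded steps produced by the understanding module.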

Unlike standard image generators, JoyAI-Image-Edit specializes in precise, instruction-based editing of existing images. It demonstrates strong spatial reasoning, allowing users to specify edits to particular regions (e.g., "make the dog on the left bigger") or perform complex relational changes (e.g., "swap the positions of the chair and the table"). The model's ability to ground instructions in specific image areas and decompose multi-step requests makes it a powerful tool for controllable content modification, moving beyond simple text-to-image generation to more nuanced AI-assisted editing workflows. The full model, research paper, and code are available on Hugging Face, arXiv, and GitHub, respectively.

Key Points
  • Unified 24B parameter model combining an 8B MLLM for understanding with a 16B MMDiT for generation and editing.
  • Enables precise, instruction-guided edits to specific regions of an image using natural language commands.
  • Features closed-loop collaboration where spatial understanding improves editing, and generative changes provide evidence for better reasoning.

Why It Matters

This advances AI from simple image creation to precise, controllable editing, enabling more practical applications in design and content workflows.