Image & Video

Inpainting in 3 commands: remove objects or add accessories with any base model, no dedicated inpaint model needed

Remove objects from photos or add accessories to portraits, all from your terminal using base models like Flux or Z-Image; no dedicated inpainting model required.

Deep Dive

Modl has introduced a novel open-source command-line interface (CLI) toolkit that radically simplifies AI-powered image inpainting. The tool allows users to perform complex edits, like removing unwanted people from a street photo or adding accessories such as sunglasses to a portrait, directly from the terminal in just three commands. Crucially, it bypasses the need for dedicated inpainting models or graphical software like Photoshop, instead leveraging popular base text-to-image models such as Flux Fill Dev and Z-Image Base. The system intelligently selects between two masking strategies depending on the task: for object removal, it uses the Qwen3-VL-8B vision model to ground the target, then passes that region to Meta's Segment Anything Model (SAM) to produce a clean silhouette mask; for adding accessories, it grounds the relevant region (e.g., "eyes") and expands the bounding box to create a mask with room for the new element.
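The accessory-masking strategy can be sketched in a few lines: grow a grounded bounding box by a margin, then rasterize it into a binary mask. This is an illustrative sketch only; the function names, the 50% margin, and the example coordinates are assumptions, not modl's actual API.

```python
# Hypothetical sketch of the expanded-bounding-box masking step.
# All names and values here are illustrative, not part of modl.

def expand_bbox(bbox, margin, width, height):
    """Grow (x0, y0, x1, y1) by `margin` (fraction of box size), clamped to the image."""
    x0, y0, x1, y1 = bbox
    dx = (x1 - x0) * margin
    dy = (y1 - y0) * margin
    return (
        max(0, int(x0 - dx)),
        max(0, int(y0 - dy)),
        min(width, int(x1 + dx)),
        min(height, int(y1 + dy)),
    )

def rasterize_mask(bbox, width, height):
    """Return a row-major binary mask: 1 inside the box, 0 elsewhere."""
    x0, y0, x1, y1 = bbox
    return [
        [1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
        for y in range(height)
    ]

# Example: an "eyes" box grounded at (40, 30)-(80, 50) in a 128x96 image,
# expanded by 50% so the inpainted sunglasses have room to render.
box = expand_bbox((40, 30, 80, 50), margin=0.5, width=128, height=96)
mask = rasterize_mask(box, 128, 96)  # box is now (20, 20, 100, 60)
```

For removal, the grounded box would instead be handed to a segmentation model such as SAM, which returns a pixel-accurate silhouette rather than a rectangle.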

The toolkit is designed for automation and integration, with every processing step outputting structured JSON. This enables users to chain commands or, more powerfully, have a large language model (LLM) agent orchestrate the entire editing workflow. The developers caution that distilled or turbo models (like Z-Image Turbo or Flux Klein) are too compressed for coherent inpainting and recommend sticking with full base models. Future development includes a `--attach-gpu` feature to run computations on a remote GPU from a local terminal, with outputs syncing back automatically. The project, which is still in early stages, is available on GitHub under modl-org/modl, inviting community feedback.
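Structured JSON at every step is what makes chaining possible: a downstream step (or an LLM agent) parses the previous step's output and builds the next command from it. A minimal sketch of that pattern follows, assuming hypothetical field names (`image_path`, `mask_path`) and a hypothetical `inpaint` subcommand; modl's real schema may differ.

```python
import json

# Hypothetical: JSON emitted by a grounding/masking step.
# Field names and the command shape below are illustrative, not modl's schema.
ground_output = json.loads(
    '{"image_path": "street.png", "mask_path": "street_mask.png", "target": "person"}'
)

def next_command(step_json, prompt):
    """Build the argv for a follow-up inpaint step from the previous step's JSON."""
    return [
        "inpaint",  # hypothetical subcommand
        "--image", step_json["image_path"],
        "--mask", step_json["mask_path"],
        "--prompt", prompt,
    ]

cmd = next_command(ground_output, "empty street, no people")
```

Because each step's output is machine-readable, the same pattern works whether the orchestrator is a shell pipeline, a script, or an LLM agent deciding the next edit.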

Key Points
  • Performs inpainting in three terminal commands using base models like Flux or Z-Image, eliminating the need for Photoshop or a dedicated inpainting model.
  • Uses two AI-driven masking strategies: Qwen3-VL-8B + SAM for clean object removal and vision grounding with expanded bounding boxes for adding accessories.
  • Outputs JSON at every step for easy piping and LLM agent control, with future plans for remote GPU execution from a local CLI.

Why It Matters

This democratizes advanced image editing, enabling automation and integration into developer workflows and AI agent systems without specialized software.