Image & Video

Ace Step 1.5 XL ComfyUI automation workflow (no Ollama) that generates random tags with Qwen, generates a song, then rates it via waveform analysis

A novel AI pipeline automates music creation and critique by analyzing audio waveforms as images.

Deep Dive

A developer has engineered a sophisticated, automated AI music pipeline within the ComfyUI visual programming environment. The workflow uses the Ace Step 1.5 XL model to generate music, with a key innovation: a Qwen large language model (LM) randomizes the descriptive tags, or "prompts," for each generation run, ensuring varied output without manual intervention. More significantly, the system also automates the critique phase, a traditionally subjective human task.
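As a rough illustration of the tag-randomization step, the sketch below prompts a small Qwen instruct checkpoint through the Hugging Face transformers library. The model name, prompt wording, and sampling settings are assumptions for illustration; the actual workflow uses dedicated ComfyUI nodes rather than a standalone script.

```python
# Sketch: asking a local Qwen LM to invent random style tags for one
# Ace Step generation run. Model, prompt, and settings are assumptions.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-1.5B-Instruct",  # any small instruct model works
)

prompt = (
    "Invent six random, comma-separated music style tags covering "
    "genre, mood, instrumentation, and tempo. Reply with tags only."
)
result = generator(
    prompt,
    max_new_tokens=48,
    do_sample=True,       # sampling keeps each run's tags different
    temperature=1.0,
    return_full_text=False,
)
tags = result[0]["generated_text"].strip()
print(tags)  # fed to Ace Step as the song's tag/prompt string
```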

After a song is generated, the workflow segments the audio and converts each segment into a visual waveform image. These images are fed into a Qwen vision-language model (VL), which is prompted to subjectively analyze the waveforms and assign a letter-grade rating (e.g., A+, B). The rating is then used to automatically name and sort the output files. The creator reports that songs rated A+ were perceptibly better than B-rated ones, validating the concept. Unlike server-dependent setups such as Ollama, the Qwen models run directly inside ComfyUI, making the entire process self-contained.
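A minimal sketch of that critique loop, assuming numpy and matplotlib, is given below. The synthetic sine wave stands in for an Ace Step render, the ten-second segment length is an assumption, and grade_waveform is a stub where the real workflow would prompt the Qwen VL model with each image.

```python
# Sketch: segment a song, render each segment as a waveform image, and
# collect a letter grade per segment. The VL call is stubbed out.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for batch rendering
import matplotlib.pyplot as plt

SAMPLE_RATE = 44_100
SEGMENT_SECONDS = 10  # assumption; the workflow's segment size may differ

def render_waveform(segment: np.ndarray, path: str) -> None:
    """Save one audio segment as an image a vision model can inspect."""
    fig, ax = plt.subplots(figsize=(8, 2))
    ax.plot(segment, linewidth=0.3)
    ax.axis("off")
    fig.savefig(path, dpi=100, bbox_inches="tight")
    plt.close(fig)

def grade_waveform(image_path: str) -> str:
    """Stub: the real workflow prompts Qwen VL with this image."""
    return "A+"

# Stand-in for a generated song (the real audio comes from Ace Step).
t = np.linspace(0, 30, 30 * SAMPLE_RATE)
audio = np.sin(2 * np.pi * 220 * t) * np.exp(-t / 20)

step = SEGMENT_SECONDS * SAMPLE_RATE
grades = []
for i, start in enumerate(range(0, len(audio), step)):
    image_path = f"segment_{i:02d}.png"
    render_waveform(audio[start:start + step], image_path)
    grades.append(grade_waveform(image_path))
print(grades)  # e.g. ['A+', 'A+', 'A+']
```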

Key Points
  • Uses Qwen LM & VL models within ComfyUI to automate tag randomization and song rating, avoiding external servers.
  • Innovatively rates music by converting audio into waveform images for AI vision analysis, assigning grades like A+ or B.
  • Creates a closed-loop system where the rating directly names output files, enabling automated quality sorting (see the sketch after this list).
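As a sketch of that closed-loop naming, the hypothetical helper below prefixes each output file with a numeric rank derived from its grade, so a plain alphabetical directory listing doubles as a quality ranking. The grade scale and filename pattern are assumptions, not the workflow's exact convention.

```python
# Sketch: bake the grade into the filename so files sort by quality.
# The grade scale and naming pattern are illustrative assumptions.
GRADE_ORDER = ["A+", "A", "B+", "B", "C"]  # best grade first

def graded_name(grade: str, index: int, ext: str = "flac") -> str:
    # A numeric rank prefix sorts correctly even though "A+" and "B"
    # don't order naturally as raw strings.
    rank = GRADE_ORDER.index(grade)
    return f"{rank}_{grade.replace('+', 'plus')}_song_{index:04d}.{ext}"

print(graded_name("A+", 1))  # 0_Aplus_song_0001.flac
print(graded_name("B", 2))   # 3_B_song_0002.flac
```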

Why It Matters

This demonstrates a move toward fully autonomous, self-evaluating AI creative systems, reducing the human bottleneck in iterative generation.