Research & Papers

Trials and tribulations fine-tuning & deploying Gemma 4 [P]

A team's deep dive reveals silent training failures, broken LoRA adapter saving, and missing runtime LoRA serving for the new multimodal model.

Deep Dive

A technical deep dive by an ML team has exposed significant roadblocks in fine-tuning Google's newly released Gemma 4 model, revealing that standard open-source tools are not yet ready for its novel architecture. The primary issue stems from Google's implementation of custom layers for vision and audio projections, which break compatibility with popular Parameter-Efficient Fine-Tuning (PEFT) libraries. Specifically, the custom `ClippableLinear` class doesn't inherit from PyTorch's standard `nn.Linear`, causing PEFT to refuse to attach LoRA adapters even for text-only tasks. A manual workaround requires developers to unwrap these layers after loading the model weights.
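The unwrapping workaround can be sketched as a recursive module rewrite. This is a minimal, hypothetical illustration: the real `ClippableLinear` internals are not public in the article, so the wrapper below (an inner `nn.Linear` plus an assumed clamp) is a stand-in, and `unwrap_clippable` is an illustrative helper name, not a library function.

```python
import torch
import torch.nn as nn

class ClippableLinear(nn.Module):
    """Stand-in for the custom projection layer described in the article.
    Assumption: it wraps a plain nn.Linear and clamps the output."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.inner = nn.Linear(in_features, out_features)
        self.clip_value = 6.0  # assumed clipping behaviour, for illustration only

    def forward(self, x):
        return self.inner(x).clamp(-self.clip_value, self.clip_value)

def unwrap_clippable(module: nn.Module) -> nn.Module:
    """Recursively replace every ClippableLinear with its inner nn.Linear,
    in place, so PEFT's isinstance(nn.Linear) checks can match the layer.
    Weights are shared with the original wrapper, not copied."""
    for name, child in module.named_children():
        if isinstance(child, ClippableLinear):
            setattr(module, name, child.inner)
        else:
            unwrap_clippable(child)
    return module

# Toy model standing in for the loaded checkpoint.
model = nn.Sequential(ClippableLinear(8, 8), nn.ReLU(), ClippableLinear(8, 4))
unwrap_clippable(model)
```

Note that unwrapping discards the clamp in the forward pass; whether that matters for a text-only fine-tune depends on the model, so it is worth validating outputs against the wrapped version before training.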

Further down the pipeline, the team hit a critical, silent failure in the TRL library's `SFTTrainer`: it hardcodes a setting that breaks Gemma 4's key-value sharing attention mechanism, producing non-converging loss and garbage gradients. The bug is fixed only in the latest `transformers` v5.5.2. Even when training appears to succeed, saving under DeepSpeed ZeRO-3 can write LoRA adapters with zero-element tensors for half the layers, rendering the fine-tuned model useless. Most critically for deployment, major inference servers such as vLLM and SGLang currently lack runtime LoRA serving for Gemma 4's multimodal setup, forcing teams to merge weights and remap state dictionaries by hand before serving, a significant operational hurdle.
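Because the ZeRO-3 failure is silent, a cheap post-save audit of the adapter state dict can catch it before deployment. A minimal sketch, assuming the symptom described above (tensors saved with zero elements); the key names and file layout here are illustrative, not the library's guaranteed format:

```python
import torch

def audit_adapter(state_dict):
    """Return the keys of any adapter tensors that were saved empty
    (numel() == 0), e.g. shards that were never gathered under ZeRO-3."""
    return [key for key, tensor in state_dict.items() if tensor.numel() == 0]

# Simulated adapter state dict: one healthy LoRA pair, one corrupted layer.
sd = {
    "base_model.layers.0.lora_A.weight": torch.randn(8, 64),
    "base_model.layers.0.lora_B.weight": torch.randn(64, 8),
    "base_model.layers.1.lora_A.weight": torch.empty(0),  # ZeRO-3 casualty
}
bad = audit_adapter(sd)
```

In practice one would run the same check over the actual saved adapter file immediately after training, and fail the job loudly if any keys come back empty rather than discovering the corruption at serving time.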

Key Points
  • PEFT libraries fail on Gemma 4's custom `ClippableLinear` layers, requiring manual unwrapping for LoRA fine-tuning.
  • TRL's SFTTrainer caused silent training failures due to a hardcoded setting breaking KV-sharing attention; fixed in transformers v5.5.2.
  • DeepSpeed ZeRO-3 saves corrupted, empty LoRA adapters, and no inference server (vLLM/SGLang) yet supports runtime LoRA serving for the model.

Why It Matters

These hurdles create a high barrier to entry for customizing state-of-the-art multimodal AI, slowing down practical adoption and deployment for teams.