I built and trained a "drawing to image" model from scratch that runs fully locally (inference on the client CPU)
A developer trained a DiT model on a single RTX 4070 GPU that now runs inference entirely in your browser.
Developer Amin has open-sourced 'tiny-models,' a drawing-to-image AI system that demonstrates the feasibility of high-quality, on-device generative AI. The project centers on a small-scale Diffusion Transformer (DiT) model trained from scratch on a single consumer-grade RTX 4070 GPU. For inference, the model runs entirely locally in a web browser on the client's CPU, with no cloud GPUs required, a significant departure from standard practice.
Technically, the model makes several efficiency-focused design choices. It is trained with flow matching rather than standard diffusion, which reportedly converged faster. User drawings are converted into per-pixel one-hot tensors over semantic classes (such as 'sky' or 'tree'), which condition the model. Crucially, it operates directly in pixel space, eliminating the computational overhead of a separate image encoder/decoder. The project also incorporates insights from the recent JiT paper, which argues that predicting noise (an 'off-manifold' task) is suboptimal. Instead, Amin's model is trained to predict the image directly, with the loss computed in flow-velocity space, a change that reportedly 'significantly improved' output quality.
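As a concrete illustration of the conditioning format, here is a minimal sketch of the per-pixel one-hot encoding. The four-class palette and the function name are hypothetical; the repository's actual class list and tensor layout may differ.

```python
import numpy as np

# Hypothetical class palette for illustration; the repo's labels may differ.
CLASSES = ["sky", "tree", "water", "ground"]

def drawing_to_onehot(label_map: np.ndarray) -> np.ndarray:
    """Convert an (H, W) map of integer class IDs into a
    (num_classes, H, W) one-hot tensor used to condition the model."""
    h, w = label_map.shape
    onehot = np.zeros((len(CLASSES), h, w), dtype=np.float32)
    # Advanced indexing: set channel label_map[i, j] to 1 at pixel (i, j).
    onehot[label_map, np.arange(h)[:, None], np.arange(w)] = 1.0
    return onehot
```

Each pixel thus carries exactly one active channel, which gives the DiT an unambiguous semantic layout to translate into an image.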
The implications are substantial for the edge AI field. The project demonstrates that capable, interactive generative models can run on consumer hardware without internet dependency, reducing latency and cost while keeping user data on-device. While the current GitHub Pages demo is slower due to a lack of WASM multithreading support, the underlying code shows a path toward more accessible and efficient client-side AI applications.
- Trained a DiT model from scratch on a single RTX 4070 consumer GPU using flow matching for faster convergence.
- Runs inference fully locally in a web browser on the client's CPU, with no cloud API calls required.
- Implements JiT paper insights by training the model to predict images directly, improving output quality over noise prediction.
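The training recipe in the bullets above can be sketched as a single loss function. This assumes the common linear-interpolation (rectified-flow) convention x_t = (1−t)·x + t·ε; the function name, model signature, and timestep sampling are illustrative assumptions, not the repository's actual API.

```python
import numpy as np

def flow_matching_loss(model, x, cond, rng):
    """Flow-matching loss where the model predicts the clean image
    and the error is measured in velocity space (JiT-style).
    x: clean images, shape (B, C, H, W). cond: conditioning tensor."""
    # Assumed convention: x_t = (1 - t) * x + t * eps; t > 0 avoids division by zero.
    t = rng.uniform(1e-3, 1.0, size=(x.shape[0], 1, 1, 1))
    eps = rng.standard_normal(x.shape)        # noise endpoint of the flow
    x_t = (1 - t) * x + t * eps               # linear interpolant
    v_target = eps - x                        # true velocity dx_t/dt
    x_pred = model(x_t, t, cond)              # model predicts the image directly
    eps_pred = (x_t - (1 - t) * x_pred) / t   # noise implied by the prediction
    v_pred = eps_pred - x_pred                # implied velocity
    return np.mean((v_pred - v_target) ** 2)  # loss computed in velocity space
```

Note the key point from the summary: the network's output stays on the image manifold, and only the loss is mapped into velocity space, rather than asking the network to regress off-manifold noise.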
Why It Matters
Shows a viable path for privacy-preserving, low-latency generative AI that runs entirely on a user's own device, without cloud costs.