Go developer launches cgo-free CUDA binding gocudrv for ML workloads
No cgo, no toolkit at build time – purego loads CUDA at runtime.
A Go developer known as u/Eitamr has released gocudrv, a proof-of-concept CUDA Driver API binding for Go that eliminates the need for cgo. Instead, it uses purego to load libcuda.so at runtime, which enables cross-compilation and smaller Docker images and addresses a common pain point for ML tooling. The biggest challenge was that CUDA's per-thread context model conflicts with Go's goroutines, which the scheduler can move between OS threads. The solution is a simple executor that locks an OS thread via runtime.LockOSThread and funnels all driver calls through a single channel, ensuring consistent context affinity.
Early benchmarks on an RTX 4070 Ti show a 10M-element vector add taking 434μs of GPU elapsed time (measured via events), while CPU-side measurements reported ~160μs, a discrepancy the author attributes to GPU driver overhead. The project is still in its infancy, with support for basic memory allocation, streams, events, and kernel launches. u/Eitamr is gradually adding CUDA Graphs and multi-GPU support, and invites testing on NVIDIA 5xxx series cards. The repo is at github.com/eitamring/gocudrv, and the author notes it is more of a learning exercise than production-ready code.
- gocudrv uses purego to load libcuda.so at runtime, eliminating cgo and enabling cross-compilation.
- Thread affinity is solved by locking an OS thread and routing all CUDA calls through a channel-based executor.
- 10M-element vector add benchmark on RTX 4070 Ti: 434μs GPU event time vs ~160μs CPU timer.
Why It Matters
A cgo-free Go CUDA binding simplifies ML tooling builds and Docker images, opening Go to GPU acceleration.