gocudrv uses purego to load libcuda.so at runtime, eliminating cgo and enabling cross-compilation?

gocudrv uses purego to load libcuda.so at runtime, eliminating cgo and enabling cross-compilation.

Thread affinity solved by locking an OS thread and routing all CUDA calls through a channel-based executor?

Thread affinity solved by locking an OS thread and routing all CUDA calls through a channel-based executor.

10M vector add benchmark on RTX 4070 Ti?

434μs GPU event time vs 160μs CPU timer.

Research & Papers

Go developer launches cgo-free CUDA binding gocudrv for ML workloads

r/MachineLearning May 24, 2026

⚡No cgo, no toolkit at build time – pureGo loads CUDA at runtime.

Deep Dive

A Go developer known as u/Eitamr has released gocudrv, a proof-of-concept CUDA Driver API binding for Go that completely eliminates the need for cgo. Instead, it uses purego to load libcuda.so at runtime, enabling cross-compilation and smaller Docker images – a common pain point for ML tooling. The biggest challenge was CUDA's per-thread context model conflicting with Go goroutines. The solution: a simple executor that locks an OS thread via runtime.LockOSThread and funnels all driver calls through a single channel, ensuring consistent context affinity.

Early benchmarks on an RTX 4070 Ti show a 10M-element vector add taking 434μs GPU elapsed time (via events), while CPU-side measurements reported ~160μs – a discrepancy the author attributes to GPU driver overhead. The project is still in its infancy, with support for basic memory allocation, streams, events, and kernel launches. u/Eitamr is slowly adding CUDA Graphs and multi-GPU support, and invites testing on NVIDIA 5xxx series cards. The repo is at github.com/eitamring/gocudrv, and the author notes it's more of a learning exercise than production-ready code.

Key Points

gocudrv uses purego to load libcuda.so at runtime, eliminating cgo and enabling cross-compilation.
Thread affinity solved by locking an OS thread and routing all CUDA calls through a channel-based executor.
10M vector add benchmark on RTX 4070 Ti: 434μs GPU event time vs 160μs CPU timer.

Why It Matters

A cgo-free Go CUDA binding simplifies ML tooling builds and Docker images, opening Go to GPU acceleration.

Read Original Article

Go developer launches cgo-free CUDA binding gocudrv for ML workloads

Why It Matters

Related Articles

🚀 Stay Ahead in AI