How to Limit GPU Power and Clock Speeds for AI Inference with a Dockerised Controller
Running multiple GPUs for local AI inference at stock power settings is wasteful. Consumer cards like the RTX 3090 draw 350W by default, but AI inference workloads are typically memory-bandwidth bound, not compute bound. The GPU cores spend much of their time waiting for data, burning power for no performance