— Case Study 02
Tuning Einstein@Home
Reading the system instead of throwing hardware at it.
The setup
Einstein@Home is a distributed computing project that crunches LIGO gravitational-wave data and radio and gamma-ray pulsar-search data on volunteer hardware. You install BOINC, point it at the project, and your machine donates compute cycles when you're not using them. The score that matters is RAC (Recent Average Credit), a decaying average of credit earned per day, so it tracks sustained throughput rather than any single good day.
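To make the "decaying average" concrete, here is a minimal sketch of how a RAC-style score behaves. It assumes an exponential decay with a one-week half-life, which is how BOINC describes the metric. The function names and the simplified update rule are mine, not the client's actual code.

```python
import math

SECONDS_PER_DAY = 86_400
# Assumption: one-week half-life for the decaying average.
HALF_LIFE_SECONDS = 7 * SECONDS_PER_DAY


def update_rac(rac, credit_granted, seconds_elapsed, half_life=HALF_LIFE_SECONDS):
    """Decay the old average, then blend in the credit rate of the new interval."""
    if seconds_elapsed <= 0:
        return rac  # nothing to blend in yet
    # Weight kept by the old average after `seconds_elapsed` of decay.
    weight = math.exp(-seconds_elapsed * math.log(2) / half_life)
    credit_per_day = credit_granted / (seconds_elapsed / SECONDS_PER_DAY)
    return rac * weight + (1.0 - weight) * credit_per_day


def simulate(daily_credit, days, start_rac=0.0):
    """Apply one update per day at a constant daily credit rate."""
    rac = start_rac
    for _ in range(days):
        rac = update_rac(rac, daily_credit, SECONDS_PER_DAY)
    return rac
```

Under these assumptions, a node that starts from zero reaches half of its steady-state score after one week and about 94% after four, which is why a configuration change takes weeks to show up fully in RAC.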
I picked it as my INFT4000 (Special Topics) semester project because I had hardware I wanted to actually exercise: a Ryzen 9 9950X3D, RTX 5090 LC OC, 64GB of DDR5-6000, dual-booting Windows 11 Pro and Linux Mint 22. The thesis was straightforward — find out how high I could push this single node.
The puzzle
On paper, an RTX 5090 should be near the top of volunteer GPU leaderboards. In practice, after the initial setup on Windows, I was sitting at around 1.2M RAC/day — respectable, but not what the hardware suggested. Adding more concurrent tasks didn't help; the GPU utilization graph showed work happening, but throughput refused to scale. That gap between capability and result is what made this worth digging into.
The discovery
I started reading volunteer forums looking for setups with similar hardware. The top contributor on the Einstein@Home GPU leaderboard at the time was a Finnish volunteer running under the handle Petri33. His node was producing throughput numbers I couldn't reproduce. I reached out and asked what I was missing.
His answer wasn't about the GPU. It was about the driver model. Windows uses WDDM, the Windows Display Driver Model, which puts the OS in charge of scheduling GPU work, so CUDA submissions pass through it on their way to the hardware. That overhead is fine for desktop graphics, but for compute workloads that want to keep the GPU saturated with small kernels, it adds latency at every dispatch. On Windows, the GPU was effectively waiting for permission to work. On Linux with the proprietary NVIDIA driver there's no WDDM layer in the submission path, and on top of that you can run CUDA MPS (Multi-Process Service), which funnels work from multiple BOINC processes through one shared server so their kernels can overlap on the GPU instead of taking turns in time slices. Same hardware. Different software stack. Massive throughput difference.
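For readers who have not used MPS, a rough sketch of starting and stopping the control daemon from Python follows. The `nvidia-cuda-mps-control` binary, its `-d` flag, the `quit` command, and the two environment variables are standard NVIDIA ones. The directory paths and helper names are illustrative assumptions, not the setup used in this project. Both helpers default to a dry run that only returns the command.

```python
import os
import subprocess

# Illustrative directories; both are assumptions, not values from this case study.
PIPE_DIR = "/tmp/nvidia-mps"
LOG_DIR = "/tmp/nvidia-mps-log"


def mps_environment(pipe_dir=PIPE_DIR, log_dir=LOG_DIR):
    """Return a copy of the environment with the MPS pipe and log directories set."""
    env = dict(os.environ)
    env["CUDA_MPS_PIPE_DIRECTORY"] = pipe_dir
    env["CUDA_MPS_LOG_DIRECTORY"] = log_dir
    return env


def start_mps(dry_run=True):
    """Build (and optionally run) the command that starts the MPS control daemon."""
    cmd = ["nvidia-cuda-mps-control", "-d"]
    if not dry_run:
        subprocess.run(cmd, env=mps_environment(), check=True)
    return cmd


def stop_mps(dry_run=True):
    """Build (and optionally run) the command that asks the daemon to quit."""
    cmd = ["nvidia-cuda-mps-control"]
    if not dry_run:
        # The control program reads commands from stdin; "quit" shuts MPS down.
        subprocess.run(cmd, input="quit\n", text=True,
                       env=mps_environment(), check=True)
    return cmd
```

CUDA clients, here the BOINC science apps, only route through MPS if they see the same pipe directory as the daemon, so the BOINC client generally needs matching environment settings and a compatible user account.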
The bottleneck wasn't the GPU. It was the OS sitting between BOINC and the GPU.
The fix
I moved the workload to Linux Mint 22, installed the NVIDIA proprietary driver, enabled CUDA MPS, and set BOINC to run two concurrent O4AS tasks on the GPU. Each task finishes in ~502 seconds, so two tasks complete roughly every eight minutes, continuously, with the GPU staying close to 100% utilization the whole time.
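One common way to tell BOINC to run two tasks per GPU is an app_config.xml file in the project's directory, with gpu_usage set to 0.5. The sketch below generates such a file. It is an assumption that this mechanism was used here, since Einstein@Home also exposes a similar setting in its web preferences. The app name is a hypothetical placeholder, and the real one has to be looked up in the client's state or on the project's applications page.

```python
import xml.etree.ElementTree as ET

# Hypothetical placeholder; replace with the app's real short name.
EXAMPLE_APP_NAME = "EXAMPLE_O4AS_APP"


def build_app_config(app_name, tasks_per_gpu, cpus_per_task=1.0):
    """Return app_config.xml text that lets `tasks_per_gpu` tasks share one GPU."""
    if tasks_per_gpu < 1:
        raise ValueError("tasks_per_gpu must be at least 1")
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu_versions = ET.SubElement(app, "gpu_versions")
    # Each task claims 1/N of a GPU, so N tasks are scheduled on it at once.
    ET.SubElement(gpu_versions, "gpu_usage").text = f"{1.0 / tasks_per_gpu:g}"
    # CPU cores reserved per GPU task to keep it fed.
    ET.SubElement(gpu_versions, "cpu_usage").text = f"{cpus_per_task:g}"
    ET.indent(root)  # pretty-print; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


config_text = build_app_config(EXAMPLE_APP_NAME, tasks_per_gpu=2)
```

Writing config_text to app_config.xml in the Einstein@Home project folder and having the client re-read its config files (or restarting it) applies the change. The cpu_usage value of one core per task is a guess to tune, not a measured setting.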
RAC climbed from ~1.2M/day to ~6.2M/day — roughly a 5× improvement, on identical hardware. The entire win came from removing software friction.
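The figures quoted here can be sanity-checked with a few lines of arithmetic. This is only a sketch using the numbers stated in this case study, and the helper names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def tasks_per_day(concurrent_tasks, seconds_per_task):
    """Tasks completed per day when `concurrent_tasks` run back to back all day."""
    return concurrent_tasks * SECONDS_PER_DAY / seconds_per_task


def speedup(before, after):
    """Ratio of the new score to the old one."""
    return after / before


daily_tasks = tasks_per_day(concurrent_tasks=2, seconds_per_task=502)
rac_ratio = speedup(before=1.2e6, after=6.2e6)
```

With two concurrent tasks at ~502 seconds each, daily_tasks comes out at about 344 tasks per day, and rac_ratio is about 5.2, consistent with the "roughly 5×" figure.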
What this taught me
The instinct, when something underperforms, is to add hardware. The real answer here was to read the system more carefully and figure out what was getting in the way. A driver model isn't the kind of thing a profiler flags for you — it took reading the right forum thread and asking the right person. Most of my favorite debugging stories have that shape: the obvious answer is wrong, and the real one is one layer deeper than where you were looking.
The other thing I took away: the volunteer-computing community is small and generous. A top contributor took the time to explain the WDDM/MPS distinction to a student, and that's the kind of network you only get into by showing up, asking specific questions, and crediting the people who help you.
The full INFT4000 paper, Breaking the Concurrency Ceiling, is available as a downloadable PDF. It documents the WDDM diagnosis, Petri33's correction, the Linux migration, and the validated 3.5× throughput improvement, with figures, command-by-command setup, and a full performance comparison table.