Prefill Is Compute-Bound. Decode Is Memory-Bound. Why Your GPU Shouldn’t Do Both.
Last quarter I helped a large enterprise size a Kubernetes cluster for real-time inference on their customer-facing LLM product. We started with 64 H100 SXM GPUs across 8 nodes, all running vLLM in monolithic mode. The results were not where we needed them …