Deploying a Custom LLM in Production: Four Architectures, Only One Works

How we deployed an 8B-parameter model on GCP, from oversized Docker images to a low-latency Vertex AI endpoint with vLLM. Real data, real tradeoffs.