llm-d Competence Center Switzerland
Deploy, scale, and operate distributed LLM inference with llm-d on Swiss cloud infrastructure. VSHN combines deep Kubernetes expertise with platform engineering to run your llm-d workloads on APPUiO, OpenShift, enterprise private cloud, or sovereign cloud infrastructure - reliably, securely, and with full Swiss data residency.
llm-d Consulting and Architecture
Design and deploy production-grade llm-d topologies tailored to your inference workloads. VSHN architects disaggregated serving stacks with separate prefill and decode phases, optimised KV cache transfer via NIXL, and intelligent request routing. We help you choose the right model server backend, GPU allocation, and Kubernetes cluster layout for maximum throughput and minimum latency.
Intelligent Inference Scheduling
Leverage llm-d's Envoy-based routing layer for prefix-cache-aware request scheduling and multi-tenant fairness. VSHN configures production-grade inference gateways with load-aware routing, priority queues, and session affinity so your applications achieve consistently low time-to-first-token while sharing GPU resources fairly across teams and workloads.
Prefill and Decode Disaggregation
Separate prefill and decode phases across dedicated GPU pools to optimise both time-to-first-token and inter-token latency independently. VSHN deploys llm-d's disaggregated architecture with NIXL-based KV cache transfer between nodes, allowing you to scale each phase independently on Kubernetes based on your workload profile and latency targets.
Swiss Cloud and GPU Infrastructure
LLM inference, model weights, and request logs stay in Swiss data centres. VSHN operates on Exoscale, cloudscale.ch, and other Swiss cloud providers, ensuring data residency and GDPR compliance for organisations that cannot afford to send sensitive prompts and completions to hyperscaler regions outside Switzerland.
Kubernetes-Native Operations
Run llm-d on production Kubernetes clusters with Helm charts, automated scaling, and GitOps workflows. VSHN deploys on APPUiO, Red Hat OpenShift, and enterprise Kubernetes platforms with NVIDIA device plugins, GPU resource quotas, and horizontal pod autoscaling based on inference queue depth and latency targets to optimise cost and performance.
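As an illustration of what queue-depth-driven autoscaling can look like, here is a minimal sketch using the official Kubernetes Python client. The deployment name llmd-decode, the namespace, and the custom metric inference_queue_depth are hypothetical placeholders, and the sketch assumes a metrics adapter (such as prometheus-adapter) already exposes the metric through the custom metrics API; actual resource and metric names depend on your llm-d deployment.

```python
# Sketch: autoscale a hypothetical llm-d decode deployment on a
# custom queue-depth metric. Assumes a metrics adapter exposes
# "inference_queue_depth" via the custom metrics API.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() inside a pod

hpa = client.V2HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="llmd-decode", namespace="inference"),
    spec=client.V2HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V2CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="llmd-decode"
        ),
        min_replicas=2,
        max_replicas=8,
        metrics=[
            client.V2MetricSpec(
                type="Pods",
                pods=client.V2PodsMetricSource(
                    metric=client.V2MetricIdentifier(name="inference_queue_depth"),
                    # scale out when pods average more than 4 queued requests
                    target=client.V2MetricTarget(
                        type="AverageValue", average_value="4"
                    ),
                ),
            )
        ],
    ),
)
client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="inference", body=hpa
)
```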
24/7 Support and Incident Response
Monitor llm-d inference latency, throughput, token generation rates, and GPU utilisation across your entire serving fleet. VSHN integrates Prometheus, Grafana, and custom dashboards into your platform with 24/7 operations support and SLA-backed incident response, so performance issues are caught and resolved before they affect your users and applications.
Frequently Asked Questions
What is llm-d and how does it differ from vLLM?
llm-d is an open-source, Kubernetes-native distributed inference serving stack that sits above model servers like vLLM. While vLLM handles single-node model execution, llm-d adds intelligent request routing, prefill/decode disaggregation, and distributed KV cache management across multiple nodes. VSHN deploys both technologies and helps you choose the right architecture for your inference workloads based on scale and latency requirements.
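For contrast, here is what the single-node layer looks like on its own: a minimal vLLM example (model name illustrative) that loads and runs a model on one machine, which is precisely the building block llm-d coordinates across nodes.

```python
# Single-node vLLM: load a model and generate locally.
# llm-d adds routing, disaggregation, and KV cache management above this.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # illustrative model
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["What is distributed inference?"], params)
print(outputs[0].outputs[0].text)
```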
What platforms does VSHN support for llm-d workloads?
VSHN deploys and operates llm-d workloads on APPUiO (our managed Kubernetes platform), Red Hat OpenShift, enterprise private cloud infrastructure, and sovereign cloud partners. All platforms run in Swiss or European data centres and are backed by our 99.99% uptime SLA. We help you choose the right platform based on your compliance, performance, and budget requirements.
Which cloud providers are available for llm-d hosting?
VSHN operates on multiple Swiss cloud providers including Exoscale and cloudscale.ch, as well as European sovereign cloud partners. For organisations that need GPU-accelerated workloads, we work with providers offering GPU instances in Swiss data centres, on both public and private cloud. All infrastructure is managed under a single SLA with 24/7 support from our operations team.
How does prefill/decode disaggregation improve performance?
Disaggregation separates the compute-intensive prefill phase from the memory-bound decode phase onto dedicated GPU pools. This allows each phase to be scaled independently and optimised for its specific resource profile. VSHN configures NIXL-based KV cache transfer between nodes on Kubernetes, achieving lower time-to-first-token and higher overall throughput compared to monolithic serving.
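A back-of-envelope sketch shows why sizing the pools independently pays off; every number below is an illustrative assumption, not a benchmark.

```python
# Illustrative capacity sizing for separate prefill and decode pools.
# All figures are assumed workload numbers, not measured benchmarks.
requests_per_s = 20
avg_prompt_tokens = 1_500   # prefill work per request
avg_output_tokens = 300     # decode work per request

prefill_tok_per_gpu_s = 15_000  # assumed compute-bound prefill throughput
decode_tok_per_gpu_s = 1_500    # assumed memory-bound decode throughput

prefill_gpus = requests_per_s * avg_prompt_tokens / prefill_tok_per_gpu_s
decode_gpus = requests_per_s * avg_output_tokens / decode_tok_per_gpu_s
print(f"prefill pool: ~{prefill_gpus:.0f} GPUs, decode pool: ~{decode_gpus:.0f} GPUs")
```

With these assumptions the two pools come out at different sizes (roughly 2 and 4 GPUs here), which a monolithic deployment would have to over-provision to cover both resource profiles at once.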
How does VSHN scope and quote llm-d consulting engagements?
Every engagement starts with a free architecture consultation where we assess your model serving needs, GPU requirements, and compliance constraints. VSHN then delivers a written scope document with a fixed-price or time-and-materials quote in CHF. Typical engagements cover cluster design, llm-d deployment, observability setup with Prometheus and Grafana, and backup automation for model artefacts and configuration data (storage from 100 GB upward). There is no commitment at the scoping stage.
How does VSHN ensure data sovereignty for llm-d workloads?
All infrastructure runs in Swiss data centres operated by Swiss or European sovereign cloud providers. Model weights, input prompts, generated completions, and inference logs never leave the chosen jurisdiction. As a VSHN Swiss Select Partner, we guarantee that all operational access is performed by Switzerland-based engineers, and we provide audit trails for compliance reporting.
Can llm-d integrate with existing AI pipelines?
Yes. llm-d exposes an OpenAI-compatible API through its gateway layer, so existing applications using OpenAI client libraries can switch to self-hosted models without code changes. VSHN also integrates llm-d with LiteLLM (https://www.litellm.ch) gateways, retrieval-augmented generation pipelines, and managed PostgreSQL with pgvector for vector storage - with automated backups and the same 99.99% SLA as all our managed services.
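A minimal sketch of that switch using the official OpenAI Python client; the gateway URL, token, and model name are placeholders for your own deployment.

```python
# Point the standard OpenAI client at a self-hosted llm-d gateway.
# base_url, api_key, and model are deployment-specific placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.example.ch/v1",  # your llm-d inference gateway
    api_key="your-gateway-token",
)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarise llm-d in one sentence."}],
)
print(response.choices[0].message.content)
```

The only change compared to calling a hosted OpenAI endpoint is the base_url and credentials; the request and response shapes stay the same.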
What monitoring and observability does VSHN provide for llm-d?
VSHN integrates Prometheus and Grafana into every managed platform, with custom dashboards for llm-d-specific metrics: inference latency (p50, p95, p99), tokens per second, GPU utilisation, queue depth, and estimated cost per request. Alerting rules notify your team and our 24/7 operations centre when metrics breach thresholds, so performance issues are caught before they affect users.
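As a concrete example, here is a p95 latency check against Prometheus' HTTP API; the histogram name llmd_request_duration_seconds_bucket is a placeholder for whatever metric your llm-d or vLLM exporters actually publish.

```python
# Query Prometheus for p95 inference latency over the last 5 minutes.
# The metric name is a placeholder; adjust to your exporters.
import requests

PROM_URL = "https://prometheus.example.ch/api/v1/query"
query = (
    "histogram_quantile(0.95, "
    "sum(rate(llmd_request_duration_seconds_bucket[5m])) by (le))"
)
resp = requests.get(PROM_URL, params={"query": query}, timeout=10)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(f"p95 inference latency: {float(result['value'][1]):.3f}s")
```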
How do I get started with VSHN's llm-d services?
Contact us through the form below for an initial consultation. We assess your current model serving needs, platform requirements, and compliance constraints, then propose an architecture running on APPUiO, OpenShift, or your preferred infrastructure. Most customers go from initial consultation to a running production platform in four to six weeks.
Get in touch
Ready to run distributed LLM inference on Swiss infrastructure? Contact VSHN for a free initial consultation. We assess your requirements and propose a platform architecture tailored to your models, compliance needs, and budget.