llm-d Competence Center Switzerland
Deploy, scale, and operate distributed LLM inference with llm-d on Swiss cloud infrastructure. VSHN combines deep Kubernetes expertise with platform engineering to run your llm-d workloads on APPUiO, OpenShift, enterprise private cloud, or sovereign cloud infrastructure - reliably, securely, and with full Swiss data residency.
llm-d Consulting and Architecture
Design and deploy production-grade llm-d topologies tailored to your inference workloads. VSHN architects disaggregated serving stacks with separate prefill and decode phases, optimised KV cache transfer via NIXL, and intelligent request routing. We help you choose the right model server backend, GPU allocation, and Kubernetes cluster layout for maximum throughput and minimum latency.
Intelligent Inference Scheduling
Leverage llm-d's Envoy-based routing layer for prefix-cache-aware request scheduling and multi-tenant fairness. VSHN configures production-grade inference gateways with load-aware routing, priority queues, and session affinity so your applications achieve consistently low time-to-first-token while sharing GPU resources fairly across teams and workloads.
Prefill and Decode Disaggregation
Separate prefill and decode phases across dedicated GPU pools to optimise both time-to-first-token and inter-token latency independently. VSHN deploys llm-d's disaggregated architecture with NIXL-based KV cache transfer between nodes, allowing you to scale each phase independently on Kubernetes based on your workload profile and latency targets.
Swiss Cloud and GPU Infrastructure
LLM inference, model weights, and request logs stay in Swiss data centres. VSHN operates on Exoscale, cloudscale.ch, and other Swiss cloud providers, ensuring data residency and GDPR compliance for organisations that cannot afford to send sensitive prompts and completions to hyperscaler regions outside Switzerland.
Kubernetes-Native Operations
Run llm-d on production Kubernetes clusters with Helm charts, automated scaling, and GitOps workflows. VSHN deploys on APPUiO, Red Hat OpenShift, and enterprise Kubernetes platforms with NVIDIA device plugins, GPU resource quotas, and horizontal pod autoscaling based on inference queue depth and latency targets to optimise cost and performance.
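As an illustration of what queue-depth-driven autoscaling can look like, here is a minimal sketch using the official Kubernetes Python client. The deployment name llmd-decode, the namespace, and the custom metric inference_queue_depth are hypothetical placeholders, and the sketch assumes a metrics adapter (such as prometheus-adapter) already exposes the metric through the custom metrics API; actual resource and metric names depend on your llm-d deployment.

```python
# Sketch: autoscale a hypothetical llm-d decode deployment on a
# custom queue-depth metric. Assumes a metrics adapter exposes
# "inference_queue_depth" via the custom metrics API.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() inside a pod

hpa = client.V2HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="llmd-decode", namespace="inference"),
    spec=client.V2HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V2CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="llmd-decode"
        ),
        min_replicas=2,
        max_replicas=8,
        metrics=[
            client.V2MetricSpec(
                type="Pods",
                pods=client.V2PodsMetricSource(
                    metric=client.V2MetricIdentifier(name="inference_queue_depth"),
                    # scale out when pods average more than 4 queued requests
                    target=client.V2MetricTarget(
                        type="AverageValue", average_value="4"
                    ),
                ),
            )
        ],
    ),
)
client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="inference", body=hpa
)
```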
24/7 Support and Incident Response
Monitor llm-d inference latency, throughput, token generation rates, and GPU utilisation across your entire serving fleet. VSHN integrates Prometheus, Grafana, and custom dashboards into your platform with 24/7 operations support and SLA-backed incident response, so performance issues are caught and resolved before they affect your users and applications.
Frequently Asked Questions
What is llm-d and how does it differ from vLLM?
llm-d is an open-source, Kubernetes-native distributed inference serving stack that sits above model servers like vLLM. While vLLM handles single-node model execution, llm-d adds intelligent request routing, prefill/decode disaggregation, and distributed KV cache management across multiple nodes. VSHN deploys both technologies and helps you choose the right architecture for your inference workloads based on scale and latency requirements.
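For contrast, here is what the single-node layer looks like on its own: a minimal vLLM example (model name illustrative) that loads and runs a model on one machine, which is precisely the building block llm-d coordinates across nodes.

```python
# Single-node vLLM: load a model and generate locally.
# llm-d adds routing, disaggregation, and KV cache management above this.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # illustrative model
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["What is distributed inference?"], params)
print(outputs[0].outputs[0].text)
```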
What platforms does VSHN support for llm-d workloads?
VSHN deploys and operates llm-d workloads on APPUiO (our managed Kubernetes platform), Red Hat OpenShift, enterprise private cloud infrastructure, and sovereign cloud partners. All platforms run in Swiss or European data centres and are backed by our 99.99% uptime SLA. We help you choose the right platform based on your compliance, performance, and budget requirements.
Which cloud providers are available for llm-d hosting?
VSHN operates on multiple Swiss cloud providers including Exoscale and cloudscale.ch, as well as European sovereign cloud partners. For organisations that need GPU-accelerated workloads, we work with providers offering GPU instances in Swiss data centres, on both public and private cloud. All infrastructure is managed under a single SLA with 24/7 support from our operations team.
How does prefill/decode disaggregation improve performance?
Disaggregation separates the compute-intensive prefill phase from the memory-bound decode phase onto dedicated GPU pools. This allows each phase to be scaled independently and optimised for its specific resource profile. VSHN configures NIXL-based KV cache transfer between nodes on Kubernetes, achieving lower time-to-first-token and higher overall throughput compared to monolithic serving.
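A back-of-envelope sketch shows why sizing the pools independently pays off; every number below is an illustrative assumption, not a benchmark.

```python
# Illustrative capacity sizing for separate prefill and decode pools.
# All figures are assumed workload numbers, not measured benchmarks.
requests_per_s = 20
avg_prompt_tokens = 1_500   # prefill work per request
avg_output_tokens = 300     # decode work per request

prefill_tok_per_gpu_s = 15_000  # assumed compute-bound prefill throughput
decode_tok_per_gpu_s = 1_500    # assumed memory-bound decode throughput

prefill_gpus = requests_per_s * avg_prompt_tokens / prefill_tok_per_gpu_s
decode_gpus = requests_per_s * avg_output_tokens / decode_tok_per_gpu_s
print(f"prefill pool: ~{prefill_gpus:.0f} GPUs, decode pool: ~{decode_gpus:.0f} GPUs")
```

With these assumptions the two pools come out at different sizes (roughly 2 and 4 GPUs here), which a monolithic deployment would have to over-provision to cover both resource profiles at once.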
How does VSHN scope and quote llm-d consulting engagements?
Every engagement starts with a free architecture consultation where we assess your model serving needs, GPU requirements, and compliance constraints. VSHN then delivers a written scope document with a fixed-price or time-and-materials quote in CHF. Typical engagements cover cluster design, llm-d deployment, observability setup with Prometheus and Grafana, and backup automation for model artefacts and configuration data (storage from 100 GB upward). There is no commitment at the scoping stage.
How does VSHN ensure data sovereignty for llm-d workloads?
All infrastructure runs in Swiss data centres operated by Swiss or European sovereign cloud providers. Model weights, input prompts, generated completions, and inference logs never leave the chosen jurisdiction. As a VSHN Swiss Select Partner, we guarantee that all operational access is performed by Switzerland-based engineers, and we provide audit trails for compliance reporting.
Can llm-d integrate with existing AI pipelines?
Yes. llm-d exposes an OpenAI-compatible API through its gateway layer, so existing applications using OpenAI client libraries can switch to self-hosted models without code changes. VSHN also integrates llm-d with LiteLLM (https://www.litellm.ch) gateways, retrieval-augmented generation pipelines, and managed PostgreSQL with pgvector for vector storage - with automated backups and the same 99.99% SLA as all our managed services.
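A minimal sketch of that switch using the official OpenAI Python client; the gateway URL, token, and model name are placeholders for your own deployment.

```python
# Point the standard OpenAI client at a self-hosted llm-d gateway.
# base_url, api_key, and model are deployment-specific placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.example.ch/v1",  # your llm-d inference gateway
    api_key="your-gateway-token",
)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarise llm-d in one sentence."}],
)
print(response.choices[0].message.content)
```

The only change compared to calling a hosted OpenAI endpoint is the base_url and credentials; the request and response shapes stay the same.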
What monitoring and observability does VSHN provide for llm-d?
VSHN integrates Prometheus and Grafana into every managed platform, with custom dashboards for llm-d-specific metrics: inference latency (p50, p95, p99), tokens per second, GPU utilisation, queue depth, and estimated cost per request. Alerting rules notify your team and our 24/7 operations centre when metrics breach thresholds, so performance issues are caught before they affect users.
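As a concrete example, here is a p95 latency check against Prometheus' HTTP API; the histogram name llmd_request_duration_seconds_bucket is a placeholder for whatever metric your llm-d or vLLM exporters actually publish.

```python
# Query Prometheus for p95 inference latency over the last 5 minutes.
# The metric name is a placeholder; adjust to your exporters.
import requests

PROM_URL = "https://prometheus.example.ch/api/v1/query"
query = (
    "histogram_quantile(0.95, "
    "sum(rate(llmd_request_duration_seconds_bucket[5m])) by (le))"
)
resp = requests.get(PROM_URL, params={"query": query}, timeout=10)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(f"p95 inference latency: {float(result['value'][1]):.3f}s")
```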
How do I get started with VSHN's llm-d services?
Contact us through the form below for an initial consultation. We assess your current model serving needs, platform requirements, and compliance constraints, then propose an architecture running on APPUiO, OpenShift, or your preferred infrastructure. Most customers go from initial consultation to a running production platform in four to six weeks.
Get in touch
Ready to run distributed LLM inference on Swiss infrastructure? Contact VSHN for a free initial consultation. We assess your requirements and propose a platform architecture tailored to your models, compliance needs, and budget.