LLMInferenceService Config Composition
A full LLM inference deployment touches container images, probes, security contexts, scheduler settings, routing rules, resource limits, and more. Most users should not have to care about all of that. KServe ships sensible defaults that cover common deployments out of the box - a working service needs a model URI, a reference to a hardware profile, and a few feature toggles like routing and scheduling. When you do need to change something - a different GPU type, a custom routing policy, a longer startup probe - you override only that field and the defaults you did not touch stay in place.
When you create an LLMInferenceService, the controller does not apply your spec directly. Instead, it builds an effective configuration by merging multiple sources together using Kubernetes strategic merge patch. This process - config composition - determines the final shape of the resources that will be responsible for reliably serving your model.
Understanding composition helps when you need to debug unexpected behavior, override a default, or design reusable config fragments for your team.
Prerequisites: Read the LLMInferenceService overview and the configuration guide first.
Configuration Sources and Merge Order
Three types of configuration participate in the merge, each with a different owner and priority:
| Source | Owner | Purpose | Priority |
|---|---|---|---|
| Well-known configs | Platform (shipped with KServe) | Auto-injected by the controller based on the spec shape. Set up the llm-d stack: vLLM container, scheduler, routes, probes, volumes, sidecars. | Lowest |
| User baseRefs | User or admin | LLMInferenceServiceConfig resources referenced via spec.baseRefs. Hardware profiles (GPU types, nodeSelectors, images), org-specific defaults. | Middle (ordered) |
| LLMInferenceService spec | User | The service itself. Model URI, replicas, field overrides. | Highest |
All selected configs are merged in a fixed order. Each step applies a Kubernetes strategic merge patch on top of the previous result. Later values override earlier ones.
Well-known configs are not created by users. They are installed as part of KServe and automatically selected by the controller based on the spec shape (see Config Injection below). User baseRefs are LLMInferenceServiceConfig resources that you or a platform admin create and reference in spec.baseRefs - multiple baseRefs are merged in order, later entries override earlier ones. The LLMInferenceService spec itself always wins.
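The merge order described above can be pictured as a fold over an ordered list of layers. The following is a minimal sketch in Python, not the controller's code: `deep_merge` and `compose` are hypothetical helpers, the layers are plain dicts, lists are ignored for simplicity, and the H100 label value is made up for illustration.

```python
from copy import deepcopy


def deep_merge(base, override):
    """Layer override on top of base: nested dicts merge, other values replace."""
    result = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


def compose(well_known, base_refs, spec):
    """Fold layers lowest priority first: well-known, baseRefs in listed order, spec."""
    effective = {}
    for layer in [*well_known, *base_refs, spec]:
        effective = deep_merge(effective, layer)
    return effective


# Illustrative layers only; real configs carry many more fields.
well_known = [{"router": {"scheduler": {"pool": {"spec": {"targetPort": 8000}}}}}]
base_refs = [
    {"replicas": 2, "template": {"nodeSelector": {"nvidia.com/gpu.product": "NVIDIA-A100-PCIE-40G"}}},
    # A later baseRef overrides an earlier one (hypothetical H100 label value).
    {"template": {"nodeSelector": {"nvidia.com/gpu.product": "NVIDIA-H100-EXAMPLE"}}},
]
spec = {"replicas": 3}

effective = compose(well_known, base_refs, spec)
print(effective)
```

In this sketch the spec's replicas value of 3 replaces the baseRef's 2, the second baseRef's nodeSelector value replaces the first, and the scheduler port from the well-known layer survives because no later layer mentions it.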
Example: Single-Node with Hardware baseRef
Here is what a typical deployment looks like. The user writes a short LLMInferenceService that references a hardware profile via baseRefs, picks a model, enables routing and scheduling - and overrides the profile's default replica count from 2 to 3. The controller takes care of the rest.
User provides:
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: llama-3-8b
  namespace: my-team
spec:
  baseRefs:
    - name: my-gpu-profile
  model:
    uri: hf://meta-llama/Llama-3.1-8B-Instruct
    name: meta-llama/Llama-3.1-8B-Instruct
  replicas: 3 # overrides baseRef's replicas: 2
  router:
    route: {}
    scheduler: {}
Controller resolves:
- The spec has no prefill and no worker - single-node deployment. The controller injects kserve-config-llm-template.
- The spec has router.scheduler defined without an external pool ref - the controller injects kserve-config-llm-scheduler.
- The spec has router.route defined without external route refs - the controller injects kserve-config-llm-router-route.
- The baseRefs reference my-gpu-profile - the controller fetches it from the kserve namespace (not found in my-team).
The sources that get merged, in merge order (well-known configs are auto-injected, baseRef is resolved from the kserve namespace):

# kserve-config-llm-template (well-known)
template:
  containers:
    - name: main
      image: llm-d-cuda:v0.6.0
      ports:
        - containerPort: 8000
      livenessProbe: [...]
      readinessProbe: [...]
      startupProbe: [...]
      securityContext: [...]
  volumes: [...]

# kserve-config-llm-scheduler (well-known)
router:
  scheduler:
    pool:
      spec:
        selector: [...]
        targetPort: 8000
    template:
      containers:
        - name: epp
          image: llm-d-inference-scheduler
          ports: [9002]
        - name: tokenizer
          image: llm-d-uds-tokenizer
          ports: [8082]

# kserve-config-llm-router-route (well-known)
router:
  route:
    http:
      spec:
        parentRefs: # from KServe ingress config
          - kind: Gateway
        rules: [...] # 8 rules total
        # path + model-header per endpoint
        # URLRewrite filters
        # catch-all -> Service

# my-gpu-profile (baseRef)
replicas: 2
template:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-PCIE-40G
  containers:
    - name: main
      resources:
        requests:
          nvidia.com/gpu: "4"
          cpu: "8"
          memory: 64Gi
        limits:
          nvidia.com/gpu: "4"
          cpu: "16"
          memory: 128Gi

# LLMInferenceService spec
model:
  uri: hf://meta-llama/Llama-3.1-8B-Instruct
  name: meta-llama/Llama-3.1-8B-Instruct
replicas: 3
router:
  route: {}
  scheduler: {}

Effective merged result:
Config Injection
The controller inspects the LLMInferenceService spec and auto-injects well-known configs based on two independent criteria: the deployment pattern and the router components you enable.
Workload configs
The deployment pattern determines which workload config is injected. The controller looks at two fields - spec.prefill and spec.worker - to determine the topology:
- Single-node: Neither prefill nor worker is set. One Deployment runs the model on a single pod. This is the simplest topology, suitable for smaller models that fit on one node.
- Multi-node (data parallel): worker is set with data parallelism. A LeaderWorkerSet distributes inference across multiple nodes, each running a shard of the model. Used when a model is too large for a single node or when you need higher throughput.
- Disaggregated (prefill-decode): prefill is set. Prompt processing (prefill) and token generation (decode) run as separate workloads with independent scaling. This allows heterogeneous hardware - high-FLOPS GPUs for prefill, high-bandwidth GPUs for decode. Each side can also run multi-node with worker + data parallelism.
The resulting combinations are mutually exclusive: exactly one row of the following table applies to a given spec.
| Your spec has... | Topology | Well-known config(s) injected |
|---|---|---|
| No prefill, no worker | Single-node | kserve-config-llm-template |
| No prefill, worker + DataParallel | Multi-node | kserve-config-llm-worker-data-parallel |
| prefill defined, no worker | Disaggregated, single-node each | kserve-config-llm-prefill-template + kserve-config-llm-decode-template |
| prefill defined, worker + DataParallel | Disaggregated, multi-node each | kserve-config-llm-prefill-worker-data-parallel + kserve-config-llm-decode-worker-data-parallel |
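The table above amounts to a small decision function. The sketch below is an illustration in Python rather than the controller's logic: `workload_configs` is a hypothetical helper, the spec is a plain dict, and it assumes that the presence of a top-level `worker` key stands in for "worker + DataParallel".

```python
def workload_configs(spec):
    """Return the well-known workload config names for a spec shape, per the table above."""
    has_prefill = spec.get("prefill") is not None
    has_worker = spec.get("worker") is not None  # assumption: worker implies data parallelism

    if has_prefill and has_worker:
        return [
            "kserve-config-llm-prefill-worker-data-parallel",
            "kserve-config-llm-decode-worker-data-parallel",
        ]
    if has_prefill:
        return [
            "kserve-config-llm-prefill-template",
            "kserve-config-llm-decode-template",
        ]
    if has_worker:
        return ["kserve-config-llm-worker-data-parallel"]
    return ["kserve-config-llm-template"]


print(workload_configs({"model": {"uri": "hf://meta-llama/Llama-3.1-8B-Instruct"}}))
print(workload_configs({"prefill": {}, "worker": {}}))
```

Because the branches are checked from most specific to least specific, exactly one row matches any given spec.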
Router configs
Router configs are injected independently and can combine with any workload config above.
| Your spec has... | Well-known config injected |
|---|---|
| router.scheduler without external pool ref | kserve-config-llm-scheduler |
| router.route without external route refs | kserve-config-llm-router-route |
A single-node deployment with a managed scheduler and route will inject three configs: kserve-config-llm-template, kserve-config-llm-scheduler, and kserve-config-llm-router-route.
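The router rules can be sketched the same way. This is an illustrative Python helper, not KServe code: `router_configs` is hypothetical, and the keys used to detect external references (`pool.ref` under the scheduler and `http.refs` under the route) are assumptions about the spec layout that should be checked against the Configuration Guide.

```python
def router_configs(spec):
    """Return the well-known router config names implied by spec.router."""
    router = spec.get("router") or {}
    names = []

    scheduler = router.get("scheduler")
    # An empty mapping such as `scheduler: {}` still counts as "defined".
    if scheduler is not None and not (scheduler.get("pool") or {}).get("ref"):
        names.append("kserve-config-llm-scheduler")

    route = router.get("route")
    if route is not None and not (route.get("http") or {}).get("refs"):
        names.append("kserve-config-llm-router-route")

    return names


managed = {"router": {"route": {}, "scheduler": {}}}
print(router_configs(managed))
```

For the managed case shown, both router configs are selected, which together with kserve-config-llm-template gives the three configs named above.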
Strategic Merge Patch Behavior
The controller uses Kubernetes strategicpatch.StrategicMergePatch to combine configs. Here is how the merge works at the field level:
- Non-zero fields from the override are applied to the base. Existing base fields that the override does not mention are left untouched.
- Zero-valued fields (empty string "", 0, nil, false) in the override do not overwrite base values. This prevents a config that does not specify a port from wiping out the well-known config's port.
- Container lists are merged by name. The main container from different sources merges into a single main container rather than creating duplicates. Other list fields may replace entirely depending on their strategic merge patch annotations.
Example: Adding Resources Without Losing Existing Fields
# Base (from well-known config)
template:
  containers:
    - name: main
      image: ghcr.io/llm-d/llm-d-cuda:v0.6.0
      ports:
        - containerPort: 8000
---
# Override (from baseRef) - only specifies resources
template:
  containers:
    - name: main
      resources:
        limits:
          nvidia.com/gpu: "1"
---
# Result - image and ports preserved, resources added
template:
  containers:
    - name: main
      image: ghcr.io/llm-d/llm-d-cuda:v0.6.0 # preserved from base
      ports:
        - containerPort: 8000 # preserved from base
      resources:
        limits:
          nvidia.com/gpu: "1" # added from override
When designing baseRefs, only include the fields you want to set or override. There is no need to repeat fields from the well-known config - they will be preserved through the merge.
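The field-level rules can be mimicked in a few lines of Python to check an override before applying it. This is a simplified model of the behavior described in this section, not the Kubernetes strategicpatch implementation: `merge`, `merge_named`, and `is_zero` are hypothetical helpers, only the `containers` list is merged by name, and every other list is replaced wholesale.

```python
from copy import deepcopy


def is_zero(value):
    """Treat None, "", 0 and False as unset, as described for override fields."""
    return value is None or value == "" or value == 0


def merge_named(base_items, override_items):
    """Merge two lists of dicts by their 'name' key; unmatched items are appended."""
    merged = [deepcopy(item) for item in base_items]
    index = {item["name"]: i for i, item in enumerate(merged)}
    for item in override_items:
        if item["name"] in index:
            pos = index[item["name"]]
            merged[pos] = merge(merged[pos], item)
        else:
            index[item["name"]] = len(merged)
            merged.append(deepcopy(item))
    return merged


def merge(base, override):
    """Apply override onto base without letting zero values erase base fields."""
    result = deepcopy(base)
    for key, value in override.items():
        if is_zero(value):
            continue
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        elif key == "containers" and isinstance(result.get(key), list):
            result[key] = merge_named(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


base = {"template": {"containers": [{
    "name": "main",
    "image": "ghcr.io/llm-d/llm-d-cuda:v0.6.0",
    "ports": [{"containerPort": 8000}],
}]}}
override = {"template": {"containers": [{
    "name": "main",
    "resources": {"limits": {"nvidia.com/gpu": "1"}},
}]}}

result = merge(base, override)
print(result)
```

Run against the base and override from the example above, the sketch keeps the image and ports of the main container and adds the resources block, matching the documented result.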
Config Namespace Resolution
For each config reference (both well-known and baseRef), the controller looks up the LLMInferenceServiceConfig in this order:
- The LLMInferenceService's own namespace (highest priority)
- The KServe system namespace (typically kserve)
If not found in either namespace, the controller sets the PresetsCombined condition to False with reason ConfigNotFound.
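The lookup order can be expressed as a two-step search. The sketch below is illustrative Python, not the controller's code: `resolve_config` and the `ConfigNotFound` exception are hypothetical names, and the cluster is modelled as a dict keyed by (namespace, name).

```python
class ConfigNotFound(LookupError):
    """Raised when a config exists in neither the service nor the system namespace."""


def resolve_config(name, service_namespace, configs, system_namespace="kserve"):
    """Return (namespace, config) for the first namespace that holds the config."""
    for namespace in (service_namespace, system_namespace):
        if (namespace, name) in configs:
            return namespace, configs[(namespace, name)]
    raise ConfigNotFound(f"{name} not found in {service_namespace} or {system_namespace}")


# Hypothetical cluster contents for illustration.
configs = {
    ("kserve", "my-gpu-profile"): {"replicas": 2},
    ("kserve", "kserve-config-llm-template"): {},
    ("team-b", "my-gpu-profile"): {"replicas": 1},
}

print(resolve_config("my-gpu-profile", "my-team", configs))
print(resolve_config("my-gpu-profile", "team-b", configs))
```

A service in my-team falls back to the shared copy in kserve, while a service in team-b picks up its local copy first.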
Practical Implications
Platform admins ship shared configs in the kserve namespace. These are available to all services across the cluster.
Teams can create a same-name config in their own namespace to override the shared version. For example, if the kserve namespace contains my-gpu-profile with A100 settings, a team namespace can define its own my-gpu-profile with H100 settings - the local version takes precedence for services in that namespace.
Debugging config resolution: The status.appliedConfigs field records which configs were actually used and from where. Each entry is tagged with a source field:
- Preset - well-known config auto-injected by the controller
- UserRef - config referenced via spec.baseRefs
status:
  appliedConfigs:
    - name: kserve-config-llm-template
      namespace: kserve
      source: Preset
    - name: my-gpu-profile
      namespace: my-team
      source: UserRef
See the Status Reference for details on the PresetsCombined condition, appliedConfigs field, and troubleshooting config resolution failures.
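When checking which configs were applied, it can help to group the appliedConfigs entries by source. The following is a small Python sketch that assumes the status has already been loaded into a dict (for example from JSON output); `group_applied_configs` is a hypothetical helper, and the field names follow the example above.

```python
from collections import defaultdict


def group_applied_configs(status):
    """Group appliedConfigs entries as 'namespace/name' strings under their source tag."""
    grouped = defaultdict(list)
    for entry in status.get("appliedConfigs", []):
        grouped[entry["source"]].append(f'{entry["namespace"]}/{entry["name"]}')
    return dict(grouped)


status = {
    "appliedConfigs": [
        {"name": "kserve-config-llm-template", "namespace": "kserve", "source": "Preset"},
        {"name": "my-gpu-profile", "namespace": "my-team", "source": "UserRef"},
    ]
}

print(group_applied_configs(status))
```

A UserRef entry whose namespace is the service's own namespace indicates that a local config took precedence over any shared copy in the system namespace.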
Well-Known Config Reference
The controller ships with pre-installed LLMInferenceServiceConfig resources in the KServe system namespace. Each is injected automatically when the corresponding spec pattern is detected (see Config Injection).
| Config Name | Injected When | What It Sets Up |
|---|---|---|
| kserve-config-llm-template | Single-node (no prefill, no worker) | vLLM container, probes, volumes, TLS, security context |
| kserve-config-llm-worker-data-parallel | Multi-node + DataParallel | Leader and worker templates, DP addressing, shared memory |
| kserve-config-llm-prefill-template | Disaggregated prefill (single-node) | Prefill container |
| kserve-config-llm-decode-template | Disaggregated decode (single-node) | Decode container, routing sidecar |
| kserve-config-llm-prefill-worker-data-parallel | Disaggregated prefill + multi-node DP | Multi-node prefill with DP addressing |
| kserve-config-llm-decode-worker-data-parallel | Disaggregated decode + multi-node DP | Multi-node decode with DP and routing sidecar |
| kserve-config-llm-scheduler | router.scheduler with inline pool | Endpoint Picker (EPP) deployment, tokenizer sidecar, InferencePool |
| kserve-config-llm-router-route | router.route without external route refs | HTTPRoute with path-based and model-header routing, URLRewrite, catch-all rules |
Next Steps
- Configuration Guide: Full field reference for LLMInferenceService spec
- Status Reference: Understanding conditions, appliedConfigs, and troubleshooting
- Architecture Guide: How the controller processes these configs during reconciliation