Version: Next

LLMInferenceService Config Composition

A full LLM inference deployment touches container images, probes, security contexts, scheduler settings, routing rules, resource limits, and more. Most users should not have to care about all of that. KServe ships sensible defaults that cover common deployments out of the box - a working service needs a model URI, a reference to a hardware profile, and a few feature toggles like routing and scheduling. When you do need to change something - a different GPU type, a custom routing policy, a longer startup probe - you override only that field and the defaults you did not touch stay in place.

When you create an LLMInferenceService, the controller does not apply your spec directly. Instead, it builds an effective configuration by merging multiple sources together using Kubernetes strategic merge patch. This process - config composition - determines the final shape of the resources that serve your model.

Understanding composition helps when you need to debug unexpected behavior, override a default, or design reusable config fragments for your team.

Prerequisites: Read the LLMInferenceService overview and the configuration guide first.


Configuration Sources and Merge Order

Three types of configuration participate in the merge, each with a different owner and priority:

| Source | Owner | Purpose | Priority |
| --- | --- | --- | --- |
| Well-known configs | Platform (shipped with KServe) | Auto-injected by the controller based on the spec shape. Set up the llm-d stack: vLLM container, scheduler, routes, probes, volumes, sidecars. | Lowest |
| User baseRefs | User or admin | LLMInferenceServiceConfig resources referenced via spec.baseRefs. Hardware profiles (GPU types, nodeSelectors, images), org-specific defaults. | Middle (ordered) |
| LLMInferenceService spec | User | The service itself. Model URI, replicas, field overrides. | Highest |

All selected configs are merged in a fixed order. Each step applies a Kubernetes strategic merge patch on top of the previous result. Later values override earlier ones.

Well-known configs are not created by users. They are installed as part of KServe and automatically selected by the controller based on the spec shape (see Config Injection below). User baseRefs are LLMInferenceServiceConfig resources that you or a platform admin create and reference in spec.baseRefs - multiple baseRefs are merged in order, later entries override earlier ones. The LLMInferenceService spec itself always wins.
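
The precedence described above can be illustrated with a minimal Python sketch. It only resolves top-level fields and is not the controller's implementation: the real merge is a recursive strategic merge patch. The `tier` field and the well-known default of `replicas: 1` are hypothetical values chosen for illustration.

```python
from collections import ChainMap


def effective_top_level(well_known, base_refs, spec):
    """Resolve top-level fields with 'later layer wins' precedence.

    Layers are listed lowest-priority first: well-known configs, then
    baseRefs in their listed order, then the service spec. ChainMap looks
    up keys in its first mapping first, so the layers are reversed.
    """
    layers = [*well_known, *base_refs, spec]
    return dict(ChainMap(*reversed(layers)))


well_known = [{"replicas": 1}]                  # hypothetical platform default
base_refs = [
    {"replicas": 2, "tier": "a100"},            # earlier baseRef
    {"tier": "h100"},                           # later baseRef overrides "tier"
]
spec = {"replicas": 3}                          # the spec always wins

result = effective_top_level(well_known, base_refs, spec)
print(result)
```

Here `replicas` comes from the spec and `tier` from the later baseRef, which mirrors the ordering rule: each layer only needs to state the fields it wants to change.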


Example: Single-Node with Hardware baseRef

Here is what a typical deployment looks like. The user writes a short LLMInferenceService that references a hardware profile via baseRefs, picks a model, enables routing and scheduling - and overrides the profile's default replica count from 2 to 3. The controller takes care of the rest.

User provides:

apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: llama-3-8b
  namespace: my-team
spec:
  baseRefs:
    - name: my-gpu-profile
  model:
    uri: hf://meta-llama/Llama-3.1-8B-Instruct
    name: meta-llama/Llama-3.1-8B-Instruct
  replicas: 3 # overrides baseRef's replicas: 2
  router:
    route: {}
    scheduler: {}

Controller resolves:

  1. The spec has no prefill and no worker - single-node deployment. The controller injects kserve-config-llm-template.
  2. The spec has router.scheduler defined without an external pool ref - the controller injects kserve-config-llm-scheduler.
  3. The spec has router.route defined without external route refs - the controller injects kserve-config-llm-router-route.
  4. The baseRefs reference my-gpu-profile - the controller fetches it from the kserve namespace (not found in my-team).

The sources that get merged (well-known configs are auto-injected, baseRef is resolved from the kserve namespace):

template:
  containers:
    - name: main
      image: ghcr.io/llm-d/llm-d-cuda:v0.6.0
      ports:
        - containerPort: 8000
      livenessProbe: [...]
      readinessProbe: [...]
      startupProbe: [...]
      securityContext: [...]
  volumes: [...]
+
router:
  scheduler:
    pool:
      spec:
        selector: [...]
        targetPort: 8000
    template:
      containers:
        - name: epp
          image: llm-d-inference-scheduler
          ports: [9002]
        - name: tokenizer
          image: llm-d-uds-tokenizer
          ports: [8082]
+
router:
  route:
    http:
      spec:
        parentRefs:  # from KServe ingress config
          - kind: Gateway
        rules: [...]  # 8 rules total
          # path + model-header per endpoint
          # URLRewrite filters
          # catch-all -> Service
+
baseRef: my-gpu-profile
replicas: 2
template:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-PCIE-40G
  containers:
    - name: main
      resources:
        requests:
          nvidia.com/gpu: "4"
          cpu: "8"
          memory: 64Gi
        limits:
          nvidia.com/gpu: "4"
          cpu: "16"
          memory: 128Gi
+
LLMInferenceService spec
model:
  uri: hf://meta-llama/Llama-3.1-8B-Instruct
  name: meta-llama/Llama-3.1-8B-Instruct
replicas: 3
router:
  route: {}
  scheduler: {}
=

Effective merged result:

Effective merged configuration
model:
  uri: hf://meta-llama/Llama-3.1-8B-Instruct
  name: meta-llama/Llama-3.1-8B-Instruct
replicas: 3  # spec overrides baseRef (was: 2)
template:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-PCIE-40G
  containers:
    - name: main
      image: ghcr.io/llm-d/llm-d-cuda:v0.6.0
      ports:
        - containerPort: 8000
      resources:
        requests:
          nvidia.com/gpu: "4"
          cpu: "8"
          memory: 64Gi
        limits:
          nvidia.com/gpu: "4"
          cpu: "16"
          memory: 128Gi
      livenessProbe:
        httpGet:
          path: /health
          port: 8000
        periodSeconds: 10
      readinessProbe:
        httpGet:
          path: /health
          port: 8000
      startupProbe:
        httpGet:
          path: /health
          port: 8000
        failureThreshold: 120
        periodSeconds: 5
      securityContext:
        readOnlyRootFilesystem: true
        runAsNonRoot: true
  volumes:
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: 1Gi
    - name: tmp
      emptyDir:
router:
  scheduler:
    pool:
      spec:
        selector: ...
        targetPort: 8000
    template:
      containers:
        - name: epp
          image: llm-d-inference-scheduler
          ports: [9002]
        - name: tokenizer
          image: llm-d-uds-tokenizer
          ports: [8082]
  route:
    http:
      spec:
        parentRefs:  # from KServe ingress config
          - kind: Gateway
        rules: [...]  # 8 rules: path + model-header per endpoint, catch-all



Config Injection

The controller inspects the LLMInferenceService spec and auto-injects well-known configs based on two independent criteria: the deployment pattern and the router components you enable.

Workload configs

The deployment pattern determines which workload config is injected. The controller looks at two fields - spec.prefill and spec.worker - to determine the topology:

  • Single-node: Neither prefill nor worker is set. One Deployment runs the model on a single pod. This is the simplest topology, suitable for smaller models that fit on one node.
  • Multi-node (data parallel): worker is set with data parallelism. A LeaderWorkerSet distributes inference across multiple nodes, each running a shard of the model. Used when a model is too large for a single node or when you need higher throughput.
  • Disaggregated (prefill-decode): prefill is set. Prompt processing (prefill) and token generation (decode) run as separate workloads with independent scaling. This allows heterogeneous hardware - high-FLOPS GPUs for prefill, high-bandwidth GPUs for decode. Each side can also run multi-node with worker + data parallelism.

The workload paths are mutually exclusive: a given spec matches exactly one of the following combinations.

| Your spec has... | Topology | Well-known config(s) injected |
| --- | --- | --- |
| No prefill, no worker | Single-node | kserve-config-llm-template |
| No prefill, worker + DataParallel | Multi-node | kserve-config-llm-worker-data-parallel |
| prefill defined, no worker | Disaggregated, single-node each | kserve-config-llm-prefill-template + kserve-config-llm-decode-template |
| prefill defined, worker + DataParallel | Disaggregated, multi-node each | kserve-config-llm-prefill-worker-data-parallel + kserve-config-llm-decode-worker-data-parallel |

Router configs

Router configs are injected independently and can combine with any workload config above.

| Your spec has... | Well-known config injected |
| --- | --- |
| router.scheduler without external pool ref | kserve-config-llm-scheduler |
| router.route without external route refs | kserve-config-llm-router-route |

A single-node deployment with a managed scheduler and route therefore receives three configs: kserve-config-llm-template, kserve-config-llm-scheduler, and kserve-config-llm-router-route.
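
The injection rules in this section can be sketched as a small Python function. This is an illustration of the two lookup tables, not the controller's code. The four boolean parameters are hypothetical summaries of the spec shape; how the controller actually detects data parallelism or external refs is not shown here.

```python
def select_well_known_configs(has_prefill, worker_data_parallel,
                              managed_scheduler, managed_route):
    """Return the well-known config names for a given spec shape.

    All four parameters are hypothetical booleans summarizing the spec:
      has_prefill          - spec.prefill is defined
      worker_data_parallel - spec.worker is set with data parallelism
      managed_scheduler    - router.scheduler is set without an external pool ref
      managed_route        - router.route is set without external route refs
    """
    suffix = "worker-data-parallel" if worker_data_parallel else "template"
    if has_prefill:
        # Disaggregated: one config per side (prefill and decode).
        configs = [f"kserve-config-llm-prefill-{suffix}",
                   f"kserve-config-llm-decode-{suffix}"]
    else:
        configs = [f"kserve-config-llm-{suffix}"]
    # Router configs are independent of the workload topology.
    if managed_scheduler:
        configs.append("kserve-config-llm-scheduler")
    if managed_route:
        configs.append("kserve-config-llm-router-route")
    return configs


single_node = select_well_known_configs(False, False, True, True)
disaggregated_dp = select_well_known_configs(True, True, False, False)
print(single_node)
print(disaggregated_dp)
```

Keeping the workload choice and the router choices in separate branches reflects that the two criteria are evaluated independently.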


Strategic Merge Patch Behavior

The controller uses Kubernetes strategicpatch.StrategicMergePatch to combine configs. Here is how the merge works at the field level:

  • Non-zero fields from the override are applied to the base. Existing base fields that the override does not mention are left untouched.
  • Zero-valued fields (empty string "", 0, nil, false) in the override do not overwrite base values. This prevents a config that does not specify a port from wiping out the well-known config's port.
  • Container lists are merged by name. The main container from different sources merges into a single main container rather than creating duplicates. Other list fields may replace entirely depending on their strategic merge patch annotations.

Example: Adding Resources Without Losing Existing Fields

# Base (from well-known config)
template:
  containers:
    - name: main
      image: ghcr.io/llm-d/llm-d-cuda:v0.6.0
      ports:
        - containerPort: 8000

# Override (from baseRef) - only specifies resources
template:
  containers:
    - name: main
      resources:
        limits:
          nvidia.com/gpu: "1"

# Result - image and ports preserved, resources added
template:
  containers:
    - name: main
      image: ghcr.io/llm-d/llm-d-cuda:v0.6.0 # preserved from base
      ports:
        - containerPort: 8000 # preserved from base
      resources:
        limits:
          nvidia.com/gpu: "1" # added from override

Tip: When designing baseRefs, only include the fields you want to set or override. There is no need to repeat fields from the well-known config - they will be preserved through the merge.
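
The field-level rules in this section can be approximated with a short Python sketch. It is a simplified model, not the Kubernetes strategicpatch package: it assumes that any list whose items all carry a `name` key is merged by name and that every other list is replaced, whereas the real behavior depends on each field's merge annotations.

```python
def is_zero(value):
    """Treat None, "", 0, and False as zero values."""
    return value is None or (isinstance(value, (str, int, float, bool)) and not value)


def is_named_list(items):
    """True for a non-empty list of dicts that all have a 'name' key."""
    return (isinstance(items, list) and bool(items)
            and all(isinstance(i, dict) and "name" in i for i in items))


def merge_named(base, override):
    """Merge two name-keyed lists, keeping base order and appending new names."""
    merged = {item["name"]: item for item in base}
    for item in override:
        name = item["name"]
        merged[name] = merge(merged[name], item) if name in merged else item
    return list(merged.values())


def merge(base, override):
    """Apply override on top of base following the simplified rules."""
    if isinstance(base, dict) and isinstance(override, dict):
        result = dict(base)
        for key, value in override.items():
            if is_zero(value):
                continue  # zero values never overwrite the base
            result[key] = merge(base[key], value) if key in base else value
        return result
    if is_named_list(base) and is_named_list(override):
        return merge_named(base, override)
    return override  # scalars and other lists: the override replaces the base


base = {"template": {"containers": [{
    "name": "main",
    "image": "ghcr.io/llm-d/llm-d-cuda:v0.6.0",
    "ports": [{"containerPort": 8000}],
}]}}
override = {"template": {"containers": [{
    "name": "main",
    "resources": {"limits": {"nvidia.com/gpu": "1"}},
}]}}

result = merge(base, override)
main = result["template"]["containers"][0]
print(main)
```

Running the sketch on the example above yields a single `main` container that keeps its image and ports and gains the resource limits.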


Config Namespace Resolution

For each config reference (both well-known and baseRef), the controller looks up the LLMInferenceServiceConfig in this order:

  1. The LLMInferenceService's own namespace (highest priority)
  2. The KServe system namespace (typically kserve)
  3. If not found in either namespace, the controller sets the PresetsCombined condition to False with reason ConfigNotFound
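
The lookup order above can be sketched in Python as follows. The `configs` dictionary is a stand-in for the cluster, and `ConfigNotFound` is a hypothetical exception that mirrors the condition reason; the controller itself reports the failure through the PresetsCombined condition rather than raising an error.

```python
SYSTEM_NAMESPACE = "kserve"  # assumed default; the system namespace is "typically kserve"


class ConfigNotFound(LookupError):
    """Hypothetical error mirroring the ConfigNotFound condition reason."""


def resolve_config(name, service_namespace, configs,
                   system_namespace=SYSTEM_NAMESPACE):
    """Return (namespace, config) from the first namespace that holds `name`.

    `configs` is a dict keyed by (namespace, name), standing in for the
    LLMInferenceServiceConfig resources in the cluster.
    """
    for namespace in (service_namespace, system_namespace):
        if (namespace, name) in configs:
            return namespace, configs[(namespace, name)]
    raise ConfigNotFound(
        f"{name} not found in {service_namespace} or {system_namespace}")


cluster = {
    ("kserve", "my-gpu-profile"): {"gpu": "A100"},
    ("my-team", "my-gpu-profile"): {"gpu": "H100"},
}

local = resolve_config("my-gpu-profile", "my-team", cluster)
shared = resolve_config("my-gpu-profile", "other-team", cluster)
print(local, shared)
```

A service in `my-team` picks up the local profile, while a service in a namespace without its own copy falls back to the shared one.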

Practical Implications

Platform admins ship shared configs in the kserve namespace. These are available to all services across the cluster.

Teams can create a same-name config in their own namespace to override the shared version. For example, if the kserve namespace contains my-gpu-profile with A100 settings, a team namespace can define its own my-gpu-profile with H100 settings - the local version takes precedence for services in that namespace.

Debugging config resolution: The status.appliedConfigs field records which configs were actually used and from where. Each entry is tagged with a source field:

  • Preset - well-known config auto-injected by the controller
  • UserRef - config referenced via spec.baseRefs

status:
  appliedConfigs:
    - name: kserve-config-llm-template
      namespace: kserve
      source: Preset
    - name: my-gpu-profile
      namespace: my-team
      source: UserRef

Note: See the Status Reference for details on the PresetsCombined condition, appliedConfigs field, and troubleshooting config resolution failures.
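
When debugging, it can help to summarize status.appliedConfigs and see at a glance whether each config came from the service's own namespace or from a shared one. The helper below is a hypothetical sketch that works on the status as a plain Python dictionary (for example, parsed from the resource's YAML or JSON); it uses only the name, namespace, and source fields shown above.

```python
def describe_applied_configs(status, service_namespace):
    """Return one human-readable line per entry in status.appliedConfigs."""
    lines = []
    for entry in status.get("appliedConfigs", []):
        # "local" means the config was found in the service's own namespace.
        origin = "local" if entry["namespace"] == service_namespace else "shared"
        lines.append(
            f'{entry["source"]}: {entry["name"]} '
            f'({origin}, namespace {entry["namespace"]})')
    return lines


status = {"appliedConfigs": [
    {"name": "kserve-config-llm-template", "namespace": "kserve", "source": "Preset"},
    {"name": "my-gpu-profile", "namespace": "my-team", "source": "UserRef"},
]}

lines = describe_applied_configs(status, "my-team")
for line in lines:
    print(line)
```

A `local` entry for a name that also exists in the system namespace indicates that a team-level config is shadowing the shared version.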


Well-Known Config Reference

The controller ships with pre-installed LLMInferenceServiceConfig resources in the KServe system namespace. Each is injected automatically when the corresponding spec pattern is detected (see Config Injection).

| Config Name | Injected When | What It Sets Up |
| --- | --- | --- |
| kserve-config-llm-template | Single-node (no prefill, no worker) | vLLM container, probes, volumes, TLS, security context |
| kserve-config-llm-worker-data-parallel | Multi-node + DataParallel | Leader and worker templates, DP addressing, shared memory |
| kserve-config-llm-prefill-template | Disaggregated prefill (single-node) | Prefill container |
| kserve-config-llm-decode-template | Disaggregated decode (single-node) | Decode container, routing sidecar |
| kserve-config-llm-prefill-worker-data-parallel | Disaggregated prefill + multi-node DP | Multi-node prefill with DP addressing |
| kserve-config-llm-decode-worker-data-parallel | Disaggregated decode + multi-node DP | Multi-node decode with DP and routing sidecar |
| kserve-config-llm-scheduler | router.scheduler with inline pool | Endpoint Picker (EPP) deployment, tokenizer sidecar, InferencePool |
| kserve-config-llm-router-route | router.route without external route refs | HTTPRoute with path-based and model-header routing, URLRewrite, catch-all rules |

Next Steps