Version: Next

LLMInferenceService Config Composition

A full LLM inference deployment touches container images, probes, security contexts, scheduler settings, routing rules, resource limits, and more. Most users should not have to care about all of that. KServe ships defaults that cover common deployments out of the box - a working service needs only a model URI, a reference to a hardware profile, and a few feature toggles such as routing and scheduling. When you do need to change something - a different GPU type, a custom routing policy, a longer startup probe - you override only that field, and the defaults you did not touch stay in place.

When you create an LLMInferenceService, the controller does not apply your spec directly. Instead, it builds an effective configuration by merging multiple sources using Kubernetes strategic merge patch. This process - config composition - determines the final shape of the resources that serve your model.

Understanding composition helps when you need to debug unexpected behavior, override a default, or design reusable config fragments for your team.

Prerequisites: Read the LLMInferenceService overview and the configuration guide first.

Configuration Sources and Merge Order

Three types of configuration participate in the merge, each with a different owner and priority:

Source	Owner	Purpose	Priority
Well-known configs	Platform (shipped with KServe)	Auto-injected by the controller based on the spec shape. Set up the llm-d stack: vLLM container, scheduler, routes, probes, volumes, sidecars.	Lowest
User `baseRefs`	User or admin	`LLMInferenceServiceConfig` resources referenced via `spec.baseRefs`. Hardware profiles (GPU types, nodeSelectors, images), org-specific defaults.	Middle (ordered)
`LLMInferenceService` spec	User	The service itself. Model URI, replicas, field overrides.	Highest

All selected configs are merged in a fixed order. Each step applies a Kubernetes strategic merge patch on top of the previous result. Later values override earlier ones.

Well-known configs are not created by users. They are installed as part of KServe and automatically selected by the controller based on the spec shape (see Config Injection below). User baseRefs are LLMInferenceServiceConfig resources that you or a platform admin create and reference in spec.baseRefs. Multiple baseRefs are merged in list order, so later entries override earlier ones. The LLMInferenceService spec itself is applied last and always wins.
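
The merge order can be pictured as a left-to-right fold over the sources. The following is a minimal Python sketch, not the controller's implementation: `deep_merge` and `compose` are hypothetical helpers, and a plain recursive dictionary merge stands in for the strategic merge patch the controller actually uses.

```python
from functools import reduce


def deep_merge(base, override):
    """Return a copy of base with override layered on top (nested dicts merge key by key)."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def compose(well_known, base_refs, spec):
    """Fold sources in priority order: well-known < baseRefs (list order) < spec."""
    return reduce(deep_merge, [*well_known, *base_refs, spec], {})


# Illustrative fragments only; real configs carry many more fields.
well_known = [{"router": {"scheduler": {"pool": {"spec": {"targetPort": 8000}}}}}]
base_refs = [{
    "replicas": 2,
    "template": {"nodeSelector": {"nvidia.com/gpu.product": "NVIDIA-A100-PCIE-40G"}},
}]
spec = {"replicas": 3}

effective = compose(well_known, base_refs, spec)
print(effective["replicas"])  # the spec value wins over the baseRef value
```

Because the spec is folded in last, its `replicas` value replaces the one from the baseRef, while fields that only earlier sources set pass through unchanged.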

Example: Single-Node with Hardware baseRef

Here is what a typical deployment looks like. The user writes a short LLMInferenceService that references a hardware profile via baseRefs, picks a model, enables routing and scheduling - and overrides the profile's default replica count from 2 to 3. The controller takes care of the rest.

User provides:

apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: llama-3-8b
  namespace: my-team
spec:
  baseRefs:
    - name: my-gpu-profile
  model:
    uri: hf://meta-llama/Llama-3.1-8B-Instruct
    name: meta-llama/Llama-3.1-8B-Instruct
  replicas: 3  # overrides baseRef's replicas: 2
  router:
    route: {}
    scheduler: {}

Controller resolves:

The spec has no prefill and no worker - single-node deployment. The controller injects kserve-config-llm-template.
The spec has router.scheduler defined without an external pool ref - the controller injects kserve-config-llm-scheduler.
The spec has router.route defined without external route refs - the controller injects kserve-config-llm-router-route.
The baseRefs reference my-gpu-profile - the controller fetches it from the kserve namespace (not found in my-team).

The sources that get merged (well-known configs are auto-injected, baseRef is resolved from the kserve namespace):

kserve-config-llm-template

template:
  containers:
    - name: main
      image: ghcr.io/llm-d/llm-d-cuda:v0.6.0
      ports:
        - containerPort: 8000
      livenessProbe: [...]
      readinessProbe: [...]
      startupProbe: [...]
      securityContext: [...]
  volumes: [...]

kserve-config-llm-scheduler

router:
  scheduler:
    pool:
      spec:
        selector: [...]
        targetPort: 8000
    template:
      containers:
        - name: epp
          image: llm-d-inference-scheduler
          ports: [9002]
        - name: tokenizer
          image: llm-d-uds-tokenizer
          ports: [8082]

kserve-config-llm-router-route

router:
  route:
    http:
      spec:
        parentRefs:  # from KServe ingress config
          - kind: Gateway
        rules: [...]  # 8 rules total
          # path + model-header per endpoint
          # URLRewrite filters
          # catch-all -> Service

baseRef: my-gpu-profile

replicas: 2
template:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-PCIE-40G
  containers:
    - name: main
      resources:
        requests:
          nvidia.com/gpu: "4"
          cpu: "8"
          memory: 64Gi
        limits:
          nvidia.com/gpu: "4"
          cpu: "16"
          memory: 128Gi

LLMInferenceService spec

model:
  uri: hf://meta-llama/Llama-3.1-8B-Instruct
  name: meta-llama/Llama-3.1-8B-Instruct
replicas: 3
router:
  route: {}
  scheduler: {}

Effective merged result:

Effective merged configuration

model:
  uri: hf://meta-llama/Llama-3.1-8B-Instruct
  name: meta-llama/Llama-3.1-8B-Instruct
replicas: 3  # spec overrides baseRef (was: 2)
template:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-PCIE-40G
  containers:
    - name: main
      image: ghcr.io/llm-d/llm-d-cuda:v0.6.0
      ports:
        - containerPort: 8000
      resources:
        requests:
          nvidia.com/gpu: "4"
          cpu: "8"
          memory: 64Gi
        limits:
          nvidia.com/gpu: "4"
          cpu: "16"
          memory: 128Gi
      livenessProbe:
        httpGet:
          path: /health
          port: 8000
        periodSeconds: 10
      readinessProbe:
        httpGet:
          path: /health
          port: 8000
      startupProbe:
        httpGet:
          path: /health
          port: 8000
        failureThreshold: 120
        periodSeconds: 5
      securityContext:
        readOnlyRootFilesystem: true
        runAsNonRoot: true
  volumes:
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: 1Gi
    - name: tmp
      emptyDir: {}
router:
  scheduler:
    pool:
      spec:
        selector: [...]
        targetPort: 8000
    template:
      containers:
        - name: epp
          image: llm-d-inference-scheduler
          ports: [9002]
        - name: tokenizer
          image: llm-d-uds-tokenizer
          ports: [8082]
  route:
    http:
      spec:
        parentRefs:  # from KServe ingress config
          - kind: Gateway
        rules: [...]  # 8 rules: path + model-header per endpoint, catch-all


Config Injection

The controller inspects the LLMInferenceService spec and auto-injects well-known configs based on two independent criteria: the deployment pattern and the router components you enable.

Workload configs

The deployment pattern determines which workload config is injected. The controller looks at two fields - spec.prefill and spec.worker - to determine the topology:

Single-node: Neither prefill nor worker is set. One Deployment runs the model on a single pod. This is the simplest topology, suitable for smaller models that fit on one node.
Multi-node (data parallel): worker is set with data parallelism. A LeaderWorkerSet distributes inference across multiple nodes, each running a shard of the model. Used when a model is too large for a single node or when you need higher throughput.
Disaggregated (prefill-decode): prefill is set. Prompt processing (prefill) and token generation (decode) run as separate workloads with independent scaling. This allows heterogeneous hardware - high-FLOPS GPUs for prefill, high-bandwidth GPUs for decode. Each side can also run multi-node with worker + data parallelism.

The combinations in the table below are mutually exclusive: exactly one of them matches any given spec.

Your spec has...	Topology	Well-known config(s) injected
No `prefill`, no `worker`	Single-node	`kserve-config-llm-template`
No `prefill`, `worker` + DataParallel	Multi-node	`kserve-config-llm-worker-data-parallel`
`prefill` defined, no `worker`	Disaggregated, single-node each	`kserve-config-llm-prefill-template` + `kserve-config-llm-decode-template`
`prefill` defined, `worker` + DataParallel	Disaggregated, multi-node each	`kserve-config-llm-prefill-worker-data-parallel` + `kserve-config-llm-decode-worker-data-parallel`
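
The table above amounts to a two-flag decision. A minimal Python sketch of that decision follows; `select_workload_configs` is a hypothetical helper, and as a simplification it treats the presence of `worker` as implying data parallelism, since how data parallelism is declared in the spec is not covered here.

```python
def select_workload_configs(spec):
    """Map the spec shape to the well-known workload config names from the table.

    Simplification (assumption): a present `worker` field is taken to mean
    multi-node with data parallelism.
    """
    has_prefill = "prefill" in spec
    has_worker = "worker" in spec

    if has_prefill and has_worker:
        return [
            "kserve-config-llm-prefill-worker-data-parallel",
            "kserve-config-llm-decode-worker-data-parallel",
        ]
    if has_prefill:
        return [
            "kserve-config-llm-prefill-template",
            "kserve-config-llm-decode-template",
        ]
    if has_worker:
        return ["kserve-config-llm-worker-data-parallel"]
    return ["kserve-config-llm-template"]


print(select_workload_configs({"model": {}}))
print(select_workload_configs({"prefill": {}, "worker": {}}))
```

Exactly one branch fires for any spec, which mirrors the mutual exclusivity of the table rows.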

Router configs

Router configs are injected independently and can combine with any workload config above.

Your spec has...	Well-known config injected
`router.scheduler` without external pool ref	`kserve-config-llm-scheduler`
`router.route` without external route refs	`kserve-config-llm-router-route`

A single-node deployment with a managed scheduler and route will inject three configs: kserve-config-llm-template, kserve-config-llm-scheduler, and kserve-config-llm-router-route.
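
Router injection can be sketched the same way. In the Python sketch below, `select_router_configs` is a hypothetical helper, and the two boolean parameters stand in for "the spec references an external pool" and "the spec references external routes"; the actual spec fields that express those references are not modeled.

```python
def select_router_configs(router, external_pool=False, external_routes=False):
    """Return the well-known router configs to inject for a given `router` section.

    external_pool / external_routes are placeholders (assumptions) for whatever
    the spec uses to point at an existing pool or existing routes.
    """
    names = []
    if "scheduler" in router and not external_pool:
        names.append("kserve-config-llm-scheduler")
    if "route" in router and not external_routes:
        names.append("kserve-config-llm-router-route")
    return names


# Managed scheduler and managed route, as in the single-node example.
print(select_router_configs({"route": {}, "scheduler": {}}))
```

The two checks are independent of each other and of the workload topology, so the result can be appended to whichever workload configs were selected.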

Strategic Merge Patch Behavior

The controller uses Kubernetes strategicpatch.StrategicMergePatch to combine configs. Here is how the merge works at the field level:

Non-zero fields from the override are applied to the base. Existing base fields that the override does not mention are left untouched.
Zero-valued fields (empty string "", 0, nil, false) in the override do not overwrite base values. This prevents a config that does not specify a port from wiping out the well-known config's port.
Container lists are merged by name. The main container from different sources merges into a single main container rather than creating duplicates. Other list fields may replace entirely depending on their strategic merge patch annotations.
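
The three rules above can be approximated in a few lines of Python. This is an illustrative sketch under stated assumptions, not the Kubernetes `strategicpatch` implementation: `is_zero`, `merge_by_name`, and `smp_merge` are hypothetical helpers, only `containers` is merged by name, and every other list is replaced wholesale.

```python
def is_zero(value):
    """Zero values as listed above: empty string, 0, nil (None), false."""
    return value is None or (isinstance(value, (str, bool, int, float)) and not value)


def merge_by_name(base_list, override_list):
    """Merge list items that share a `name`; append items with new names."""
    merged = [dict(item) for item in base_list]
    index = {item["name"]: i for i, item in enumerate(merged)}
    for item in override_list:
        if item["name"] in index:
            position = index[item["name"]]
            merged[position] = smp_merge(merged[position], item)
        else:
            index[item["name"]] = len(merged)
            merged.append(item)
    return merged


def smp_merge(base, override):
    """Apply non-zero override fields onto base, leaving unmentioned fields alone."""
    result = dict(base)
    for key, value in override.items():
        if is_zero(value):
            continue  # zero-valued override fields never wipe out base values
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = smp_merge(result[key], value)
        elif key == "containers" and isinstance(result.get(key), list):
            result[key] = merge_by_name(result[key], value)
        else:
            result[key] = value
    return result


base = {"template": {"containers": [{
    "name": "main",
    "image": "ghcr.io/llm-d/llm-d-cuda:v0.6.0",
    "ports": [{"containerPort": 8000}],
}]}}
override = {"template": {"containers": [{
    "name": "main",
    "resources": {"limits": {"nvidia.com/gpu": "1"}},
}]}}

result = smp_merge(base, override)
print(result["template"]["containers"])
```

The override names only `resources`, so the single `main` container in the result keeps its image and ports and gains the GPU limit.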

Example: Adding Resources Without Losing Existing Fields

# Base (from well-known config)
template:
  containers:
    - name: main
      image: ghcr.io/llm-d/llm-d-cuda:v0.6.0
      ports:
        - containerPort: 8000

# Override (from baseRef) - only specifies resources
template:
  containers:
    - name: main
      resources:
        limits:
          nvidia.com/gpu: "1"

# Result - image and ports preserved, resources added
template:
  containers:
    - name: main
      image: ghcr.io/llm-d/llm-d-cuda:v0.6.0   # preserved from base
      ports:
        - containerPort: 8000                     # preserved from base
      resources:
        limits:
          nvidia.com/gpu: "1"                     # added from override

tip

When designing baseRefs, only include the fields you want to set or override. There is no need to repeat fields from the well-known config - they will be preserved through the merge.

Config Namespace Resolution

For each config reference (both well-known and baseRef), the controller looks up the LLMInferenceServiceConfig in this order:

The LLMInferenceService's own namespace (highest priority)
The KServe system namespace (typically kserve)
If not found in either namespace, the controller sets the PresetsCombined condition to False with reason ConfigNotFound
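
The lookup order above can be sketched as a two-step search. In this Python sketch, `resolve_config` and the `ConfigNotFound` exception are hypothetical stand-ins: the real controller reports the failure through the PresetsCombined condition rather than raising, and the in-memory `configs` dictionary replaces the Kubernetes API lookup.

```python
class ConfigNotFound(LookupError):
    """Stand-in for the ConfigNotFound reason on the PresetsCombined condition."""


def resolve_config(name, service_namespace, configs, system_namespace="kserve"):
    """Look in the service's own namespace first, then in the system namespace.

    `configs` maps (namespace, name) to a config body; it is an illustrative
    substitute for querying the cluster.
    """
    for namespace in (service_namespace, system_namespace):
        config = configs.get((namespace, name))
        if config is not None:
            return namespace, config
    raise ConfigNotFound(f"{name} not found in {service_namespace} or {system_namespace}")


configs = {("kserve", "my-gpu-profile"): {"gpu": "A100"}}
print(resolve_config("my-gpu-profile", "my-team", configs))  # falls back to kserve

# A same-name config in the team namespace takes precedence.
configs[("my-team", "my-gpu-profile")] = {"gpu": "H100"}
print(resolve_config("my-gpu-profile", "my-team", configs))
```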

Practical Implications

Platform admins ship shared configs in the kserve namespace. These are available to all services across the cluster.

Teams can create a same-name config in their own namespace to override the shared version. For example, if the kserve namespace contains my-gpu-profile with A100 settings, a team namespace can define its own my-gpu-profile with H100 settings - the local version takes precedence for services in that namespace.

Debugging config resolution: The status.appliedConfigs field records which configs were actually used and from where. Each entry is tagged with a source field:

Preset - well-known config auto-injected by the controller
UserRef - config referenced via spec.baseRefs

status:
  appliedConfigs:
    - name: kserve-config-llm-template
      namespace: kserve
      source: Preset
    - name: my-gpu-profile
      namespace: my-team
      source: UserRef
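
When debugging, it can help to reduce `status.appliedConfigs` to a short summary of where each config came from. The Python sketch below assumes the status has already been loaded into a dictionary (for example from JSON output of the resource); `summarize_applied_configs` is a hypothetical helper, and the "local" and "shared" labels are illustrative terms rather than API values.

```python
def summarize_applied_configs(status, service_namespace):
    """Return (name, kind, scope) for each entry in status.appliedConfigs.

    kind:  "well-known" for source Preset, "baseRef" for source UserRef
    scope: "local" if resolved from the service's namespace, else "shared"
    """
    summary = []
    for entry in status.get("appliedConfigs", []):
        kind = "well-known" if entry["source"] == "Preset" else "baseRef"
        scope = "local" if entry["namespace"] == service_namespace else "shared"
        summary.append((entry["name"], kind, scope))
    return summary


status = {"appliedConfigs": [
    {"name": "kserve-config-llm-template", "namespace": "kserve", "source": "Preset"},
    {"name": "my-gpu-profile", "namespace": "my-team", "source": "UserRef"},
]}

for name, kind, scope in summarize_applied_configs(status, "my-team"):
    print(f"{name}: {kind}, {scope}")
```

A baseRef reported as "local" where a "shared" one was expected is a hint that a same-name config in the team namespace is shadowing the platform version.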

note

See the Status Reference for details on the PresetsCombined condition, appliedConfigs field, and troubleshooting config resolution failures.

Well-Known Config Reference

The controller ships with pre-installed LLMInferenceServiceConfig resources in the KServe system namespace. Each is injected automatically when the corresponding spec pattern is detected (see Config Injection).

Config Name	Injected When	What It Sets Up
`kserve-config-llm-template`	Single-node (no prefill, no worker)	vLLM container, probes, volumes, TLS, security context
`kserve-config-llm-worker-data-parallel`	Multi-node + DataParallel	Leader and worker templates, DP addressing, shared memory
`kserve-config-llm-prefill-template`	Disaggregated prefill (single-node)	Prefill container
`kserve-config-llm-decode-template`	Disaggregated decode (single-node)	Decode container, routing sidecar
`kserve-config-llm-prefill-worker-data-parallel`	Disaggregated prefill + multi-node DP	Multi-node prefill with DP addressing
`kserve-config-llm-decode-worker-data-parallel`	Disaggregated decode + multi-node DP	Multi-node decode with DP and routing sidecar
`kserve-config-llm-scheduler`	`router.scheduler` without external pool ref	Endpoint Picker (EPP) deployment, tokenizer sidecar, InferencePool
`kserve-config-llm-router-route`	`router.route` without external route refs	HTTPRoute with path-based and model-header routing, URLRewrite, catch-all rules

Next Steps

Configuration Guide: Full field reference for LLMInferenceService spec
Status Reference: Understanding conditions, appliedConfigs, and troubleshooting
Architecture Guide: How the controller processes these configs during reconciliation
