Version: Next

WVA Autoscaling Examples

This page provides complete, ready-to-use YAML manifests for common WVA autoscaling configurations. For an overview of WVA autoscaling concepts, configuration options, and prerequisites, see the Autoscaling Guide.

HPA with Deployment

Single-node deployment with HPA-based autoscaling and custom scaling behavior:

apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: llama-hpa
  namespace: default
spec:
  model:
    uri: hf://meta-llama/Llama-3.1-8B-Instruct
    name: meta-llama/Llama-3.1-8B-Instruct

  scaling:
    minReplicas: 1
    maxReplicas: 5
    wva:
      variantCost: "10.0"
      hpa:
        behavior:
          scaleUp:
            stabilizationWindowSeconds: 0
          scaleDown:
            stabilizationWindowSeconds: 300

  template:
    containers:
      - name: main
        image: vllm/vllm-openai:latest
        resources:
          limits:
            nvidia.com/gpu: "1"
            cpu: "8"
            memory: 32Gi

  router:
    gateway: {}
    route: {}
    scheduler: {}

Verify the created resources:

# Check VariantAutoscaling
kubectl get variantautoscalings llama-hpa-kserve-va

# Check HPA
kubectl get hpa llama-hpa-kserve-hpa

# Check scaling status
kubectl get llminferenceservice llama-hpa -o jsonpath='{.status.conditions[?(@.type=="ScalingReady")]}'

KEDA with Deployment

Single-node deployment with KEDA-based autoscaling, including idle scale-down and metric fallback:

apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: llama-keda
  namespace: default
spec:
  model:
    uri: hf://meta-llama/Llama-3.1-8B-Instruct
    name: meta-llama/Llama-3.1-8B-Instruct

  scaling:
    minReplicas: 2
    maxReplicas: 8
    wva:
      variantCost: "10.0"
      keda:
        pollingInterval: 5
        cooldownPeriod: 120
        initialCooldownPeriod: 60
        idleReplicaCount: 1
        fallback:
          failureThreshold: 3
          replicas: 2

  template:
    containers:
      - name: main
        image: vllm/vllm-openai:latest
        resources:
          limits:
            nvidia.com/gpu: "1"
            cpu: "8"
            memory: 32Gi

  router:
    gateway: {}
    route: {}
    scheduler: {}

Verify the created resources:

# Check VariantAutoscaling
kubectl get variantautoscalings llama-keda-kserve-va

# Check KEDA ScaledObject
kubectl get scaledobjects llama-keda-kserve-keda

# Check scaling status
kubectl get llminferenceservice llama-keda -o jsonpath='{.status.conditions[?(@.type=="ScalingReady")]}'

Multi-Node (LeaderWorkerSet) with HPA

For multi-node deployments using LeaderWorkerSet, WVA scales the LWS resource directly:

apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: llama-70b-multinode
  namespace: default
spec:
  model:
    uri: hf://meta-llama/Llama-2-70b-hf
    name: meta-llama/Llama-2-70b-hf

  parallelism:
    tensor: 4
    data: 8
    dataLocal: 4

  scaling:
    minReplicas: 1
    maxReplicas: 4
    wva:
      hpa: {}

  template:
    containers:
      - name: main
        image: vllm/vllm-openai:latest
        args:
          - "--model"
          - "/mnt/models"
          - "--tensor-parallel-size"
          - "4"
        resources:
          limits:
            nvidia.com/gpu: "4"
            cpu: "16"
            memory: 128Gi

  worker:
    containers:
      - name: main
        image: vllm/vllm-openai:latest
        args:
          - "--model"
          - "/mnt/models"
          - "--tensor-parallel-size"
          - "4"
        resources:
          limits:
            nvidia.com/gpu: "4"
            cpu: "16"
            memory: 128Gi

  router:
    gateway: {}
    route: {}
    scheduler: {}

note

When spec.worker is present, WVA targets the LeaderWorkerSet resource instead of a Deployment.

Multi-Node (LeaderWorkerSet) with KEDA

Multi-node deployment with KEDA-based autoscaling, leveraging idle scale-down and initial cooldown for large model loading:

apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: llama-70b-multinode-keda
  namespace: default
spec:
  model:
    uri: hf://meta-llama/Llama-2-70b-hf
    name: meta-llama/Llama-2-70b-hf

  parallelism:
    tensor: 4
    data: 8
    dataLocal: 4

  scaling:
    minReplicas: 2
    maxReplicas: 6
    wva:
      variantCost: "20.0"
      keda:
        pollingInterval: 10
        cooldownPeriod: 300
        initialCooldownPeriod: 120
        idleReplicaCount: 1
        fallback:
          failureThreshold: 3
          replicas: 2

  template:
    containers:
      - name: main
        image: vllm/vllm-openai:latest
        args:
          - "--model"
          - "/mnt/models"
          - "--tensor-parallel-size"
          - "4"
        resources:
          limits:
            nvidia.com/gpu: "4"
            cpu: "16"
            memory: 128Gi

  worker:
    containers:
      - name: main
        image: vllm/vllm-openai:latest
        args:
          - "--model"
          - "/mnt/models"
          - "--tensor-parallel-size"
          - "4"
        resources:
          limits:
            nvidia.com/gpu: "4"
            cpu: "16"
            memory: 128Gi

  router:
    gateway: {}
    route: {}
    scheduler: {}

Verify the created resources:

# Check VariantAutoscaling
kubectl get variantautoscalings llama-70b-multinode-keda-kserve-va

# Check KEDA ScaledObject (targets LeaderWorkerSet)
kubectl get scaledobjects llama-70b-multinode-keda-kserve-keda

# Check scaling status
kubectl get llminferenceservice llama-70b-multinode-keda -o jsonpath='{.status.conditions[?(@.type=="ScalingReady")]}'

tip

For large multi-node models, set initialCooldownPeriod to account for model loading time across all nodes. This prevents KEDA from making premature scaling decisions before the model is ready to serve traffic.

HPA with Deployment​

KEDA with Deployment​

Multi-Node (LeaderWorkerSet) with HPA​

Multi-Node (LeaderWorkerSet) with KEDA​

HPA with Deployment

KEDA with Deployment

Multi-Node (LeaderWorkerSet) with HPA

Multi-Node (LeaderWorkerSet) with KEDA