
↖️ Click the bullet-list icon at the top left corner of the README view for the GitHub-generated table of contents.

descheduler

Descheduler for Kubernetes

Scheduling in Kubernetes is the process of binding pending pods to nodes, and is performed by a component of Kubernetes called kube-scheduler. The scheduler's decisions, whether or where a pod can or cannot be scheduled, are guided by its configurable policy, which comprises a set of rules called predicates and priorities. The scheduler's decisions are influenced by its view of the Kubernetes cluster at the point in time when a new pod appears for scheduling. Because Kubernetes clusters are very dynamic and their state changes over time, there may be a desire to move already running pods to other nodes for various reasons:

  • Some nodes are under or over utilized.
  • The original scheduling decision no longer holds true, because taints or labels have been added to or removed from nodes, or pod/node affinity requirements are no longer satisfied.
  • Some nodes failed and their pods moved to other nodes.
  • New nodes are added to clusters.

Consequently, there might be several pods scheduled on less desirable nodes in a cluster. The descheduler, based on its policy, finds pods that can be moved and evicts them. Please note that in the current implementation the descheduler does not schedule replacements for evicted pods but relies on the default scheduler for that.

⚠️ Documentation Versions by Release

If you are using a published release of Descheduler (such as registry.k8s.io/descheduler/descheduler:v0.26.1), follow the documentation in that version's release branch, as listed below:

Descheduler Version Docs link
v0.27.x release-1.27
v0.26.x release-1.26
v0.25.x release-1.25
v0.24.x release-1.24

The master branch is considered in-development and the information presented in it may not work for previous versions.

Quick Start

The descheduler can be run as a Job, CronJob, or Deployment inside a k8s cluster. This has the advantage that it can be run multiple times without needing user intervention. The descheduler pod is run as a critical pod in the kube-system namespace to avoid being evicted by itself or by the kubelet.

Run As A Job

kubectl create -f kubernetes/base/rbac.yaml
kubectl create -f kubernetes/base/configmap.yaml
kubectl create -f kubernetes/job/job.yaml

Run As A CronJob

kubectl create -f kubernetes/base/rbac.yaml
kubectl create -f kubernetes/base/configmap.yaml
kubectl create -f kubernetes/cronjob/cronjob.yaml

Run As A Deployment

kubectl create -f kubernetes/base/rbac.yaml
kubectl create -f kubernetes/base/configmap.yaml
kubectl create -f kubernetes/deployment/deployment.yaml

Install Using Helm

Starting with release v0.18.0 there is an official helm chart that can be used to install the descheduler. See the helm chart README for detailed instructions.

The descheduler helm chart is also listed on the artifact hub.
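
For reference, a typical Helm-based install looks roughly like the following. The repository URL, release name, and namespace here are assumptions, so treat this as a sketch and consult the helm chart README for the authoritative commands:

# add the descheduler chart repository (URL assumed from the chart's usual publication location)
helm repo add descheduler https://kubernetes-sigs.github.io/descheduler/
helm repo update
# install the chart into kube-system under the release name "descheduler"
helm install descheduler descheduler/descheduler --namespace kube-system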

Install Using Kustomize

You can use kustomize to install the descheduler. See the Kustomize resources documentation for detailed instructions.

Run As A Job

kustomize build 'github.com/kubernetes-sigs/descheduler/kubernetes/job?ref=v0.26.1' | kubectl apply -f -

Run As A CronJob

kustomize build 'github.com/kubernetes-sigs/descheduler/kubernetes/cronjob?ref=v0.26.1' | kubectl apply -f -

Run As A Deployment

kustomize build 'github.com/kubernetes-sigs/descheduler/kubernetes/deployment?ref=v0.26.1' | kubectl apply -f -

User Guide

See the user guide in the /docs directory.

Policy, Default Evictor and Strategy plugins

⚠️ v1alpha1 configuration is still supported, but deprecated (and will soon be removed). Please consider migrating to v1alpha2 (described below). For the previous v1alpha1 documentation, go to docs/deprecated/v1alpha1.md ⚠️

The Descheduler Policy is configurable and includes default strategy plugins that can be enabled or disabled. It includes a common eviction configuration at the top level, as well as configuration from the Evictor plugin (Default Evictor, if not specified otherwise). Top-level configuration and Evictor plugin configuration are applied to all evictions.

Top Level configuration

These are top level keys in the Descheduler Policy that you can use to configure all evictions.

Name | Type | Default Value | Description
nodeSelector | string | nil | limiting the nodes which are processed. Only used when nodeFit=true and only by the PreEvictionFilter Extension Point
maxNoOfPodsToEvictPerNode | int | nil | maximum number of pods evicted from each node (summed through all strategies)
maxNoOfPodsToEvictPerNamespace | int | nil | maximum number of pods evicted from each namespace (summed through all strategies)

Evictor Plugin configuration (Default Evictor)

The Default Evictor plugin is used by default for filtering pods before processing them in a strategy plugin, and for applying a PreEvictionFilter to pods before eviction. You can also create your own Evictor plugin or use the default one provided by the descheduler. Other uses for the Evictor plugin are to sort, filter, validate, or group pods by different criteria, which is why this is handled by a plugin rather than configured in the top-level config. A configuration sketch follows the parameter table below.

Name | Type | Default Value | Description
nodeSelector | string | nil | limiting the nodes which are processed
evictLocalStoragePods | bool | false | allows eviction of pods with local storage
evictSystemCriticalPods | bool | false | [Warning: Will evict Kubernetes system pods] allows eviction of pods with any priority, including system pods like kube-dns
ignorePvcPods | bool | false | set whether PVC pods should be evicted or ignored
evictFailedBarePods | bool | false | allows eviction of pods without owner references that are in the failed phase
labelSelector | metav1.LabelSelector | | (see label filtering)
priorityThreshold | priorityThreshold | | (see priority filtering)
nodeFit | bool | false | (see node fit filtering)
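
To illustrate how these options fit together, a DefaultEvictor configuration combining several of the fields above might look like the sketch below; the label and priority class names are hypothetical:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "DefaultEvictor"
      args:
        evictLocalStoragePods: false
        ignorePvcPods: true
        evictFailedBarePods: true
        nodeFit: true
        labelSelector:              # (see label filtering)
          matchLabels:
            app: my-app             # hypothetical label
        priorityThreshold:          # (see priority filtering)
          name: "my-priority-class" # hypothetical priority class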

Example policy

As part of the policy, you first decide which top-level configuration to use, then which Evictor plugin to use (your own if you have one, the Default Evictor otherwise), followed by the configuration passed to the Evictor plugin. By default, the Default Evictor is enabled for both the filter and preEvictionFilter extension points. After that, you enable/disable eviction strategy plugins and configure them appropriately.

See each strategy plugin section for details on available parameters.

Policy:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
nodeSelector: "node=node1" # you don't need to set this, if not set all will be processed
maxNoOfPodsToEvictPerNode: 5000 # you don't need to set this, unlimited if not set
maxNoOfPodsToEvictPerNamespace: 5000 # you don't need to set this, unlimited if not set
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "DefaultEvictor"
      args:
        evictSystemCriticalPods: true
        evictFailedBarePods: true
        evictLocalStoragePods: true
        nodeFit: true
    plugins:
      # DefaultEvictor is enabled for both `filter` and `preEvictionFilter`
      # filter:
      #   enabled:
      #     - "DefaultEvictor"
      # preEvictionFilter:
      #   enabled:
      #     - "DefaultEvictor"
      deschedule:
        enabled:
          - ...
      balance:
        enabled:
          - ...
      [...]

The following diagram provides a visualization of most of the strategies to help categorize how strategies fit together.

Strategies diagram

The following sections provide an overview of the different strategy plugins available. These plugins are grouped based on their implementation of extension points: Deschedule or Balance.

Deschedule Plugins: These plugins process pods one by one, and evict them in a sequential manner.

Balance Plugins: These plugins process all pods, or groups of pods, and determine which pods to evict based on how the group was intended to be spread.

Name | Extension Point Implemented | Description
RemoveDuplicates | Balance | Spreads replicas
LowNodeUtilization | Balance | Spreads pods according to pods resource requests and node resources available
HighNodeUtilization | Balance | Spreads pods according to pods resource requests and node resources available
RemovePodsViolatingInterPodAntiAffinity | Deschedule | Evicts pods violating pod anti affinity
RemovePodsViolatingNodeAffinity | Deschedule | Evicts pods violating node affinity
RemovePodsViolatingNodeTaints | Deschedule | Evicts pods violating node taints
RemovePodsViolatingTopologySpreadConstraint | Balance | Evicts pods violating TopologySpreadConstraints
RemovePodsHavingTooManyRestarts | Deschedule | Evicts pods having too many restarts
PodLifeTime | Deschedule | Evicts pods that have exceeded a specified age limit
RemoveFailedPods | Deschedule | Evicts pods with certain failed reasons

RemoveDuplicates

This strategy plugin makes sure that there is only one pod associated with a ReplicaSet (RS), ReplicationController (RC), StatefulSet, or Job running on the same node. If there are more, those duplicate pods are evicted for better spreading of pods in a cluster. This issue could happen if some nodes went down for whatever reason and pods on them were moved to other nodes, leading to more than one pod associated with, for example, the same RS or RC running on the same node. Once the failed nodes are ready again, this strategy can be enabled to evict those duplicate pods.

It provides one optional parameter, excludeOwnerKinds, which is a list of OwnerRef Kinds. If a pod has any of these Kinds listed as an OwnerRef, that pod will not be considered for eviction. Note that pods created by Deployments are considered for eviction by this strategy. The excludeOwnerKinds parameter should include ReplicaSet to have pods created by Deployments excluded.

Parameters:

Name Type
excludeOwnerKinds list(string)
namespaces (see namespace filtering)

Example:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "RemoveDuplicates"
      args:
        excludeOwnerKinds:
          - "ReplicaSet"
    plugins:
      balance:
        enabled:
          - "RemoveDuplicates"

LowNodeUtilization

This strategy finds nodes that are under utilized and evicts pods, if possible, from other nodes in the hope that recreation of evicted pods will be scheduled on these underutilized nodes. The parameters of this strategy are configured under nodeResourceUtilizationThresholds.

The underutilization of nodes is determined by the configurable threshold thresholds, which can be configured for cpu, memory, number of pods, and extended resources in terms of percentage (the percentage is calculated as the current resources requested on the node vs total allocatable; for pods, this means the number of pods on the node as a fraction of the pod capacity set for that node).

If a node's usage is below the threshold for all of cpu, memory, number of pods, and extended resources, the node is considered underutilized. Currently, pods' resource requests are considered for computing node resource utilization.

There is another configurable threshold, targetThresholds, that is used to identify those potential nodes from which pods could be evicted. If a node's usage is above targetThresholds for any of cpu, memory, number of pods, or extended resources, the node is considered overutilized. Any node whose usage lies between thresholds and targetThresholds is considered appropriately utilized and is not considered for eviction. The targetThresholds threshold can likewise be configured for cpu, memory, and number of pods in terms of percentage.

These thresholds, thresholds and targetThresholds, can be tuned as per your cluster requirements. Note that this strategy evicts pods from overutilized nodes (those with usage above targetThresholds) to underutilized nodes (those with usage below thresholds); it will abort if the number of underutilized nodes or overutilized nodes is zero.

Additionally, the strategy accepts a useDeviationThresholds parameter. If that parameter is set to true, the thresholds are considered as percentage deviations from mean resource usage. thresholds will be deducted from the mean among all nodes and targetThresholds will be added to the mean. A resource consumption above (resp. below) this window is considered as overutilization (resp. underutilization).
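
For illustration, a configuration using deviation thresholds might look like this sketch (the percentages are arbitrary):

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "LowNodeUtilization"
      args:
        useDeviationThresholds: true
        thresholds:          # underutilized: usage more than 10% below the mean
          "cpu": 10
          "memory": 10
          "pods": 10
        targetThresholds:    # overutilized: usage more than 10% above the mean
          "cpu": 10
          "memory": 10
          "pods": 10
    plugins:
      balance:
        enabled:
          - "LowNodeUtilization"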

NOTE: Node resource consumption is determined by the requests and limits of pods, not actual usage. This approach is chosen in order to maintain consistency with the kube-scheduler, which follows the same design for scheduling pods onto nodes. This means that resource usage as reported by Kubelet (or commands like kubectl top) may differ from the calculated consumption, due to these components reporting actual usage metrics. Implementing metrics-based descheduling is currently TODO for the project.

Parameters:

Name Type
useDeviationThresholds bool
thresholds map(string:int)
targetThresholds map(string:int)
numberOfNodes int
evictableNamespaces (see namespace filtering)

Example:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "LowNodeUtilization"
      args:
        thresholds:
          "cpu" : 20
          "memory": 20
          "pods": 20
        targetThresholds:
          "cpu" : 50
          "memory": 50
          "pods": 50
    plugins:
      balance:
        enabled:
          - "LowNodeUtilization"

Policy should pass the following validation checks:

  • Three basic native types of resources are supported: cpu, memory, and pods. If any of these resource types is not specified, all its thresholds default to 100% to avoid nodes going from underutilized to overutilized.
  • Extended resources are supported. For example, the resource type nvidia.com/gpu can be specified for GPU node utilization. Extended resources are optional and are not used to compute a node's usage unless they are specified explicitly in both thresholds and targetThresholds.
  • Neither thresholds nor targetThresholds can be nil, and they must configure exactly the same types of resources.
  • The valid range of a resource's percentage value is [0, 100].
  • The percentage value of thresholds can not be greater than targetThresholds for the same resource.

There is another parameter associated with the LowNodeUtilization strategy, called numberOfNodes. This parameter can be configured to activate the strategy only when the number of underutilized nodes is above the configured value. This could be helpful in large clusters where a few nodes could go underutilized frequently or for a short period of time. By default, numberOfNodes is set to zero.
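
For instance, adding numberOfNodes to the pluginConfig from the example above would activate the strategy only when at least three nodes are underutilized (the count here is arbitrary):

    - name: "LowNodeUtilization"
      args:
        numberOfNodes: 3   # skip descheduling unless at least 3 nodes are underutilized
        thresholds:
          "cpu": 20
          "memory": 20
          "pods": 20
        targetThresholds:
          "cpu": 50
          "memory": 50
          "pods": 50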

HighNodeUtilization

This strategy finds nodes that are under utilized and evicts pods from the nodes in the hope that these pods will be scheduled compactly into fewer nodes. Used in conjunction with node auto-scaling, this strategy is intended to help trigger down scaling of under utilized nodes. This strategy must be used with the scheduler scoring strategy MostAllocated. The parameters of this strategy are configured under nodeResourceUtilizationThresholds.

Note: On GKE, it is not possible to customize the default scheduler config. Instead, you can use the optimze-utilization autoscaling strategy, which has the same effect as enabling the MostAllocated scheduler plugin. Alternatively, you can deploy a second custom scheduler and edit that scheduler's config yourself.

The underutilization of nodes is determined by the configurable threshold thresholds, which can be configured for cpu, memory, number of pods, and extended resources in terms of percentage. The percentage is calculated as the current resources requested on the node vs total allocatable. For pods, this means the number of pods on the node as a fraction of the pod capacity set for that node.

If a node's usage is below the threshold for all of cpu, memory, number of pods, and extended resources, the node is considered underutilized. Currently, pods' resource requests are considered for computing node resource utilization. Any node above thresholds is considered appropriately utilized and is not considered for eviction.

The thresholds param can be tuned as per your cluster requirements. Note that this strategy evicts pods from underutilized nodes (those with usage below thresholds) so that they can be recreated on appropriately utilized nodes. The strategy will abort if the number of underutilized nodes or appropriately utilized nodes is zero.

NOTE: Node resource consumption is determined by the requests and limits of pods, not actual usage. This approach is chosen in order to maintain consistency with the kube-scheduler, which follows the same design for scheduling pods onto nodes. This means that resource usage as reported by Kubelet (or commands like kubectl top) may differ from the calculated consumption, due to these components reporting actual usage metrics. Implementing metrics-based descheduling is currently TODO for the project.

Parameters:

Name Type
thresholds map(string:int)
numberOfNodes int
evictableNamespaces (see namespace filtering)

Example:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "HighNodeUtilization"
      args:
        thresholds:
          "cpu" : 20
          "memory": 20
          "pods": 20
        evictableNamespaces:
          namespaces:
            exclude:
            - "kube-system"
            - "namespace1"
    plugins:
      balance:
        enabled:
          - "HighNodeUtilization"

Policy should pass the following validation checks:

  • Three basic native types of resources are supported: cpu, memory, and pods. If any of these resource types is not specified, all its thresholds default to 100%.
  • Extended resources are supported. For example, the resource type nvidia.com/gpu can be specified for GPU node utilization. Extended resources are optional and are not used to compute a node's usage unless they are specified explicitly in thresholds.
  • thresholds can not be nil.
  • The valid range of a resource's percentage value is [0, 100].

There is another parameter associated with the HighNodeUtilization strategy, called numberOfNodes. This parameter can be configured to activate the strategy only when the number of under utilized nodes is above the configured value. This could be helpful in large clusters where a few nodes could go under utilized frequently or for a short period of time. By default, numberOfNodes is set to zero.

RemovePodsViolatingInterPodAntiAffinity

This strategy makes sure that pods violating inter-pod anti-affinity are removed from nodes. For example, if there is podA on a node, and podB and podC (running on the same node) have anti-affinity rules which prohibit them from running on the same node, then podA will be evicted from the node so that podB and podC can run. This issue could happen when the anti-affinity rules for podB and podC are created while they are already running on the node.

Parameters:

Name Type
namespaces (see namespace filtering)
labelSelector (see label filtering)

Example:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "RemovePodsViolatingInterPodAntiAffinity"
    plugins:
      deschedule:
        enabled:
          - "RemovePodsViolatingInterPodAntiAffinity"

RemovePodsViolatingNodeAffinity

This strategy makes sure that all pods violating node affinity are eventually removed from nodes. Node affinity rules allow a pod to specify the requiredDuringSchedulingIgnoredDuringExecution type, which tells the scheduler to respect node affinity when scheduling the pod but tells the kubelet to ignore it in case the node changes over time and no longer satisfies the affinity. When enabled, the strategy serves as a temporary implementation of requiredDuringSchedulingRequiredDuringExecution and evicts pods from nodes that no longer satisfy their node affinity.

For example, there is podA scheduled on nodeA, which satisfies the node affinity rule requiredDuringSchedulingIgnoredDuringExecution at the time of scheduling. Over time, nodeA stops satisfying the rule. When the strategy gets executed and there is another node available that satisfies the node affinity rule, podA gets evicted from nodeA.

Parameters:

Name Type
nodeAffinityType list(string)
namespaces (see namespace filtering)
labelSelector (see label filtering)

Example:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "RemovePodsViolatingNodeAffinity"
      args:
        nodeAffinityType:
        - "requiredDuringSchedulingIgnoredDuringExecution"
    plugins:
      deschedule:
        enabled:
          - "RemovePodsViolatingNodeAffinity"

RemovePodsViolatingNodeTaints

This strategy makes sure that pods violating NoSchedule taints on nodes are removed. For example, a pod "podA" with a toleration for the taint key=value:NoSchedule is scheduled and running on the tainted node. If the node's taint is subsequently updated/removed, the taint is no longer satisfied by the pod's tolerations and the pod will be evicted.

Node taints can be excluded from consideration by specifying a list of excludedTaints. If a node taint key or key=value matches an excludedTaints entry, the taint will be ignored.

For example, excludedTaints entry "dedicated" would match all taints with key "dedicated", regardless of value. excludedTaints entry "dedicated=special-user" would match taints with key "dedicated" and value "special-user".

Parameters:

Name Type
excludedTaints list(string)
includePreferNoSchedule bool
namespaces (see namespace filtering)
labelSelector (see label filtering)

Example:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "RemovePodsViolatingNodeTaints"
      args:
        excludedTaints:
        - dedicated=special-user # exclude taints with key "dedicated" and value "special-user"
        - reserved # exclude all taints with key "reserved"
    plugins:
      deschedule:
        enabled:
          - "RemovePodsViolatingNodeTaints"

RemovePodsViolatingTopologySpreadConstraint

This strategy makes sure that pods violating topology spread constraints are evicted from nodes. Specifically, it tries to evict the minimum number of pods required to balance topology domains to within each constraint's maxSkew. This strategy requires k8s version 1.18 at a minimum.

By default, this strategy only deals with hard constraints; setting the parameter includeSoftConstraints to true will include soft constraints as well.

Strategy parameter labelSelector is not utilized when balancing topology domains and is only applied during eviction to determine if the pod can be evicted.

Parameters:

Name Type
includeSoftConstraints bool
namespaces (see namespace filtering)
labelSelector (see label filtering)

Example:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "RemovePodsViolatingTopologySpreadConstraint"
      args:
        includeSoftConstraints: false
    plugins:
      balance:
        enabled:
          - "RemovePodsViolatingTopologySpreadConstraint"

RemovePodsHavingTooManyRestarts

This strategy makes sure that pods having too many restarts are removed from nodes. For example, a pod with an EBS/PD volume that can't get the volume/disk attached to the instance should be re-scheduled to other nodes. Its parameters include podRestartThreshold, which is the number of restarts (summed over all eligible containers) at which a pod should be evicted, and includingInitContainers, which determines whether init container restarts should be factored into that calculation.

Parameters:

Name Type
podRestartThreshold int
includingInitContainers bool
namespaces (see namespace filtering)
labelSelector (see label filtering)

Example:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "RemovePodsHavingTooManyRestarts"
      args:
        podRestartThreshold: 100
        includingInitContainers: true
    plugins:
      deschedule:
        enabled:
          - "RemovePodsHavingTooManyRestarts"

PodLifeTime

This strategy evicts pods that are older than maxPodLifeTimeSeconds.

You can also specify the states parameter to restrict eviction to pods matching particular states (for example Pending or PodInitializing, as used in the example below).

If a value for states or podStatusPhases is not specified, pods in any state (even Running) are considered for eviction.

Parameters:

Name Type Notes
maxPodLifeTimeSeconds int
states list(string) Only supported in v0.25+
namespaces (see namespace filtering)
labelSelector (see label filtering)

Example:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "PodLifeTime"
      args:
        maxPodLifeTimeSeconds: 86400
        states:
        - "Pending"
        - "PodInitializing"
    plugins:
      deschedule:
        enabled:
          - "PodLifeTime"

RemoveFailedPods

This strategy evicts pods that are in the Failed status phase. You can provide an optional reasons parameter to filter by failure reasons. The reasons can be expanded to include the reasons of init containers as well by setting the optional parameter includingInitContainers to true. You can specify the optional parameter minPodLifetimeSeconds to evict only pods older than the specified number of seconds. Lastly, you can specify the optional parameter excludeOwnerKinds; if a pod has any of these Kinds listed as an OwnerRef, that pod will not be considered for eviction.

Parameters:

Name Type
minPodLifetimeSeconds uint
excludeOwnerKinds list(string)
reasons list(string)
includingInitContainers bool
namespaces (see namespace filtering)
labelSelector (see label filtering)

Example:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "RemoveFailedPods"
      args:
        reasons:
        - "NodeAffinity"
        includingInitContainers: true
        excludeOwnerKinds:
        - "Job"
        minPodLifetimeSeconds: 3600
    plugins:
      deschedule:
        enabled:
          - "RemoveFailedPods"

Filter Pods

Namespace filtering

The following strategies accept a namespaces parameter, which allows specifying a list of namespaces to include or, respectively, exclude:

  • PodLifeTime
  • RemovePodsHavingTooManyRestarts
  • RemovePodsViolatingNodeTaints
  • RemovePodsViolatingNodeAffinity
  • RemovePodsViolatingInterPodAntiAffinity
  • RemoveDuplicates
  • RemovePodsViolatingTopologySpreadConstraint
  • RemoveFailedPods

The following strategies accept an evictableNamespaces parameter, which allows specifying a list of namespaces to exclude:

  • LowNodeUtilization and HighNodeUtilization (Only filtered right before eviction)

For example with PodLifeTime:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "PodLifeTime"
      args:
        maxPodLifeTimeSeconds: 86400
        namespaces:
          include:
          - "namespace1"
          - "namespace2"
    plugins:
      deschedule:
        enabled:
          - "PodLifeTime"

In this example, PodLifeTime gets executed only over namespace1 and namespace2. The same holds for the exclude field:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "PodLifeTime"
      args:
        maxPodLifeTimeSeconds: 86400
        namespaces:
          exclude:
          - "namespace1"
          - "namespace2"
    plugins:
      deschedule:
        enabled:
          - "PodLifeTime"

The strategy gets executed over all namespaces except namespace1 and namespace2.

It is not allowed to combine the include and exclude fields.

Priority filtering

A priority threshold can be configured via the Default Evictor filter; only pods with a priority lower than the threshold can be evicted. You can specify this threshold by setting the priorityThreshold.name parameter (which sets the threshold to the value of the given priority class) or the priorityThreshold.value parameter (which sets the threshold directly). By default, this threshold is set to the value of the system-cluster-critical priority class.

Note: Setting evictSystemCriticalPods to true disables priority filtering entirely.

E.g.

Setting priorityThreshold value

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "DefaultEvictor"
      args:
        priorityThreshold:
          value: 10000
    - name: "PodLifeTime"
      args:
        maxPodLifeTimeSeconds: 86400
    plugins:
      deschedule:
        enabled:
          - "PodLifeTime"

Setting Priority Threshold Class Name (priorityThreshold.name)

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "DefaultEvictor"
      args:
        priorityThreshold:
          name: "priorityClassName1"
    - name: "PodLifeTime"
      args:
        maxPodLifeTimeSeconds: 86400
    plugins:
      deschedule:
        enabled:
          - "PodLifeTime"

Note that you can't configure both priorityThreshold.name and priorityThreshold.value. Also, if the given priority class does not exist, the descheduler won't create it and will throw an error.

Label filtering

The following strategies can configure a standard kubernetes labelSelector to filter pods by their labels:

  • PodLifeTime
  • RemovePodsHavingTooManyRestarts
  • RemovePodsViolatingNodeTaints
  • RemovePodsViolatingNodeAffinity
  • RemovePodsViolatingInterPodAntiAffinity
  • RemovePodsViolatingTopologySpreadConstraint
  • RemoveFailedPods

This allows running strategies only against the pods the descheduler is interested in.

For example:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "PodLifeTime"
      args:
        maxPodLifeTimeSeconds: 86400
        labelSelector:
          matchLabels:
            component: redis
          matchExpressions:
            - {key: tier, operator: In, values: [cache]}
            - {key: environment, operator: NotIn, values: [dev]}
    plugins:
      deschedule:
        enabled:
          - "PodLifeTime"

Node Fit filtering

NodeFit can be configured via the Default Evictor Filter. If set to true the descheduler will consider whether or not the pods that meet eviction criteria will fit on other nodes before evicting them. If a pod cannot be rescheduled to another node, it will not be evicted. Currently the following criteria are considered when setting nodeFit to true:

  • A nodeSelector on the pod
  • Any tolerations on the pod and any taints on the other nodes
  • nodeAffinity on the pod
  • Resource requests made by the pod and the resources available on other nodes
  • Whether any of the other nodes are marked as unschedulable

E.g.

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: ProfileName
    pluginConfig:
    - name: "DefaultEvictor"
      args:
        nodeFit: true
    - name: "PodLifeTime"
      args:
        maxPodLifeTimeSeconds: 86400
    plugins:
      deschedule:
        enabled:
          - "PodLifeTime"

Note that node fit filtering references the current pod spec, and not that of its owner. Thus, if the pod is owned by a ReplicationController (and that ReplicationController was modified recently), the pod may be running with an outdated spec, which the descheduler will reference when determining node fit. This is expected behavior, as the descheduler is a "best-effort" mechanism.

Using Deployments instead of ReplicationControllers provides an automated rollout of pod spec changes, therefore ensuring that the descheduler has an up-to-date view of the cluster state.

Pod Evictions

When the descheduler decides to evict pods from a node, it employs the following general mechanism:

  • Critical pods (with priorityClassName set to system-cluster-critical or system-node-critical) are never evicted (unless evictSystemCriticalPods: true is set).
  • Pods (static or mirrored pods or standalone pods) that are not part of a ReplicationController, ReplicaSet (Deployment), StatefulSet, or Job are never evicted because these pods won't be recreated. (Standalone pods in the failed status phase can be evicted by setting evictFailedBarePods: true.)
  • Pods associated with DaemonSets are never evicted.
  • Pods with local storage are never evicted (unless evictLocalStoragePods: true is set).
  • Pods with PVCs are evicted (unless ignorePvcPods: true is set).
  • In LowNodeUtilization and RemovePodsViolatingInterPodAntiAffinity, pods are evicted by their priority from low to high, and if they have the same priority, best effort pods are evicted before burstable and guaranteed pods.
  • All types of pods with the annotation descheduler.alpha.kubernetes.io/evict are eligible for eviction. This annotation is used to override the checks which prevent eviction, so users can select which pods get evicted. Users should know how and if the pod will be recreated. The annotation only affects internal descheduler checks; the anti-disruption protection provided by the /eviction subresource is still respected. (An example manifest follows this list.)
  • Pods with a non-nil DeletionTimestamp are not evicted by default.
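
A minimal pod manifest carrying the eviction annotation mentioned above might look like this; the pod name and image are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: example-pod                        # placeholder name
  annotations:
    # opt this pod in for descheduler eviction despite the usual safety checks
    descheduler.alpha.kubernetes.io/evict: "true"
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9       # placeholder image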

Setting --v=4 or greater on the Descheduler will log all reasons why any pod is not evictable.

Pod Disruption Budget (PDB)

Pods subject to a Pod Disruption Budget (PDB) are not evicted if descheduling would violate the PDB. Pods are evicted via the eviction subresource so that PDBs are honored.
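
For example, a standard PodDisruptionBudget like the one below (the name and selector are illustrative) would stop the descheduler from evicting matching pods whenever doing so would leave fewer than minAvailable replicas running:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb            # illustrative name
spec:
  minAvailable: 2             # keep at least 2 matching pods available at all times
  selector:
    matchLabels:
      app: my-app             # illustrative label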

High Availability

In High Availability mode, the descheduler starts a leader election process in Kubernetes. You can activate HA mode if you choose to deploy the descheduler as a Deployment.

The Deployment starts with 1 replica by default. If you want to use more than 1 replica, you must enable High Availability mode, since multiple descheduler pods should not run descheduling simultaneously.

Configure HA Mode

The leader election process can be enabled by setting --leader-elect in the CLI. You can also set --set=leaderElection.enabled=true flag if you are using Helm.

To get the best results from HA mode, some additional configuration might be required (a Helm sketch follows this list):

  • Configure a podAntiAffinity rule if you want to schedule onto a node only if that node is in the same zone as at least one already-running descheduler
  • Set the replica count greater than 1
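
As a sketch, enabling HA via Helm might look like the following. Only leaderElection.enabled is documented above; the kind and replicas value names are assumptions about the chart, so verify them against the helm chart README:

# deploy as a Deployment with 2 replicas and leader election enabled (value names assumed)
helm install descheduler descheduler/descheduler --namespace kube-system \
  --set kind=Deployment \
  --set replicas=2 \
  --set leaderElection.enabled=true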

Metrics

name | type | description
build_info | gauge | constant 1
pods_evicted | CounterVec | total number of pods evicted

The metrics are served through https://localhost:10258/metrics by default. The address and port can be changed by setting --binding-address and --secure-port flags.
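
To inspect the metrics manually, a port-forward like the following works, assuming the descheduler runs as a Deployment named descheduler in kube-system (the resource name and namespace are assumptions):

kubectl -n kube-system port-forward deploy/descheduler 10258:10258
# the endpoint serves HTTPS with a self-signed certificate, hence --insecure
curl --insecure https://localhost:10258/metrics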

Compatibility Matrix

The compatibility matrix below shows the k8s client package (client-go, apimachinery, etc.) versions that the descheduler is compiled with. At this time the descheduler does not have a hard dependency on a specific k8s release. However, a particular descheduler release is only tested against the three latest k8s minor versions. For example, descheduler v0.18 should work with k8s v1.18, v1.17, and v1.16.

Starting with descheduler release v0.18 the minor version of descheduler matches the minor version of the k8s client packages that it is compiled with.

Descheduler Supported Kubernetes Version
v0.27 v1.27
v0.26 v1.26
v0.25 v1.25
v0.24 v1.24
v0.23 v1.23
v0.22 v1.22
v0.21 v1.21
v0.20 v1.20
v0.19 v1.19
v0.18 v1.18
v0.10 v1.17
v0.4-v0.9 v1.9+
v0.1-v0.3 v1.7-v1.8

Getting Involved and Contributing

Are you interested in contributing to descheduler? We, the maintainers and community, would love your suggestions, contributions, and help! Also, the maintainers can be contacted at any time to learn more about how to get involved.

To get started writing code see the contributor guide in the /docs directory.

In the interest of getting more new people involved, we tag issues with good first issue. These are typically issues that have smaller scope but are good ways to start getting acquainted with the codebase.

We also encourage ALL active community participants to act as if they are maintainers, even if you don't have "official" write permissions. This is a community effort, we are here to serve the Kubernetes community. If you have an active interest and you want to get involved, you have real power! Don't assume that the only people who can get things done around here are the "maintainers".

We also would love to add more "official" maintainers, so show us what you can do!

This repository uses the Kubernetes bots. See the full list of commands in the Prow documentation.

Communicating With Contributors

You can reach the contributors of this project through the standard Kubernetes community channels.

Learn how to engage with the Kubernetes community on the community page.

Roadmap

This roadmap is not in any particular order.

  • Consideration of pod affinity
  • Strategy to consider number of pending pods
  • Integration with cluster autoscaler
  • Integration with metrics providers for obtaining real load metrics
  • Consideration of the Kubernetes scheduler's predicates

Code of conduct

Participation in the Kubernetes community is governed by the Kubernetes Code of Conduct.
