Kubernetes v1.35: Introducing Workload Aware Scheduling
Scheduling large workloads is a much more complex and fragile operation than scheduling a single Pod, as it often requires considering all Pods together instead of scheduling each one independently. For example, when scheduling a machine learning batch job, you often need to place each worker strategically, such as on the same rack, to make the entire process as efficient as possible. At the same time, the Pods that make up such a workload are very often identical from the scheduling perspective, which fundamentally changes how the scheduling process should work.
There are many custom schedulers adapted to perform workload scheduling efficiently,
but considering how common and important workload scheduling is to Kubernetes users,
especially in the AI era with the growing number of use cases,
it is high time to make workloads first-class citizens for kube-scheduler and support them natively.
Workload aware scheduling
The recent v1.35 release of Kubernetes delivered the first tranche of workload aware scheduling improvements. These are part of a wider effort aimed at improving the scheduling and management of workloads. The effort will span many SIGs and releases, gradually expanding the capabilities of the system toward the north star goal: seamless workload scheduling and management in Kubernetes, including, but not limited to, preemption and autoscaling.
Kubernetes v1.35 introduces the Workload API that you can use to describe the desired shape
as well as scheduling-oriented requirements of the workload. It comes with an initial implementation
of gang scheduling that instructs the kube-scheduler to schedule gang Pods in an all-or-nothing fashion.
Finally, we improved the scheduling of identical Pods (which typically make up a gang), speeding up
the process with the opportunistic batching feature.
Workload API
The new Workload API resource is part of the `scheduling.k8s.io/v1alpha1` API group.
This resource acts as a structured, machine-readable definition of the scheduling requirements
of a multi-Pod application. While user-facing workloads like Jobs define what to run, the Workload resource
determines how a group of Pods should be scheduled and how its placement should be managed
throughout its lifecycle.
A Workload allows you to define a group of Pods and apply a scheduling policy to them.
Here is what a gang scheduling configuration looks like. You can define a `podGroup` named `workers`
and apply the `gang` policy with a `minCount` of 4.
```yaml
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-job-workload
  namespace: some-ns
spec:
  podGroups:
  - name: workers
    policy:
      gang:
        # The gang is schedulable only if 4 pods can run at once
        minCount: 4
```
When you create your Pods, you link them to this Workload using the new `workloadRef` field:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  namespace: some-ns
spec:
  workloadRef:
    name: training-job-workload
    podGroup: workers
  ...
```
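In practice, a controller usually creates gang Pods for you. The sketch below is illustrative rather than authoritative: it assumes a Job can carry the `workloadRef` field in its Pod template (the Job name, image, and resource values are made up for the example), so that every worker it creates joins the same pod group:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
  namespace: some-ns
spec:
  completions: 4
  parallelism: 4
  template:
    spec:
      # Assumed wiring: each Pod created from this template carries the
      # same workloadRef, so all 4 workers join the "workers" pod group.
      workloadRef:
        name: training-job-workload
        podGroup: workers
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example.com/trainer:latest  # illustrative image
        resources:
          requests:
            cpu: "4"
```

With `parallelism: 4` matching the pod group's `minCount` of 4, the whole gang either schedules together or keeps waiting.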
How gang scheduling works
The gang policy enforces all-or-nothing placement. Without gang scheduling,
a Job might be partially scheduled, consuming resources without being able to run,
leading to resource wastage and potential deadlocks.
When you create Pods that are part of a gang-scheduled pod group, the scheduler's `GangScheduling`
plugin manages the lifecycle independently for each pod group (or replica key):

- When you create your Pods (or a controller makes them for you), the scheduler blocks them from scheduling until:
  - The referenced Workload object is created.
  - The referenced pod group exists in that Workload.
  - The number of pending Pods in that group meets your `minCount`.
- Once enough Pods arrive, the scheduler tries to place them. However, instead of binding them to nodes immediately, the Pods wait at a `Permit` gate.
- The scheduler checks if it has found valid assignments for the entire group (at least the `minCount`).
  - If there is room for the group, the gate opens, and all Pods are bound to nodes.
  - If only a subset of the group's Pods was successfully scheduled within a timeout (set to 5 minutes), the scheduler rejects all of the Pods in the group. They go back to the queue, freeing up the reserved resources for other workloads.
We'd like to point out that while this is a first implementation, the Kubernetes project firmly intends to improve and expand the gang scheduling algorithm in future releases. Benefits we hope to deliver include a single-cycle scheduling phase for a whole gang, workload-level preemption, and more, moving toward the north star goal.
Opportunistic batching
In addition to explicit gang scheduling, v1.35 introduces opportunistic batching. This is a Beta feature that improves scheduling latency for identical Pods.
Unlike gang scheduling, this feature does not require the Workload API or any explicit opt-in on the user's part. It works opportunistically within the scheduler by identifying Pods that have identical scheduling requirements (container images, resource requests, affinities, etc.). When the scheduler processes a Pod, it can reuse the feasibility calculations for subsequent identical Pods in the queue, significantly speeding up the process.
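As an illustration (the Job name, image, and sizes below are assumptions for the example), all 32 Pods created by a Job like this one look the same to the scheduler, so the feasibility results computed for the first Pod can be reused for the rest:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-inference
spec:
  completions: 32
  parallelism: 32
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example.com/inference:latest  # illustrative image
        resources:
          requests:
            cpu: "2"
            memory: 8Gi
```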
Most users will benefit from this optimization automatically, without taking any special steps, provided their Pods meet the following criteria.
Restrictions
Opportunistic batching works under specific conditions. All fields used by the kube-scheduler
to find a placement must be identical between Pods. Additionally, using some features
disables the batching mechanism for those Pods to ensure correctness.
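For example (a hedged sketch; the Pod names and zone label value are assumptions), the two Pods below would not batch together: everything matches except that the second Pod adds a node selector, which changes the inputs the scheduler uses to find a placement:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker-a
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest  # illustrative image
---
apiVersion: v1
kind: Pod
metadata:
  name: worker-b
spec:
  # The extra node selector makes worker-b's placement inputs differ
  # from worker-a's, so the two Pods cannot share feasibility results.
  nodeSelector:
    topology.kubernetes.io/zone: zone-a
  containers:
  - name: app
    image: registry.example.com/app:latest
```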
Note that you may need to review your kube-scheduler configuration
to ensure it is not implicitly disabling batching for your workloads.
See the docs for more details about restrictions.
The north star vision
The project has a broad ambition to deliver workload aware scheduling. These new APIs and scheduling enhancements are just the first steps. In the near future, the effort aims to tackle:
- Introducing a workload scheduling phase
- Improved support for multi-node DRA and topology aware scheduling
- Workload-level preemption
- Improved integration between scheduling and autoscaling
- Improved interaction with external workload schedulers
- Managing placement of workloads throughout their entire lifecycle
- Multi-workload scheduling simulations
And more. The priority and implementation order of these focus areas are subject to change. Stay tuned for further updates.
Getting started
To try the workload aware scheduling improvements:
- Workload API: Enable the `GenericWorkload` feature gate on both kube-apiserver and kube-scheduler, and ensure the `scheduling.k8s.io/v1alpha1` API group is enabled (see the kubeadm sketch after this list for one way to set these flags).
- Gang scheduling: Enable the `GangScheduling` feature gate on kube-scheduler (requires the Workload API to be enabled).
- Opportunistic batching: As a Beta feature, it is enabled by default in v1.35. You can disable it using the `OpportunisticBatching` feature gate on kube-scheduler if needed.
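If you run a kubeadm-managed cluster, one way to set these flags is through the kubeadm ClusterConfiguration. A minimal sketch, assuming kubeadm's v1beta4 configuration API (adjust to however you manage your control plane):

```yaml
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
apiServer:
  extraArgs:
  - name: feature-gates
    value: "GenericWorkload=true"
  - name: runtime-config
    value: "scheduling.k8s.io/v1alpha1=true"  # enable the alpha API group
scheduler:
  extraArgs:
  - name: feature-gates
    value: "GenericWorkload=true,GangScheduling=true"
```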
We encourage you to try out workload aware scheduling in your test clusters and share your experiences to help shape the future of Kubernetes scheduling. You can send your feedback by:
- Reaching out via Slack (#sig-scheduling).
- Commenting on the workload aware scheduling tracking issue.
- Filing a new issue in the Kubernetes repository.
Learn more
- Read the KEPs for the Workload API and gang scheduling, and for opportunistic batching.
- Track the Workload aware scheduling issue for recent updates.