How Alibaba Ensures K8S Performance at Large Scale
The performance testing tool used is Kubemark
Read and write latency
Space limitation (etcd capacity)
Memory limitation
The API server encountered extremely high latency when querying pods and nodes, and the back-end etcd sometimes also ran out of memory due to concurrent query requests.
The controller encountered high processing latency because it could not detect the latest changes from the API server. After an abnormal restart, service recovery would often take several minutes.
The scheduler encountered high latency and low throughput.
Dump some data stored in etcd into a cache cluster, e.g. the heartbeat changelog from the kubelet. In Alibaba's case, the cache is Tair (an elastic cache platform)
Partition the different types of objects on the API server across different etcd clusters
Improve the internal etcd implementation by making the underlying bbolt DB more robust (CNCF paper)
The node heartbeat generates a large changelog in etcd and a lot of CPU overhead
Use the K8S built-in Lease API, whose objects are much smaller than Node objects (see the sketch below)
For compatibility, the Node object is still updated, but at a much lower frequency.
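Below is a minimal client-go sketch of the Lease-based heartbeat idea. The node name, the 10-second interval, and the standalone loop are illustrative assumptions, not Alibaba's or the kubelet's actual implementation; the point is that renewing a Lease in the kube-node-lease namespace writes only a few hundred bytes, versus rewriting the multi-kilobyte Node object.

```go
package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// renewNodeLease bumps the RenewTime on the per-node Lease object, which is
// tiny compared to a full Node status update.
func renewNodeLease(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	lease, err := cs.CoordinationV1().Leases("kube-node-lease").Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	now := metav1.NewMicroTime(time.Now())
	lease.Spec.RenewTime = &now
	_, err = cs.CoordinationV1().Leases("kube-node-lease").Update(ctx, lease, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// Illustrative cadence: renew the Lease every 10s; the full Node object
	// would only be updated on a much longer period, purely for compatibility.
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()
	for range ticker.C {
		_ = renewNodeLease(context.TODO(), cs, "node-1") // "node-1" is a placeholder
	}
}
```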
Although K8S can be configured with multiple masters for HA, the load among the API servers may still be unbalanced. During upgrades or node restarts, traffic across the API servers can become skewed.
Add load balancer on API Server side
Add load balancer on Kubelet side
In normal cases, load switching does not occur after a kubelet is connected to an API server, because in most cases the client reuses the same underlying TLS connection.
This was not invented by Alibaba; it is a SIG enhancement introduced in February 2019. What it does is let the API Server push the latest ResourceVersion value to the client at the proper time so that the client's version can keep up with the version updates on the API server. This reduces "too old resource version" errors on the client and reduces how often the client has to re-list objects from the API Server.
Stage: Alpha
Feature group: API
A given Kubernetes server will only preserve a historical list of changes for a limited time. Clusters using etcd3 preserve changes in the last 5 minutes by default.
The “bookmark“ watch event is used as a checkpoint, indicating that all objects up to a given resourceVersion that the client is requesting have already been sent.
For example, if a given Watch is requesting all the events starting with resourceVersion X and the API knows that this Watch is not interested in any of the events up to a much higher version, the API can skip sending all these events using a bookmark, avoiding unnecessary processing on both sides.
This proposal makes restarting watches cheaper from the kube-apiserver performance perspective.
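A hedged client-go sketch of consuming bookmark events (the namespace and the plain Watch call are assumptions for brevity; real controllers would use an informer): the client opts in with AllowWatchBookmarks and records the resourceVersion carried by each bookmark, so a restarted watch can resume from it instead of re-listing.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// Opt in to bookmark events so the server can periodically tell us the
	// latest resourceVersion even when nothing we watch has changed.
	w, err := cs.CoreV1().Pods("default").Watch(context.TODO(), metav1.ListOptions{
		AllowWatchBookmarks: true,
	})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	lastRV := ""
	for ev := range w.ResultChan() {
		switch ev.Type {
		case watch.Bookmark:
			// A bookmark carries essentially only metadata.resourceVersion;
			// remember it so a restarted watch can resume from this point.
			if pod, ok := ev.Object.(*corev1.Pod); ok {
				lastRV = pod.ResourceVersion
			}
		case watch.Added, watch.Modified, watch.Deleted:
			if pod, ok := ev.Object.(*corev1.Pod); ok {
				lastRV = pod.ResourceVersion
				fmt.Println(ev.Type, pod.Name, lastRV)
			}
		}
	}
}
```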
Clients can access the API Server through direct queries, and the API Server processes those queries by reading data from etcd. This has the following issues:
Unable to support indexing: A query for the pods on a node must first retrieve all pods in the cluster, which results in huge overhead.
Pagination against large amounts of request data: Queries over massive data sets are completed through pagination, which results in multiple round trips that greatly reduce performance (see the sketch after this list).
Quorum reads against etcd add extra query overhead.
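To make the pagination cost concrete, here is an illustrative client-go sketch (the 500-item page size is an arbitrary assumption): each page is a separate round trip to the API server, so listing a million objects this way takes thousands of sequential requests.

```go
package listdemo

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listAllPods pages through pods 500 at a time. Every page is a separate
// round trip to the API server (and, without a cache, to etcd), so listing
// a very large cluster this way is inherently slow.
func listAllPods(ctx context.Context, cs kubernetes.Interface) ([]corev1.Pod, error) {
	var pods []corev1.Pod
	opts := metav1.ListOptions{Limit: 500}
	for {
		list, err := cs.CoreV1().Pods(metav1.NamespaceAll).List(ctx, opts)
		if err != nil {
			return nil, err
		}
		pods = append(pods, list.Items...)
		if list.Continue == "" {
			return pods, nil
		}
		opts.Continue = list.Continue
	}
}
```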
Alibaba uses a "data collaboration" mechanism to reduce the time to describe a node from 5s to 0.3s. The workflow is shown below, followed by a simplified sketch:
The client queries the API servers at time t0.
The API servers request the current data version rv@t0 from etcd.
The API servers update their request progress and wait until the data version of the reflector reaches rv@t0.
A response is sent to the client through the cache.
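The sketch below is a simplified, hypothetical illustration of that workflow, not the actual API server code: the reflector loop keeps an in-memory cache and its last-applied revision, and a query first obtains the current etcd revision (rv@t0), waits until the cache has caught up to it, and then answers from the cache.

```go
package cachedemo

import (
	"context"
	"sync"
)

// cacheServer is a hypothetical sketch of the "data collaboration" flow:
// reads are served from an in-memory cache, but only after the cache has
// caught up with the etcd revision observed when the request arrived.
type cacheServer struct {
	mu       sync.Mutex
	cond     *sync.Cond
	revision int64             // latest revision applied by the reflector
	objects  map[string]string // cached objects, keyed by name (illustrative)
}

func newCacheServer() *cacheServer {
	s := &cacheServer{objects: map[string]string{}}
	s.cond = sync.NewCond(&s.mu)
	return s
}

// applyEvent is called by the watch/reflector loop for every observed change.
func (s *cacheServer) applyEvent(rev int64, name, value string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.objects[name] = value
	s.revision = rev
	s.cond.Broadcast()
}

// listAtLeast answers a client query: the caller first fetches the current
// etcd revision (rv@t0), then this method waits until the cache reflects at
// least that revision before reading from it.
func (s *cacheServer) listAtLeast(ctx context.Context, currentEtcdRevision int64) map[string]string {
	s.mu.Lock()
	defer s.mu.Unlock()
	for s.revision < currentEtcdRevision {
		s.cond.Wait() // real code would also honor ctx cancellation / a timeout
	}
	out := make(map[string]string, len(s.objects))
	for k, v := range s.objects {
		out[k] = v
	}
	return out
}
```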
When a client initiates a request, the API Server receives and processes it. If the client terminates the request, the API Server keeps processing it regardless, which results in a backlog of work and held resources. If the client then initiates new requests, the API Server will process those as well. This can easily drive the API Server out of memory and crash it.
Alibaba's solution: after the client's request ends, the API Server should also stop processing the request and reclaim its resources.
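A generic Go sketch of that idea (not Alibaba's code): by tying the server-side work to the request's context, the handler notices when the client disconnects and stops early, so its resources can be reclaimed instead of piling up.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// slowHandler ties its work to r.Context(). When the client disconnects,
// the context is cancelled and the handler stops instead of continuing to
// build a response nobody will read.
func slowHandler(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()
	for i := 0; i < 100; i++ {
		select {
		case <-ctx.Done():
			// Client went away (or the request timed out): stop and free resources.
			fmt.Println("request aborted:", ctx.Err())
			return
		case <-time.After(100 * time.Millisecond):
			// Simulate one chunk of expensive work per iteration.
		}
	}
	fmt.Fprintln(w, "done")
}

func main() {
	http.HandleFunc("/slow", slowHandler)
	_ = http.ListenAndServe(":8080", nil)
}
```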
At Alibaba, the API Server crashes are mainly caused by:
Issues during API Server restarts or updates: whenever the API server needs to be restarted or updated, all of Kubernetes's components must reconnect to it, which means that every component sends requests to the API Server at roughly the same time.
To resolve the problem described, the team at Alibaba used an “Active Reject” function. With this enabled, the API Server will reject large incoming requests that its cache cannot handle. A rejection returns a 429 status code to the client, which causes the client to wait for some time before sending a new request. Generally speaking, this optimizes the I/O between the API server and its clients and reduces the chance of the API server crashing.
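One common way to implement this kind of active rejection in Go is a max-in-flight gate in front of the handler, sketched below; the limit value and the Retry-After hint are illustrative assumptions, not Alibaba's actual parameters.

```go
package apiserverdemo

import "net/http"

// newMaxInflightHandler wraps h and immediately rejects requests once `limit`
// are already in flight, returning 429 plus a Retry-After hint so clients
// back off instead of piling more load onto a struggling server.
func newMaxInflightHandler(h http.Handler, limit int) http.Handler {
	sem := make(chan struct{}, limit)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			h.ServeHTTP(w, r)
		default:
			w.Header().Set("Retry-After", "5")
			http.Error(w, "too many requests", http.StatusTooManyRequests)
		}
	})
}
```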
A client has a bug that causes an abnormal number of requests: one of the most typical examples is a bug in a DaemonSet component, which multiplies the number of requests by the number of nodes and ends up crashing the API Server.
To resolve this problem, you can use emergency traffic control based on User-Agent to dynamically limit requests by their source. With this enabled, you can restrict requests from the component you identify as buggy based on the data in the monitoring chart. By stopping its requests, you protect the API Server, and with the server protected, you can proceed to fix the buggy component; if the server crashes, you cannot fix the bug at all. The reason to control traffic by User-Agent instead of by Identity is that resolving an Identity for each request can itself consume a lot of resources, which could cause the server to crash.
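A hedged sketch of User-Agent-based emergency traffic control: a middleware that matches the offending component's User-Agent string and rejects only that traffic with 429, leaving everything else untouched. The substring matching rule here is a stand-in for whatever Alibaba's real mechanism uses.

```go
package apiserverdemo

import (
	"net/http"
	"strings"
)

// limitByUserAgent rejects requests whose User-Agent contains any of the
// given substrings with a 429, so only the misbehaving component is
// throttled while all other clients keep working.
func limitByUserAgent(h http.Handler, banned ...string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		for _, b := range banned {
			if strings.Contains(r.UserAgent(), b) {
				http.Error(w, "traffic from this client is temporarily limited", http.StatusTooManyRequests)
				return
			}
		}
		h.ServeHTTP(w, r)
	})
}
```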
In a 10,000-node production cluster, one controller stores nearly one million objects. It consumes substantial overhead to retrieve these objects from the API server and deserialize them, and it may take several minutes to restart the controller for recovery.
To resolve the problem, Alibaba made the following two improvements:
(Preload informer) Controller informers are pre-started to pre-load the data required by the controller.
(Fail Over) When the active controller is upgraded, the Leader Lease is actively released so that the standby controller can immediately take over services from the active controller.
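client-go's leader election supports this hand-off pattern; the sketch below (lease name, namespace, and timings are illustrative, not Alibaba's actual configuration) releases the Lease on a clean shutdown via ReleaseOnCancel, so the hot-standby replica becomes leader immediately instead of waiting for the lease to expire.

```go
package main

import (
	"context"
	"os"
	"os/signal"
	"syscall"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	id, _ := os.Hostname()
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "my-controller", Namespace: "kube-system"},
		Client:     cs.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	// Cancel the context on SIGTERM (e.g. during an upgrade). With
	// ReleaseOnCancel the lease is released immediately, so the standby
	// replica becomes leader without waiting for the lease to expire.
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		sig := make(chan os.Signal, 1)
		signal.Notify(sig, syscall.SIGTERM)
		<-sig
		cancel()
	}()

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Informers were pre-started and synced before this point,
				// so the controller can begin reconciling immediately.
			},
			OnStoppedLeading: func() {},
		},
	})
}
```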
Alibaba uses its own self-developed scheduler architecture, so only two basic ideas are shared here:
Equivalence classes: A typical capacity-expansion request scales up multiple containers at a time, so the team divided the requests in the pending queue into equivalence classes and batch-processed them. This greatly reduced the number of predicate and priority computations.
Relaxed randomization: When a cluster provides many candidate nodes for one scheduling request, you only need to evaluate enough nodes to serve the request, rather than evaluating every node in the cluster. This improves scheduling performance at the expense of some scheduling accuracy.
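Alibaba's scheduler is proprietary, so the sketch below only illustrates the relaxed-randomization idea in generic Go (the Node type, the feasibility and scoring callbacks, and the sample size are all hypothetical): scan a random permutation of nodes and stop scoring once enough feasible candidates have been found.

```go
package scheddemo

import "math/rand"

// Node is a placeholder for whatever node representation the real scheduler uses.
type Node struct {
	Name             string
	FreeCPU, FreeMem int64
}

// pickNode scans a random permutation of nodes and stops once sampleSize
// feasible candidates have been scored, trading a little placement quality
// for a large reduction in predicate/priority work on big clusters.
func pickNode(nodes []Node, feasible func(Node) bool, score func(Node) int, sampleSize int) (best Node, ok bool) {
	perm := rand.Perm(len(nodes))
	bestScore := -1
	found := 0
	for _, i := range perm {
		n := nodes[i]
		if !feasible(n) {
			continue
		}
		if s := score(n); s > bestScore {
			bestScore, best, ok = s, n, true
		}
		found++
		if found >= sampleSize {
			break
		}
	}
	return best, ok
}
```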
The storage capacity of etcd was improved by separating indexes and data and sharding data. The performance of etcd in storing massive data was greatly improved by optimizing the block allocation algorithm of the bbolt db storage engine at the bottom layer of etcd. Last, the system architecture was simplified by building large-scale Kubernetes clusters based on a single etcd cluster.
The List performance bottleneck was removed to enable the stability of operation of 10,000-node Kubernetes clusters through a series of enhancements, such as Kubernetes lightweight heartbeats, improved load balancing among API servers in high-availability (HA) clusters, List-Watch bookmark, indexing, and caching.
Hot standby mode greatly reduced the service interruption time during active/standby failover on controllers and schedulers. This also improved cluster availability.
The performance of Alibaba-developed schedulers was effectively optimized through equivalence classes and relaxed randomization.