How Alibaba Ensures K8S Performance at Large Scale

Alibaba environment

Alibaba runs production Kubernetes clusters at the scale of 10,000 nodes. The performance testing tool used is Kubemark.

Problems encountered

etcd

  • Read and write latency

  • Memory limitation

  • Space limitation (etcd storage capacity)

API server

The API server encountered extremely high latency when querying pods and nodes, and the back-end etcd sometimes also ran out of memory due to concurrent query requests.

Controller

The controller encountered high processing latency because it couldn't detect the latest changes from the API server. After an abnormal restart, service recovery would often take several minutes.

Scheduler

The scheduler encountered high latency and low throughput.

Improvement to etcd

  1. Partition the different types of objects on the API server into different etcd clusters.

  2. Dump some of the data stored in etcd into a cache cluster, e.g. the heartbeat change log from the kubelet. In Alibaba's case, the cache is tair (an elastic cache platform).

  3. Improve the internal etcd implementation by making the underlying bbolt DB more robust (see the CNCF paper).

Improvement to API Server

Enhance the node heartbeat

The node heartbeat generates a large changelog in etcd and a lot of CPU overhead.

  • Use the Kubernetes built-in Lease API, whose objects are much smaller than Node objects (a sketch follows this list).

  • For compatibility, the Node object is still updated, but at a much lower frequency.
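
The Lease-based heartbeat is the built-in node-lease mechanism in the kube-node-lease namespace. As a rough illustration of why it is cheap, here is a minimal client-go sketch of renewing a node's Lease; the kubeconfig path and the node name node-1 are assumptions for the example:

```go
package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a clientset from the local kubeconfig (assumption: ~/.kube/config works).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ctx := context.Background()
	nodeName := "node-1" // hypothetical node name

	// Each node owns a small Lease object in the kube-node-lease namespace.
	// Renewing it only touches spec.renewTime, so the write is tiny compared
	// to rewriting the whole Node object.
	lease, err := client.CoordinationV1().Leases("kube-node-lease").Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	now := metav1.NewMicroTime(time.Now())
	lease.Spec.RenewTime = &now
	if _, err := client.CoordinationV1().Leases("kube-node-lease").Update(ctx, lease, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
}
```

Renewing the Lease rewrites only a tiny object, whereas a full Node status update rewrites conditions, images, and capacity, which is what generated the large etcd changelog in the first place.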

Load balancing for API Server

Although Kubernetes can be configured with multiple masters for HA, the load among the API servers may still be unbalanced. During upgrades or node restarts, traffic to the API servers becomes skewed.

  • Add a load balancer on the API server side

  • Add a load balancer on the kubelet side

In normal cases, load switching does not occur once a kubelet has connected to an API server, because in most cases the client reuses the same underlying TLS connection.

List-Watch Bookmark feature

Stage: Alpha

Feature group: API

A given Kubernetes server will only preserve a historical list of changes for a limited time. Clusters using etcd3 preserve changes in the last 5 minutes by default.

The "bookmark" watch event is used as a checkpoint, indicating that all objects up to a given resourceVersion that the client has requested have already been sent.

For example, if a given watch is requesting all the events starting with resourceVersion X and the API server knows that this watch is not interested in any of the events up to a much higher version, the API server can skip sending all these events by emitting a bookmark instead, avoiding unnecessary processing on both sides.

This proposal makes restarting watches cheaper from the kube-apiserver performance perspective.

Note that this was not invented by Alibaba; it is a SIG enhancement introduced in February 2019 (Add Watch bookmarks support, kubernetes/enhancements #956). It basically lets the API server push the latest resourceVersion value to the client at appropriate times, so that the client's version keeps up with version updates on the API server. This reduces "too old resource version" errors on the client and the frequency with which the client has to relist objects from the API server.
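
A hedged client-go sketch of how a client consumes bookmark events, assuming an already-constructed clientset; the function name is illustrative:

```go
package example

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// watchPodsWithBookmarks watches pods and returns the last resourceVersion it
// saw (from real events or bookmarks), which can be used to restart the watch
// cheaply instead of relisting everything.
func watchPodsWithBookmarks(ctx context.Context, client kubernetes.Interface, fromRV string) (string, error) {
	w, err := client.CoreV1().Pods(metav1.NamespaceAll).Watch(ctx, metav1.ListOptions{
		AllowWatchBookmarks: true, // opt in to bookmark events
		ResourceVersion:     fromRV,
	})
	if err != nil {
		return fromRV, err
	}
	defer w.Stop()

	lastRV := fromRV
	for event := range w.ResultChan() {
		switch event.Type {
		case watch.Bookmark:
			// Bookmark objects carry only metadata.resourceVersion.
			if pod, ok := event.Object.(*corev1.Pod); ok {
				lastRV = pod.ResourceVersion
			}
		case watch.Added, watch.Modified, watch.Deleted:
			if pod, ok := event.Object.(*corev1.Pod); ok {
				lastRV = pod.ResourceVersion
				fmt.Printf("%s pod %s/%s\n", event.Type, pod.Namespace, pod.Name)
			}
		}
	}
	return lastRV, nil
}
```

When the watch has to be restarted, passing the remembered resourceVersion avoids the expensive relist that would otherwise hit the API server.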

Cacher and Indexing

A client can query the API server directly, and the API server processes the query by reading data from etcd. This has the following issues:

  • No indexing support: a query for the pods on a node must first obtain all the pods in the cluster, which results in a huge overhead (see the indexer sketch after this list).

  • Pagination of large result sets: queries over massive data are completed through pagination, which results in multiple round trips that greatly reduce performance.

  • Quorum reads against etcd add further query overhead.
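
The indexing described here was added inside the API server, but the same idea can be illustrated client-side with client-go's informer indexers: keep an index keyed by spec.nodeName so "pods on node X" is a lookup rather than a full scan. A minimal sketch; the index name byNode is made up for illustration:

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

const byNodeIndex = "byNode" // illustrative index name

// newPodByNodeIndexer starts a pod informer whose local cache is indexed by
// spec.nodeName, so "all pods on node X" becomes an index lookup instead of a
// scan over every pod in the cluster.
func newPodByNodeIndexer(client kubernetes.Interface, stopCh <-chan struct{}) (cache.SharedIndexInformer, error) {
	factory := informers.NewSharedInformerFactory(client, 0)
	podInformer := factory.Core().V1().Pods().Informer()

	err := podInformer.AddIndexers(cache.Indexers{
		byNodeIndex: func(obj interface{}) ([]string, error) {
			pod, ok := obj.(*corev1.Pod)
			if !ok || pod.Spec.NodeName == "" {
				return nil, nil
			}
			return []string{pod.Spec.NodeName}, nil
		},
	})
	if err != nil {
		return nil, err
	}

	factory.Start(stopCh)
	cache.WaitForCacheSync(stopCh, podInformer.HasSynced)
	return podInformer, nil
}

// podsOnNode answers "which pods run on this node" from the local index.
func podsOnNode(informer cache.SharedIndexInformer, nodeName string) ([]interface{}, error) {
	return informer.GetIndexer().ByIndex(byNodeIndex, nodeName)
}
```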

Alibaba uses a "data collaboration" mechanism to reduce the time to describe a node from 5s to 0.3s. The workflow is as follows (a client-side sketch of the two read paths follows the steps):

  1. The client queries the API servers at time t0.

  2. The API servers request the current data version rv@t0 from etcd.

  3. The API servers update their request progress and wait until the data version of the reflector reaches rv@t0.

  4. A response is sent to the client through the cache.
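
The wait-for-fresh-cache behavior above is Alibaba's internal change; the closest knob an ordinary client controls is the resourceVersion parameter on list requests, which decides whether the API server must do a quorum read from etcd or may answer from its watch cache. A minimal sketch, assuming a configured clientset:

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listPods shows the two read paths the API server exposes to clients.
func listPods(ctx context.Context, client kubernetes.Interface) (*corev1.PodList, *corev1.PodList, error) {
	// Quorum read: with resourceVersion unset, the API server reads from etcd,
	// which is the expensive path called out above.
	fresh, err := client.CoreV1().Pods("default").List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, nil, err
	}

	// Cache read: resourceVersion "0" allows the API server to answer from its
	// watch cache (possibly slightly stale), skipping the etcd quorum read.
	cached, err := client.CoreV1().Pods("default").List(ctx, metav1.ListOptions{ResourceVersion: "0"})
	if err != nil {
		return nil, nil, err
	}
	return fresh, cached, nil
}
```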

Context-aware problem

When a client initiates a request, the API server receives and processes it. If the client terminates the request, the API server will keep processing it regardless, which wastes resources and builds up a backlog. If the client then initiates new requests, the API server still processes those as well, so it can easily run out of memory and crash.

Alibaba's solution: once the client's request ends, the API server should stop processing the request and reclaim its resources.
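
Conceptually this is request-scoped cancellation. A generic Go sketch (plain net/http, not the actual API server code) of a handler that stops working as soon as the client disconnects:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

// slowHandler simulates an expensive query but gives up as soon as the client
// disconnects, so abandoned requests do not keep consuming memory and CPU.
func slowHandler(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context() // cancelled automatically when the client goes away

	for i := 0; i < 10; i++ { // pretend each step pages through more data
		select {
		case <-ctx.Done():
			// Client is gone: stop processing and release resources.
			fmt.Println("request cancelled, aborting:", ctx.Err())
			return
		case <-time.After(200 * time.Millisecond):
			// ... do one chunk of the real work here ...
		}
	}
	fmt.Fprintln(w, "done")
}

func main() {
	http.HandleFunc("/slow", slowHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```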

Request flood

At Alibaba, API server crashes are mainly caused by:

  • Issues during API server restarts or updates: whenever the API server needs to be restarted or updated, all of the Kubernetes components must reconnect to it, which means each and every component sends requests to the API server at once.

To resolve this, the team at Alibaba used an "Active Reject" function. With this enabled, the API server rejects incoming requests beyond what its cache can handle. A rejection returns a 429 error code to the client, which causes the client to wait some time before sending a new request. Generally speaking, this optimizes the I/O between the API server and clients and reduces the chance of the API server crashing.

  • A client bug that causes an abnormal number of requests: one of the most typical examples is a DaemonSet component bug that multiplies the number of requests by the number of nodes, which in turn ends up crashing the API server.

To resolve this problem, you can use emergency traffic control based on User-Agent to dynamically limit requests by their source. With this enabled, you can restrict requests from the component you have identified as buggy, based on the data collected in the monitoring charts. By stopping its requests you protect the API server, and with the server protected you can proceed to fix the buggy component; if the server crashes, you would not be able to fix the bug at all. Traffic is controlled by User-Agent instead of Identity because resolving the Identity of a request itself consumes a lot of resources, which could cause the server to crash. A sketch of such a filter follows.
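
Alibaba has not published the implementation; as an illustration only, a User-Agent based emergency limiter can be sketched as an HTTP filter that answers 429 (the same signal Active Reject uses) for an over-limit source. The agent prefix and limits below are made up:

```go
package main

import (
	"log"
	"net/http"
	"strings"

	"golang.org/x/time/rate"
)

// userAgentLimit throttles requests whose User-Agent matches a misbehaving
// component, returning 429 instead of letting them reach the backend.
func userAgentLimit(agentPrefix string, limiter *rate.Limiter, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if strings.HasPrefix(r.UserAgent(), agentPrefix) && !limiter.Allow() {
			http.Error(w, "throttled by emergency traffic control", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	backend := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	// Hypothetical buggy DaemonSet client: allow at most 5 requests/s, burst 10.
	limited := userAgentLimit("buggy-daemonset/", rate.NewLimiter(5, 10), backend)
	log.Fatal(http.ListenAndServe(":8080", limited))
}
```

A well-behaved client treats the 429 as a signal to back off before retrying, which is exactly the behavior Active Reject relies on.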

Improvement to controller

In a 10,000-node production cluster, one controller stores nearly one million objects. It consumes substantial overhead to retrieve these objects from the API server and deserialize them, and it may take several minutes to restart the controller for recovery.

To resolve the problem, Alibaba made the following two improvements:

  • (Preload informer) Controller informers are pre-started to pre-load the data required by the controller.

  • (Fail over) When the active controller is upgraded, the leader lease is actively released so that the standby controller can immediately take over services from the active controller (see the sketch below).
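
A hedged client-go sketch of both ideas: informers are started (pre-loaded) before leadership is acquired, and the Lease-based leader lock is configured with ReleaseOnCancel so a graceful shutdown hands the lease to the standby immediately instead of waiting for it to expire. Names such as example-controller are illustrative:

```go
package example

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runController pre-loads informer caches before leadership is acquired and
// releases the leader lease on cancellation so the standby takes over at once.
func runController(ctx context.Context, client kubernetes.Interface) {
	// Pre-load: warm the local cache while still in standby, so failover does
	// not spend minutes listing and deserializing objects from the API server.
	factory := informers.NewSharedInformerFactory(client, 0)
	podInformer := factory.Core().V1().Pods().Informer()
	factory.Start(ctx.Done())
	cache.WaitForCacheSync(ctx.Done(), podInformer.HasSynced)

	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "example-controller", Namespace: "kube-system"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	}

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true, // hand the lease to the standby on graceful shutdown
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Start the reconcile loops here; caches are already synced.
			},
			OnStoppedLeading: func() {
				// Lost or released the lease; stop doing work.
			},
		},
	})
}
```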

Customized schedulers

Alibaba uses its own self-developed scheduler architecture, so only two basic ideas are shared here:

  1. Equivalence classes: a typical capacity expansion request is intended to scale up multiple containers at a time. Therefore, the team divided the requests in the pending queue into equivalence classes to batch-process them, which greatly reduced the number of predicate and priority evaluations.

  2. Relaxed randomization: when a cluster provides many candidate nodes for one scheduling request, you only need to select enough nodes to process the request, without evaluating every node in the cluster. This improves scheduling performance at the expense of some scheduling accuracy (a toy sketch follows).
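
Not Alibaba's scheduler code; a toy Go sketch of relaxed randomization that scores only a random sample of the feasible nodes and takes the best of that sample. The sample size is illustrative:

```go
package example

import "math/rand"

// Node is a stand-in for the scheduler's real node type.
type Node struct{ Name string }

// pickNodeRelaxed scores only up to sampleSize randomly chosen feasible nodes
// and returns the best of that sample, instead of evaluating the whole cluster.
func pickNodeRelaxed(feasible []Node, sampleSize int, score func(Node) int) (Node, bool) {
	if len(feasible) == 0 {
		return Node{}, false
	}
	if sampleSize < 1 {
		sampleSize = 1
	}
	if sampleSize > len(feasible) {
		sampleSize = len(feasible)
	}
	// Partial Fisher-Yates shuffle: only the first sampleSize positions matter.
	for i := 0; i < sampleSize; i++ {
		j := i + rand.Intn(len(feasible)-i)
		feasible[i], feasible[j] = feasible[j], feasible[i]
	}
	best, bestScore := feasible[0], score(feasible[0])
	for _, n := range feasible[1:sampleSize] {
		if s := score(n); s > bestScore {
			best, bestScore = n, s
		}
	}
	return best, true
}
```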

Conclusion

  1. The storage capacity of etcd was improved by separating indexes and data and sharding data. The performance of etcd in storing massive data was greatly improved by optimizing the block allocation algorithm of the bbolt db storage engine at the bottom layer of etcd. Last, the system architecture was simplified by building large-scale Kubernetes clusters based on a single etcd cluster.

  2. The List performance bottleneck was removed to enable the stability of operation of 10,000-node Kubernetes clusters through a series of enhancements, such as Kubernetes lightweight heartbeats, improved load balancing among API servers in high-availability (HA) clusters, List-Watch bookmark, indexing, and caching.

  3. Hot standby mode greatly reduced the service interruption time during active/standby failover on controllers and schedulers. This also improved cluster availability.

  4. The performance of Alibaba-developed schedulers was effectively optimized through equivalence classes and relaxed randomization.

References

  • How Does Alibaba Ensure the Performance of System Components in a 10,000-node Kubernetes Cluster? (Alibaba Cloud Community)

  • Add Watch bookmarks support (kubernetes/enhancements #956)

  • Kubemark (Kubernetes performance testing tool)