Scalability

Describing load

Describing performance

Latency and response time are different

Response time is what the client sees: besides the actual time to process the request, it includes network delays and queueing delays.

Latency is the duration for which a request is waiting to be handled, during which it is latent, awaiting service.

Using percentiles to measure performance

Percentile: p50 is also known as the median. If the p50 response time is 200ms, then half of the requests complete in less than 200ms and half take longer than that. In production, p99 and p999 are commonly used, denoting the 99th and 99.9th percentiles: if the p99 response time is at some threshold, then 99% of requests are faster than that threshold. The higher percentiles are also known as tail latencies.
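As a rough illustration (the response times below are synthetic), a percentile is just the value at a given rank in the sorted list of response times:

```python
import random

def percentile(values, p):
    """Nearest-rank p-th percentile (p in 0..100) of a list of values."""
    ordered = sorted(values)
    k = min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1)))
    return ordered[k]

# Synthetic response times in ms: mostly around 200 ms, plus a slow tail.
response_times = [random.gauss(200, 30) for _ in range(990)] + \
                 [random.uniform(800, 2000) for _ in range(10)]

for p in (50, 99, 99.9):
    print(f"p{p}: {percentile(response_times, p):.0f} ms")
```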

SLOs and SLAs: It is expensive to fix the slowest 1% of requests, since the root causes of that slowness can be random events outside our control. So most of the time, SLOs (service level objectives) and SLAs (service level agreements) define the expected performance in percentile terms instead.

Head-of-line blocking: if a server can process only a small number of requests in parallel, a few slow requests can hold up the processing of the requests queued behind them, slowing down the overall response time even when those subsequent requests themselves are fast to process.
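A toy simulation (all numbers below are made up) makes the effect visible: when a single-threaded server handles requests strictly in arrival order, the fast requests queued behind one slow request inherit its delay.

```python
# Toy model of head-of-line blocking: one request is served at a time,
# in arrival order, and all requests arrive at t = 0.
service_times = [10] * 5 + [1000] + [10] * 5  # ms; one slow request in the middle

clock = 0
for i, service in enumerate(service_times):
    clock += service  # each request first waits for everything ahead of it
    # response time = queueing delay + own processing time
    print(f"request {i}: service {service:>4} ms, response {clock:>4} ms")
```

Requests 6 through 10 each take only 10 ms of processing, yet their response times all exceed 1050 ms because of the one slow request ahead of them.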

Where to measure performance?

Server side:

  • Measure the request processing time (latency)

Client side:

  • Due to head-of-line blocking, it is important to also measure performance from the client side.

  • The client needs to keep sending requests independently, without waiting for the previous request to complete; otherwise the measurement keeps queues artificially shorter than they would be in reality (see the sketch after this list).
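One way to do this is an open-loop load generator, sketched below with `asyncio.sleep` standing in for a real network call: requests are fired on a fixed schedule, so slow responses cannot throttle the offered load.

```python
import asyncio
import time

async def send_request(i):
    # Placeholder for a real request; a real client would issue an
    # HTTP call here and time it the same way.
    start = time.monotonic()
    await asyncio.sleep(0.05)  # stand-in for network + server processing
    return time.monotonic() - start

async def open_loop_load(rate_per_sec=100, duration_sec=2):
    """Fire requests on a fixed schedule, independent of completions."""
    tasks, interval = [], 1 / rate_per_sec
    deadline = time.monotonic() + duration_sec
    i = 0
    while time.monotonic() < deadline:
        tasks.append(asyncio.create_task(send_request(i)))  # do not await here
        i += 1
        await asyncio.sleep(interval)  # keep the schedule regardless of replies
    return await asyncio.gather(*tasks)

response_times = asyncio.run(open_loop_load())
print(f"measured {len(response_times)} requests")
```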

How to measure in practice

Keep a rolling window of the response times of requests over some recent period, and compute percentiles over that window (a sliding window).

There are algorithms that can calculate a good approximation of percentiles at minimal CPU and memory cost, such as forward decay, t-digest, or HdrHistogram.
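For modest request volumes, an exact sliding window can be kept directly; the minimal sketch below (standard library only, naively re-sorting on every query) shows the idea, with the approximation algorithms above taking over once the window grows large.

```python
import time
from collections import deque

class SlidingWindowPercentiles:
    """Response times from the last `window_sec` seconds, with
    percentile queries over that window."""

    def __init__(self, window_sec=600):
        self.window_sec = window_sec
        self.samples = deque()  # (timestamp, response_time), in arrival order

    def record(self, response_time, now=None):
        now = time.monotonic() if now is None else now
        self.samples.append((now, response_time))
        # Evict samples that have fallen out of the window.
        while self.samples and self.samples[0][0] < now - self.window_sec:
            self.samples.popleft()

    def percentile(self, p):
        ordered = sorted(rt for _, rt in self.samples)
        if not ordered:
            return None
        # Nearest-rank selection, clamped to the last element.
        k = min(len(ordered) - 1, int(p / 100 * len(ordered)))
        return ordered[k]

window = SlidingWindowPercentiles(window_sec=600)
for rt in (120, 180, 200, 950, 210):  # hypothetical response times in ms
    window.record(rt)
print(window.percentile(50), window.percentile(99))  # -> 200 950
```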
