Performance Under Load: Latency, Throughput, Percentiles, and Why Fast Systems Suddenly Fail - AlexWebLab in Bangkok, Thailand now, before in Hong Kong 香港

Performance discussions get vague very quickly.

Teams say things like "the API is slow," "the database is under load," or "we need better performance" as if all three statements point to the same problem. They usually do not.

This article makes a more useful distinction: performance is a relationship between how long work takes, how much work the system is handling, and what happens as demand approaches the system's limits.

That sounds abstract until you put it in a product context.

Imagine a social app where each user expects their home feed to update within a few seconds after someone they follow posts. The first naive design is obvious: every time the client refreshes, query recent posts from everybody the user follows, sort them, and return the top results.

That can work when the product is small. Under heavier load, the cost becomes brutal. A read path that felt simple at first now means repeated fan-in queries, constant polling, and an explosion of database work just to keep one screen feeling fresh.
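To make the cost of that naive design concrete, here is a minimal sketch of the read path. The `Post` shape, the in-memory arrays, and the name `readFeedNaive` are hypothetical stand-ins for a real posts table and follow graph, not something the article specifies.

```typescript
// Hypothetical in-memory stand-in for a posts table.
interface Post {
  authorId: string;
  createdAt: number; // epoch milliseconds
  text: string;
}

// Naive read path: on every refresh, gather posts from everyone the user
// follows, sort them newest first, and return the top results.
function readFeedNaive(
  following: ReadonlySet<string>,
  allPosts: readonly Post[],
  limit: number,
): Post[] {
  return allPosts
    .filter((post) => following.has(post.authorId)) // fan-in across followed authors
    .sort((a, b) => b.createdAt - a.createdAt) // newest first
    .slice(0, limit);
}
```

Every refresh repeats the filter and the sort, so the work grows with the number of followed accounts, the number of posts, and the polling rate all at once.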

The lesson is broader than feeds: many performance problems are really architecture problems about access patterns, queueing, and load shape.

Response Time and Throughput Are Different Questions

Two metrics drive most performance conversations:

response time, meaning how long one request takes from the user's perspective,
throughput, meaning how many requests, jobs, or units of work the system can process per second.

Those metrics are related, but they are not interchangeable.

A system can have acceptable response time at low traffic and still collapse as throughput increases. A system can also push a lot of total work through the hardware while still producing a miserable user experience for the slowest requests.

That distinction matters to frontend engineers too.

If a page load usually completes in 180 milliseconds but occasionally spikes to 4 seconds, users do not experience "average performance." They experience the request that just happened to them.

That is why product performance work should always ask two questions:

How fast is one request?
What happens to that speed as more requests arrive?

If you only ask the first question, you miss the actual failure mode.

Queueing Is Why Systems Fail Gradually and Then All at Once

The most important curve here is not exotic. Response time stays relatively calm until throughput approaches system capacity, and then latency rises sharply because requests begin waiting.

That waiting time is queueing delay.

Even if the service time for one request has not changed much, the total response time can climb because work is sitting in line behind other work.

request arrives
  -> waits for CPU, worker, DB connection, or downstream dependency
  -> starts active processing
  -> waits again on network or another service
  -> returns to user

This is one reason local benchmark numbers mislead teams. A handler may look efficient in isolation. Under realistic traffic, the request is not alone. It competes for workers, database connections, network bandwidth, cache capacity, and downstream rate limits.

The system becomes slow not because the code forgot how to execute, but because too many requests are contending for the same finite path.
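One way to see the shape of that curve is a textbook single-server queueing model (M/M/1), where mean response time equals service time divided by one minus utilization. That model is an assumption brought in for illustration; real systems have several workers and burstier traffic, so treat the numbers as a sketch of the shape rather than a prediction.

```typescript
// Mean response time in a simple M/M/1 queueing model (an assumed model):
// responseTime = serviceTime / (1 - utilization).
function meanResponseTimeMs(serviceTimeMs: number, utilization: number): number {
  if (utilization < 0 || utilization >= 1) {
    throw new RangeError("utilization must be in [0, 1)");
  }
  return serviceTimeMs / (1 - utilization);
}

// The service time stays at 20 ms; only the waiting grows.
for (const utilization of [0.5, 0.8, 0.9, 0.95, 0.99]) {
  const ms = meanResponseTimeMs(20, utilization);
  console.log(`${(utilization * 100).toFixed(0)}% busy -> ${ms.toFixed(0)} ms`);
}
```

Under these assumptions a 20 ms handler answers in about 40 ms at 50% utilization and about 2,000 ms at 99%, even though the code itself never got slower.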

Overload Is Often Amplified by Users and Retries

When a system gets close to its limit, it can enter a nasty feedback loop.

Requests slow down. Clients time out. Retries fire. The system receives even more work than before. Recovery gets harder precisely because the service is already struggling.

That is why overloaded systems sometimes do not recover smoothly when load drops again. They can remain stuck in a degraded state until queues drain, workers restart, or operators intervene.

This matters on the web because frontend behavior is often part of the overload story.

Examples:

a client retries automatically on a short timeout,
several components independently refetch the same data,
a background polling loop keeps running during degraded conditions,
optimistic UI triggers a second write because the first response looked lost.

Each behavior can be reasonable in isolation. Together, they can turn a slow system into a self-inflicted denial of service.
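The duplicated-refetch case has a small client-side mitigation: let callers that ask for the same data share one in-flight request. The sketch below assumes a caller-supplied `fetcher` function and a string cache key; the names `inFlight` and `dedupedFetch` are hypothetical.

```typescript
// Requests currently on the wire, keyed by a caller-chosen string.
const inFlight = new Map<string, Promise<unknown>>();

// Share one in-flight request among callers that ask for the same key.
// `fetcher` is a hypothetical function supplied by the caller.
function dedupedFetch<T>(key: string, fetcher: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key);
  if (existing !== undefined) {
    return existing as Promise<T>; // join the request that is already running
  }
  const request = fetcher();
  inFlight.set(key, request);
  // Forget the entry once the request settles, whether it succeeded or failed.
  const forget = (): void => {
    inFlight.delete(key);
  };
  request.then(forget, forget);
  return request;
}
```

This only collapses concurrent duplicates. It is not a cache, so a later call after the first one settles still reaches the server.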

The useful question is not "should we retry?" It is "what retry pattern helps users without multiplying stress on a service that is already failing?"

That is why techniques like exponential backoff, jitter, circuit breakers, load shedding, and backpressure are not backend trivia. They are ways of preventing recovery from becoming harder than the original incident.
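As an illustration of the first two techniques, here is a minimal sketch of exponential backoff with "full jitter", where each retry waits a random time between zero and a capped exponential bound. The defaults (`baseMs`, `capMs`, `maxAttempts`) and the function names are illustrative assumptions, not recommended values.

```typescript
// Delay before retry number `attempt` (0-based): a random value between 0 and
// min(capMs, baseMs * 2^attempt). `random` is injectable so the function is testable.
function backoffDelayMs(
  attempt: number,
  baseMs = 200,
  capMs = 10000,
  random: () => number = Math.random,
): number {
  const bound = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * bound);
}

// Retry an async operation a bounded number of times, sleeping between attempts.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxAttempts = 4,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts - 1) {
        const delay = backoffDelayMs(attempt);
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError; // give up instead of retrying forever
}
```

The randomness spreads clients out so they do not all retry at the same instant, and the attempt limit keeps a failing service from receiving unbounded extra traffic.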

Latency Is Not the Same Thing as Response Time

Teams often use latency and response time loosely. The distinction in this article is more precise and more helpful.

response time is the full time the client sees,
service time is the time the system is actively working,
latency often refers to the time a request spends waiting rather than being actively processed,
network latency is specifically travel time across the network,
queueing delay is waiting before work can continue.

That breakdown explains why a request can feel slow even when the business logic itself is not particularly expensive.

For browser applications, this is a recurring trap. Engineers profile a resolver or route handler and find that the core computation is fast, yet the user still waits. The missing time is often somewhere else:

TLS and connection setup,
congested network links,
a queue inside an API gateway,
a saturated database connection pool,
a slow downstream call,
head-of-line blocking caused by a few outlier requests.

This is why it is dangerous to reduce performance work to one flamegraph or one local benchmark. You need to know where the request spent its time, not just which function consumed CPU.
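A rough client-side version of "where did the time go" can be sketched from request timestamps. The field names below mirror those on the browser's Resource Timing entries, but the interface is declared locally as an assumption so the example does not depend on DOM typings; `breakDown` and the phase names are hypothetical.

```typescript
// Timestamps in milliseconds, named after fields found on the browser's
// PerformanceResourceTiming entries (declared locally for this sketch).
interface RequestTimestamps {
  startTime: number;
  connectStart: number;
  connectEnd: number;
  requestStart: number;
  responseStart: number;
  responseEnd: number;
}

interface Breakdown {
  beforeConnectMs: number; // time before the connection attempt begins
  connectionSetupMs: number; // TCP and TLS handshakes
  beforeSendMs: number; // gap between connection ready and request sent
  waitingForFirstByteMs: number; // network round trip plus server queueing and service time
  downloadMs: number; // receiving the response body
  totalMs: number; // what the user actually waited
}

function breakDown(t: RequestTimestamps): Breakdown {
  return {
    beforeConnectMs: t.connectStart - t.startTime,
    connectionSetupMs: t.connectEnd - t.connectStart,
    beforeSendMs: t.requestStart - t.connectEnd,
    waitingForFirstByteMs: t.responseStart - t.requestStart,
    downloadMs: t.responseEnd - t.responseStart,
    totalMs: t.responseEnd - t.startTime,
  };
}
```

From the client, the wait for the first byte lumps network latency, queueing delay, and service time together. Separating those three generally needs timing data recorded on the server side as well.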

Averages Hide the Performance Story Users Actually Feel

If response times vary from request to request, then performance is a distribution, not a single number.

The average is useful for some capacity estimates, but it is a weak description of user experience.

Suppose 95 requests return in 120 milliseconds and 5 requests return in 2.8 seconds. The average works out to 254 milliseconds, which may still look respectable. That does not mean the product feels respectable.

This is why percentiles matter so much.

p50 tells you what the median experience looks like,
p95 tells you how bad the slowest common requests are,
p99 tells you how ugly the tail gets,
p99.9 can matter for very sensitive systems, though chasing it too aggressively can become disproportionately expensive.
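The 95-fast, 5-slow example above can be worked through in a few lines. This sketch uses the nearest-rank definition of a percentile, which is one common convention among several; monitoring tools may interpolate differently.

```typescript
function mean(values: readonly number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(values: readonly number[], p: number): number {
  if (values.length === 0) throw new RangeError("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// 95 requests at 120 ms and 5 requests at 2.8 s, as in the example above.
const samples: number[] = [
  ...new Array<number>(95).fill(120),
  ...new Array<number>(5).fill(2800),
];

console.log("mean", mean(samples));
console.log("p50", percentile(samples, 50));
console.log("p95", percentile(samples, 95));
console.log("p99", percentile(samples, 99));
```

With this data the mean is 254 ms and both p50 and p95 read 120 ms, because exactly 95% of requests are fast. Only p99 surfaces the 2.8 second path, which is a reminder that the choice of percentile decides which users you can see.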

Percentiles change how teams think.

Instead of asking, "how fast is the service on average?" you ask, "how many users are being exposed to the slow path, and how slow is that path?"

That is a much more honest product question.

Tail Latency Gets Worse in Composed Systems

One slow dependency can dominate an entire user request.

If an end-user action fans out into several backend calls, the request often has to wait for the slowest of them. This means even a small rate of slow sub-requests can produce a much larger rate of visibly slow end-user experiences.

That effect matters everywhere modern web applications aggregate data:

dashboards pulling from several internal services,
server-rendered pages stitching together multiple APIs,
personalization layers calling search, profile, and recommendation systems,
checkout flows waiting on fraud, tax, inventory, and payment dependencies.

This is one reason architecture diagrams with many apparently clean layers can still produce sloppy UX. The tail compounds.

A dependency that is slow only at its p99 can make the whole page slow at p95 or worse once enough parallel or serial work is chained together.
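The compounding can be estimated with basic probability. The sketch below assumes every backend call is slow independently with the same probability, which is a simplification; in real systems slowness is often correlated.

```typescript
// Probability that a user request touches at least one slow backend call,
// assuming `calls` independent calls that are each slow with probability
// `slowFraction`. Independence is a simplifying assumption.
function slowRequestProbability(slowFraction: number, calls: number): number {
  return 1 - Math.pow(1 - slowFraction, calls);
}

// Each call is slow 1% of the time (in other words, it is at its p99).
for (const calls of [1, 5, 10, 100]) {
  const percent = slowRequestProbability(0.01, calls) * 100;
  console.log(`${calls} calls -> ${percent.toFixed(1)}% of user requests are slow`);
}
```

Under that assumption, a 1% slow rate per call becomes roughly 4.9% of user requests at 5 calls, 9.6% at 10 calls, and about 63% at 100 calls.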

Materialized Views Are a Performance Trade, Not a Magic Trick

The social feed example in this article illustrates a classic move: stop computing the entire read result from scratch every time, and instead precompute some of the read model.

That is materialization.

Instead of assembling the home timeline on each request, you can push new posts into follower-specific timelines ahead of time and then serve reads from a cache-like, precomputed structure.

That buys you faster reads, but not for free.

You are exchanging read-time cost for write-time cost.

The moment you do that, several trade-offs appear:

writes now fan out to more places,
popular accounts can generate extreme update pressure,
the system needs to tolerate lag or partial catch-up under spikes,
derived data becomes another thing that can go stale or fail to update cleanly.
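A minimal in-memory sketch of that trade, fan-out on write, might look like the following. The `TimelineStore` class, its method names, and the 100-entry cap are all hypothetical; a real system would use durable storage and would need a separate strategy for very popular accounts.

```typescript
interface TimelineEntry {
  authorId: string;
  createdAt: number; // epoch milliseconds
  text: string;
}

class TimelineStore {
  private readonly followers = new Map<string, Set<string>>();
  private readonly timelines = new Map<string, TimelineEntry[]>();
  private readonly maxEntries: number;

  constructor(maxEntries = 100) {
    this.maxEntries = maxEntries;
  }

  follow(followerId: string, authorId: string): void {
    const set = this.followers.get(authorId) ?? new Set<string>();
    set.add(followerId);
    this.followers.set(authorId, set);
  }

  // Write path: one post becomes one timeline insert per follower.
  // Assumes posts are published in time order, so unshift keeps newest first.
  publish(entry: TimelineEntry): number {
    const audience = this.followers.get(entry.authorId) ?? new Set<string>();
    audience.forEach((followerId) => {
      const timeline = this.timelines.get(followerId) ?? [];
      timeline.unshift(entry);
      timeline.length = Math.min(timeline.length, this.maxEntries); // drop oldest
      this.timelines.set(followerId, timeline);
    });
    return audience.size; // number of timeline writes this post caused
  }

  // Read path: no joins or sorting, just a lookup and a slice.
  readTimeline(userId: string, limit: number): TimelineEntry[] {
    return (this.timelines.get(userId) ?? []).slice(0, limit);
  }
}
```

The value returned by `publish` makes the write amplification visible: an account with a million followers turns one post into a million timeline writes.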

This is a pattern frontend engineers feel constantly even if they never use the term materialized view.

Any time a product surface reads from a precomputed index, a cached projection, or a delayed analytics store, the UI is living on a performance trade:

faster reads,
but more complexity around freshness and consistency.

That should influence copy, interaction design, and user expectations.

SLOs Need Metrics That Match User Experience

The same reasoning explains why percentiles show up in service level objectives (SLOs).

A performance target like "average latency under 200 milliseconds" sounds disciplined, but it can still hide a large number of bad experiences. An SLO such as "p99 under 1 second" is often closer to how users actually feel the system.

The exact target depends on the product, but the principle is stable: the metric should reflect the thing you truly care about.

If the product requirement is that users should not wait excessively for page data, then the performance target should not be a number that remains healthy while the tail is burning.

This also applies to frontend monitoring.

If you only track a mean API duration or a mean page load time, you can talk yourself into thinking the system is healthy while a nontrivial portion of users keeps getting the bad path.
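The gap between the two kinds of target is easy to demonstrate. In this sketch a percentile target such as "p99 within 1 second" is read as "at least 99% of requests finished within 1 second"; the function names and the sample data are made up for illustration.

```typescript
function meetsMeanTarget(durationsMs: readonly number[], targetMs: number): boolean {
  const total = durationsMs.reduce((sum, d) => sum + d, 0);
  return total / durationsMs.length <= targetMs;
}

// True when at least `percent` percent of requests finished within thresholdMs.
function meetsPercentileTarget(
  durationsMs: readonly number[],
  percent: number,
  thresholdMs: number,
): boolean {
  const within = durationsMs.filter((d) => d <= thresholdMs).length;
  return within * 100 >= percent * durationsMs.length;
}

// Made-up sample: 98 requests at 100 ms and 2 requests at 3 seconds.
const durations: number[] = [
  ...new Array<number>(98).fill(100),
  ...new Array<number>(2).fill(3000),
];

console.log("mean under 200 ms:", meetsMeanTarget(durations, 200));
console.log("p99 within 1 s:", meetsPercentileTarget(durations, 99, 1000));
```

Here the mean is 158 ms, so the average-based target passes, while the p99 target fails because 2% of requests took 3 seconds.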

Measuring Percentiles Requires the Right Aggregation Model

One subtle but important warning: you cannot average percentile values across machines or time buckets and still trust the result.

If one server sees very different behavior from another, or one minute is much worse than the next, averaging percentile values can flatten away the truth.

That is why teams use streaming or histogram-based approaches to estimate percentiles over rolling windows.

The practical lesson is simple: the way you aggregate metrics changes the story you think the system is telling.

Bad aggregation can make bad performance look acceptable.
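A small sketch shows the difference between averaging percentiles and merging histograms. It assumes every server records latencies into the same fixed buckets and estimates a percentile as the upper bound of the bucket that contains the target rank. The bucket layout and the traffic numbers are invented for illustration.

```typescript
// Upper bounds (ms) of latency buckets shared by every server (assumed layout).
const BUCKET_UPPER_BOUNDS_MS = [50, 100, 250, 500, 1000, 2500, 5000];

type Histogram = number[]; // request counts per bucket, in the same order

// Histograms with identical buckets merge by adding counts.
function mergeHistograms(a: Histogram, b: Histogram): Histogram {
  return a.map((count, i) => count + b[i]);
}

// Estimate a percentile as the upper bound of the bucket holding the target rank.
function estimatePercentile(histogram: Histogram, p: number): number {
  const total = histogram.reduce((sum, c) => sum + c, 0);
  if (total === 0) throw new RangeError("empty histogram");
  const targetRank = Math.ceil((p * total) / 100);
  let seen = 0;
  for (let i = 0; i < histogram.length; i++) {
    seen += histogram[i];
    if (seen >= targetRank) return BUCKET_UPPER_BOUNDS_MS[i];
  }
  return BUCKET_UPPER_BOUNDS_MS[BUCKET_UPPER_BOUNDS_MS.length - 1];
}

// A quiet server: 100 requests, all within 100 ms.
const quietServer: Histogram = [0, 100, 0, 0, 0, 0, 0];
// An overloaded server: 10,000 requests, all between 1 s and 2.5 s.
const busyServer: Histogram = [0, 0, 0, 0, 0, 10000, 0];

const averagedP99 =
  (estimatePercentile(quietServer, 99) + estimatePercentile(busyServer, 99)) / 2;
const mergedP99 = estimatePercentile(mergeHistograms(quietServer, busyServer), 99);

console.log({ averagedP99, mergedP99 });
```

Averaging the two per-server values gives 1,300 ms, while the merged histogram reports 2,500 ms, because almost all of the traffic went through the slow server. The averaged figure gives each server equal weight regardless of how many requests it handled.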

What Frontend and JavaScript Engineers Should Do With This

You do not need to run the databases to apply these ideas.

You can use them by changing the questions you ask:

Is this page slow because active work is expensive, or because the request spends too much time waiting?
Are we optimizing for the average while ignoring the tail?
Could retries, polling, or duplicated fetches be worsening overload?
Is this read path expensive because we compute it live every time?
Which user-facing promise depends on stale or precomputed data?

Those questions move performance work out of generic complaint mode and into system diagnosis.

They also improve product honesty. If a dashboard updates hourly, say so. If a search result is eventually consistent, design around that. If a page assembles data from six services, assume the tail matters.

Conclusion

Performance is not one number and not one bottleneck.

It is the interaction between response time, throughput, queueing, tail behavior, retries, and the architectural shape of the work itself. Systems often seem fast right up until contention, overload, or composition makes them suddenly feel broken.

That is why good performance engineering is not just about making one query faster. It is about understanding what kind of work the system is doing, how it behaves near its limits, and which metrics actually reflect the experience users live through.