Reliability Beyond Uptime: Fault Tolerance, Human Error, and the Real Cost of Failure - AlexWebLab in Bangkok, Thailand now, before in Hong Kong 香港

Reliability is often reduced to a dashboard with green checks on it.

That is too small a definition.

Users do not care whether your instances are technically running if the product behaves unpredictably, corrupts data, leaks private information, or quietly makes wrong decisions. A system can be "up" and still be untrustworthy.

This article uses a broader and more realistic framing: a reliable system continues to work correctly even when things go wrong.

That immediately raises a harder question. What counts as "working correctly"?

The answer is bigger than uptime.

Reliability Includes Correctness, Performance, and Abuse Resistance

This article starts with a practical set of expectations people usually have for software:

it does the job users expect,
it tolerates mistakes and unexpected usage,
its performance is good enough for the workload,
it prevents unauthorized access and abuse.

That last item matters more than many architecture discussions admit.

If a system responds quickly but exposes sensitive data, it is not reliable in any meaningful sense. If an application stays online but produces wrong financial results, it is not reliable. If a product works for ideal flows but falls apart when users retry, navigate oddly, or submit duplicate requests, it is not reliable either.

Reliability is therefore a product property as much as an infrastructure property.

A Fault Is Not the Same Thing as a Failure

One of the cleanest distinctions in this article is between a fault and a failure.

A fault is when one component behaves incorrectly.
A failure is when the system as a whole stops providing the required service.

This sounds semantic until you need to design around it.

If one disk in a replicated storage system dies, that disk has faulted. If the overall service continues operating correctly, the product has not failed. If one worker crashes but another takes over, that is a fault tolerated by the larger system.

This is the core of fault tolerance: the ability to absorb certain component-level problems without turning them into user-visible service failure.
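
The fault/failure distinction can be sketched in a few lines. This is a minimal illustration, not a real storage client: `Replica` and `readWithFailover` are hypothetical names, and each replica is modeled as a function that may throw. A throwing replica counts as a fault; the caller only sees a failure when every replica throws.

```typescript
// Hypothetical types and names, for illustration only.
type Replica<T> = { name: string; read: () => T };

interface ReadResult<T> {
  value: T;
  servedBy: string;
  faults: string[]; // replicas that faulted before one succeeded
}

// Try each replica in order. One replica throwing is a tolerated fault.
// Only when all replicas throw does the system-level read fail.
function readWithFailover<T>(replicas: Replica<T>[]): ReadResult<T> {
  const faults: string[] = [];
  for (const replica of replicas) {
    try {
      return { value: replica.read(), servedBy: replica.name, faults };
    } catch {
      faults.push(replica.name);
    }
  }
  throw new Error(`all replicas failed: ${faults.join(", ")}`);
}

// disk-a faults, disk-b serves the read, and the caller still gets a value.
const result = readWithFailover<string>([
  { name: "disk-a", read: () => { throw new Error("disk dead"); } },
  { name: "disk-b", read: () => "order #1001" },
]);
console.log(result.servedBy, result.faults);
```

Returning the list of faults alongside the value is a deliberate choice in this sketch: a tolerated fault should still be visible to whoever operates the system.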

That shift in thinking is useful for frontend engineers too. Many UI states exist to express exactly this distinction: one part of the page is unavailable while the rest still works.

If one recommendation service is unavailable but checkout still works, the product should degrade gracefully rather than failing wholesale. Good product architecture treats partial failure as normal, not exceptional.
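
A rough sketch of that degradation, assuming a page that needs a cart total (critical) and recommendations (optional). The names `loadCheckoutView`, `fetchCartTotal`, and `fetchRecommendations` are invented for this example and do not refer to any real API.

```typescript
// Hypothetical shapes; no real API is implied.
interface CheckoutView {
  cartTotal: number;
  recommendations: string[];
  degraded: boolean; // true when the optional section could not be loaded
}

async function loadCheckoutView(
  fetchCartTotal: () => Promise<number>,
  fetchRecommendations: () => Promise<string[]>,
): Promise<CheckoutView> {
  // Optional dependency: convert a rejection into an empty fallback right away,
  // so a recommendation outage can never reject the whole page load.
  const recs = fetchRecommendations().then(
    (items) => ({ items, ok: true }),
    () => ({ items: [] as string[], ok: false }),
  );
  // Critical dependency: if this rejects, the error propagates to the caller.
  const cartTotal = await fetchCartTotal();
  const { items, ok } = await recs;
  return { cartTotal, recommendations: items, degraded: !ok };
}

// Recommendations are down, checkout still renders.
loadCheckoutView(
  async () => 42.5,
  async () => { throw new Error("recommendation service unavailable"); },
).then((view) => console.log(view));
```

The design choice worth noticing is that the code states which dependency is allowed to fail. The `degraded` flag lets the UI say so instead of silently showing an empty section.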

Fault Tolerance Is About Scope and Boundaries

A system is fault-tolerant only with respect to the faults it is actually designed to handle.

That qualification matters. People speak about fault tolerance as if it were a badge. It is really a budgeted promise: it covers a named set of faults, up to a stated limit, and nothing beyond that.

You can tolerate one machine dying. You can tolerate one availability zone failing. You can tolerate a queue backing up for a while. You cannot tolerate every component failing at once, and you definitely cannot tolerate conditions you never modeled.

That is why reliability work starts by naming the specific classes of fault that matter:

disk failures,
machine crashes,
network partitions,
overloaded dependencies,
corrupted responses,
delayed events,
operator mistakes,
security incidents.

Without that specificity, reliability goals stay rhetorical.

Single Points of Failure Are Often Product Choices

A single point of failure is not always a machine. It can be any component whose failure escalates into a system-level outage.

Examples:

one database with no failover,
one queue that everything depends on,
one region serving all traffic,
one authentication dependency with no degraded mode,
one brittle internal service that gates many product flows.

Some single points of failure are acceptable, especially when a system is small and simplicity matters more than redundancy. The problem is not their existence alone. The problem is pretending they do not exist.

If you know which component is a hard stop for the product, you can make an explicit decision about whether that risk is acceptable for now.

Hardware Faults Are Normal at Scale

When people imagine system failure, they often picture something dramatic. In reality, many failures are boring and statistically inevitable.

Disks fail. RAM gets corrupted. CPUs occasionally compute the wrong result. Power supplies die. Entire datacenters can disappear behind power or network incidents.

The point in this article is not that every product must fear apocalypse. It is that hardware unreliability is part of normal system operation once enough machines are involved.

This is why redundancy exists:

replicated data,
redundant power,
failover nodes,
multi-zone placement,
services that can restart elsewhere.

The important nuance is that redundancy works best when failures are independent. Real systems do not always grant that. Hardware faults can be correlated by rack, provider zone, firmware version, or environmental event.

That is one reason "we have replicas" is not the same thing as "we are safe."
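
A small piece of arithmetic shows why independence matters so much. The numbers below are made up for illustration: each replica is assumed to be down with probability 1%, and a shared cause (rack, zone, firmware) is assumed to take out every replica at once with probability 0.1%. Both function names are hypothetical.

```typescript
// Probability that every replica is down at once, assuming each replica
// fails independently with probability p.
function allFailIndependent(p: number, replicas: number): number {
  return Math.pow(p, replicas);
}

// Same question, but with a shared cause that takes out all replicas together
// with probability `shared`. Either the shared cause fires, or it does not
// and the replicas must all fail independently.
function allFailWithSharedCause(p: number, replicas: number, shared: number): number {
  return shared + (1 - shared) * Math.pow(p, replicas);
}

// Illustrative numbers only: 1% per replica, 3 replicas, 0.1% shared cause.
const independent = allFailIndependent(0.01, 3);
const correlated = allFailWithSharedCause(0.01, 3, 0.001);
console.log(independent, correlated);
```

Under these assumptions, three independent replicas are all down about one time in a million. Once the shared cause is included, the figure is about one in a thousand, and adding more replicas barely changes it.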

Software Faults Are Often More Dangerous Than Hardware Faults

Hardware faults can be correlated, but software faults are often correlated by default.

If many nodes run the same code, then the same bug can take them all down together.

That makes software faults especially nasty:

a bad deploy can break every instance,
a hidden edge case can trigger cascading failures,
a retry loop can overload healthy dependencies,
a bad assumption can remain dormant until traffic, data shape, or timing changes.

This is why distributed systems often fail in ways that look less like "a machine died" and more like "everything followed the same broken rule at once."
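
The retry-loop item in that list is one of the easier ones to guard against. A common mitigation is to cap the number of attempts and spread retries out with exponential backoff and random jitter, so that many clients do not retry in lockstep. The sketch below is one way to write that; the function names, default delays, and attempt limit are assumptions chosen for the example.

```typescript
// Delay before the next retry: a random value in [0, min(cap, base * 2^attempt)).
// `rand` is injectable so the function can be tested deterministically.
function backoffDelayMs(
  attempt: number,
  baseMs = 100,
  capMs = 5000,
  rand: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(rand() * ceiling);
}

// Run `op` up to `maxAttempts` times, sleeping between failed attempts.
// After the last failed attempt the most recent error is rethrown.
async function retryWithBackoff<T>(
  op: () => Promise<T>,
  maxAttempts = 4,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt));
      }
    }
  }
  throw lastError;
}
```

The cap on attempts is the part that protects the dependency. Backoff without a limit still turns one slow service into sustained extra load.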

The most dangerous failures are often systemic, not local.

Exactly-Once Thinking Usually Hides a Hard Reliability Problem

The social timeline example in this article hints at an issue that shows up everywhere: if one machine is updating derived state and crashes midstream, how do you continue without dropping work or duplicating it?

That is the heart of exactly-once semantics discussions.

In practice, many systems settle for a more pragmatic goal: make duplicate work harmless when possible, and make missing work detectable and repairable when not.

That is why idempotency matters so much.

If a payment callback, queue consumer, or webhook delivery might be retried, the system needs a way to distinguish "process this once" from "I already handled this, do not do it again."
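
One way to make that distinction is to key every event by a unique identifier and remember which identifiers have been applied. The in-memory sketch below assumes each delivery carries an `eventId`; the `PaymentLedger` class and its fields are hypothetical. A real system would need to store the identifier and the state change in the same transaction, which this example sidesteps.

```typescript
// Hypothetical event shape; `eventId` is assumed to be unique per logical event
// and identical across redeliveries of that event.
interface PaymentEvent {
  eventId: string;
  accountId: string;
  amountCents: number;
}

class PaymentLedger {
  private processed = new Set<string>();
  private balances = new Map<string, number>();

  // Returns true when the event was applied, false when it was a duplicate.
  apply(event: PaymentEvent): boolean {
    if (this.processed.has(event.eventId)) {
      return false; // already handled, so a retry is harmless
    }
    const current = this.balances.get(event.accountId) ?? 0;
    this.balances.set(event.accountId, current + event.amountCents);
    this.processed.add(event.eventId);
    return true;
  }

  balanceOf(accountId: string): number {
    return this.balances.get(accountId) ?? 0;
  }
}

const ledger = new PaymentLedger();
const callback: PaymentEvent = { eventId: "evt-1", accountId: "acct-9", amountCents: 500 };
const firstDelivery = ledger.apply(callback);
const secondDelivery = ledger.apply(callback); // redelivered callback
console.log(firstDelivery, secondDelivery, ledger.balanceOf("acct-9"));
```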

Frontend engineers meet the same problem in smaller form every time a user double-clicks a purchase button or resubmits after a timeout.
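
On the client, a small guard can collapse repeated clicks into one request while that request is still pending. The wrapper below is a sketch with an invented name, `singleFlight`. It only protects a single page instance, so it complements server-side deduplication rather than replacing it.

```typescript
// Wrap an async action so that calls made while it is still pending
// share the same promise instead of starting the action again.
function singleFlight<T>(action: () => Promise<T>): () => Promise<T> {
  let inFlight: Promise<T> | null = null;
  return () => {
    if (inFlight !== null) {
      return inFlight;
    }
    const started = action();
    inFlight = started;
    const clear = () => { inFlight = null; };
    // Clear the guard once the action settles, whether it succeeded or failed.
    started.then(clear, clear);
    return started;
  };
}

// Hypothetical purchase handler; `charges` counts how often the action really ran.
let charges = 0;
const submitOrder = singleFlight(async () => {
  charges += 1;
  return "order-accepted";
});

// A double-click calls the handler twice before the first request settles.
const first = submitOrder();
const second = submitOrder();
console.log(first === second, charges);
```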

Reliability is not only about preventing errors. It is about making ambiguous execution states survivable.

Humans Are Part of the Reliability Model

One of the strongest sections in this article is the discussion of human error.

It rejects the lazy habit of treating incidents as the moral failure of one operator. Real systems are sociotechnical. People work inside tooling, processes, interfaces, incentives, and incomplete information.

That means many "human errors" are really design errors in the surrounding system:

dangerous defaults,
confusing interfaces,
missing rollback paths,
weak observability,
rushed release processes,
organizations that reward feature speed while underfunding resilience.

This framing is more useful because it creates action.

If every outage postmortem ends with "be more careful," the system does not improve. If the conclusion is "we made it too easy to do the wrong thing and too hard to see the blast radius," now you have something architectural to fix.
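
That kind of fix can be as small as changing a default. The sketch below assumes a hypothetical `purgeAccounts` operation and shows two safeguards: it previews the blast radius unless the caller explicitly asks to execute, and it refuses to run when the number of affected records exceeds a limit. The option names and the default limit of 10 are assumptions made for the example.

```typescript
interface PurgeResult {
  wouldRemove: string[]; // the blast radius, reported in every mode
  executed: boolean;
  remaining: string[];
}

function purgeAccounts(
  accounts: string[],
  shouldRemove: (account: string) => boolean,
  options: { execute?: boolean; maxAffected?: number } = {},
): PurgeResult {
  // Safe defaults: preview only, and a small cap on how much one call may remove.
  const { execute = false, maxAffected = 10 } = options;
  const wouldRemove = accounts.filter(shouldRemove);

  if (!execute) {
    return { wouldRemove, executed: false, remaining: accounts.slice() };
  }
  if (wouldRemove.length > maxAffected) {
    throw new Error(
      `refusing to remove ${wouldRemove.length} accounts; limit is ${maxAffected}`,
    );
  }
  const remaining = accounts.filter((account) => !shouldRemove(account));
  return { wouldRemove, executed: true, remaining };
}

const accounts = ["test-1", "test-2", "alice", "bob"];
const isTestAccount = (account: string) => account.startsWith("test-");

const preview = purgeAccounts(accounts, isTestAccount);
const applied = purgeAccounts(accounts, isTestAccount, { execute: true });
console.log(preview.wouldRemove, applied.remaining);
```

With this shape, the easy path is the safe one. Doing something destructive takes an extra, visible step.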

Blameless Postmortems Are About Learning, Not Niceness

Blameless postmortems are sometimes described so softly that they sound ceremonial.

They are not. Their value is practical.

You need truthful incident information if you want better systems. People will not share that truth if the process is mainly about punishment.

Good postmortems ask questions like these:

What assumptions failed?
What signals were missing?
Which dependency behaved differently from expected?
Which safeguard was absent, too weak, or too hard to use?
Which organizational pressure made the risky path more likely?

That is how reliability work becomes cumulative rather than theatrical.

Security and Privacy Failures Are Reliability Failures Too

This article explicitly includes unauthorized access and abuse in the definition of what users expect from reliable software.

That matters because teams often separate security into its own silo and then discuss reliability as if it only means availability.

That is not how users experience failure.

If a platform leaks personal data, allows account takeover, or preserves deleted information in ways it should not, the system has failed in a way that can be far more damaging than a short outage.

This is where architecture and ethics meet.

Retention policies, deletion propagation, access controls, audit trails, and data minimization are not legal decorations. They are part of building a system that behaves correctly under real-world obligations.

Reliability Has Social Consequences

This article's discussion of unreliable software harming people is not a rhetorical flourish. It is a reminder that software errors can become legal, financial, and human harm.

The Post Office Horizon scandal is a particularly important example because it shows how software output can be treated as institutional truth long after the system itself should have been under suspicion.

That changes how we should talk about correctness.

Reliable software is not only about preserving revenue or uptime. In some domains, it is about avoiding wrongful accusations, broken records, lost medical context, or serious privacy violations.

This is why "move fast" is not a universal virtue. Some systems should absolutely trade speed of delivery for stronger verification and safer recovery paths.

What Frontend and JavaScript Engineers Should Take Seriously

Even if you are not designing storage replicas or failover clusters, the reliability model still reaches you.

You shape it when you decide:

whether duplicate submissions are possible,
whether degraded states are comprehensible,
whether destructive actions are reversible,
whether stale or partial data is labeled clearly,
whether retry behavior is safe,
whether the product hides uncertainty or surfaces it honestly.

That is all reliability work.

The UI is often where the consequences of ambiguity first become visible. A system that cannot tell whether an operation completed should not fake certainty. A system with delayed consistency should not present every screen as if all data updates propagate instantly.
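
One way to keep the UI honest is to give uncertainty its own state instead of folding it into success or failure. The types and wording below are hypothetical and only sketch the idea: a timeout maps to an explicit "unknown" status with its own message.

```typescript
// Hypothetical types for illustration; not taken from any framework.
type Outcome =
  | { type: "response"; ok: boolean; detail: string }
  | { type: "timeout" };

type SubmitStatus =
  | { kind: "confirmed"; reference: string }
  | { kind: "rejected"; reason: string }
  | { kind: "unknown" };

function toStatus(outcome: Outcome): SubmitStatus {
  if (outcome.type === "timeout") {
    // No response: the server may or may not have applied the change.
    return { kind: "unknown" };
  }
  return outcome.ok
    ? { kind: "confirmed", reference: outcome.detail }
    : { kind: "rejected", reason: outcome.detail };
}

function statusMessage(status: SubmitStatus): string {
  switch (status.kind) {
    case "confirmed":
      return `Payment confirmed (${status.reference}).`;
    case "rejected":
      return `Payment was not taken: ${status.reason}.`;
    case "unknown":
      return "We could not confirm this payment. Check your order history before trying again.";
  }
}

console.log(statusMessage(toStatus({ type: "timeout" })));
```

Because `SubmitStatus` is a discriminated union, the compiler can flag any screen that forgets to handle the "unknown" case.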

Conclusion

Reliability is bigger than uptime and bigger than infrastructure.

It includes correctness, performance under expected load, resilience to faults, resistance to abuse, and the ability of humans to operate and recover the system safely. Hardware fails. Software fails in correlated ways. People make mistakes inside systems that may be poorly designed to support them.

The engineering job is not to pretend those things can be eliminated. It is to design products and systems that continue behaving acceptably when they inevitably happen.