Lead/Senior - Application Reliability Decision Making
Concept-focused guide for Lead/Senior - Application Reliability Decision Making
~13 min read

Overview
You’re being tested on how to think about reliability, not just memorize buzzwords. This quiz revolves around defining reliability precisely, distinguishing faults from failures, and choosing the right engineering response: prevent, tolerate, detect, mitigate, and recover. You’ll also practice decision-making under real constraints—speed vs safety, simplicity vs control, and cost vs resilience. By the end, you should be able to look at an incident pattern (slowdowns, bad data, crashes, correlated outages) and immediately map it to the right reliability strategy.
Concept-by-Concept Deep Dive
Reliability as “Correct Service Over Time” (Not Just “No Crashes”)
- What it is. Reliability means a system continues to deliver correct behavior over time under expected conditions, and degrades predictably under unexpected ones. “Correct” includes functional correctness (right results), availability (service is reachable), and often performance (it responds within an acceptable time). Reliability is therefore a user-centered promise: the user’s workflow works, consistently, when they need it.
What “correct” really includes
- Functional correctness: The system produces the right outputs for the right inputs (including rejecting invalid inputs safely).
- Availability/reachability: Users can access the system when they try.
- Performance as part of correctness: A feature that “works” but times out or becomes unusably slow is often effectively broken from the user’s perspective.
- Data correctness over time: Reliability includes not corrupting history (e.g., analytics or financial records) and being able to repair it if corruption occurs.
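To make “data correctness over time” concrete, here is a minimal sketch (a hypothetical ledger, invented for illustration) of an invariant check that detects corrupted history instead of waiting for a user to notice it.

```python
from dataclasses import dataclass


@dataclass
class LedgerEntry:
    account: str
    delta_cents: int    # signed change applied to the account
    balance_cents: int  # balance recorded after applying the change


def find_corruption(entries: list[LedgerEntry]) -> list[int]:
    """Return indices where recorded balances contradict the recorded deltas.

    A reliable system not only serves requests; it preserves invariants in
    stored history so corruption can be detected and repaired.
    """
    bad_indices = []
    running: dict[str, int] = {}
    for i, entry in enumerate(entries):
        expected = running.get(entry.account, 0) + entry.delta_cents
        if entry.balance_cents != expected:
            bad_indices.append(i)
        # Trust the recorded balance going forward so one error is not
        # reported again for every later entry on the same account.
        running[entry.account] = entry.balance_cents
    return bad_indices


if __name__ == "__main__":
    history = [
        LedgerEntry("alice", +500, 500),
        LedgerEntry("alice", -200, 300),
        LedgerEntry("alice", -100, 250),  # corrupted: should be 200
    ]
    print(find_corruption(history))  # [2]
```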
Step-by-step reasoning recipe (for reliability definition questions)
- Identify the user expectation being discussed (workflow, data, latency, uptime).
- Translate it into a measurable property (error rate, tail latency, freshness, correctness checks).
- Decide whether the scenario is about steady-state correctness or behavior under fault (degradation, recovery).
- State the reliability implication: “Users lose trust / workflows break / downstream systems misbehave.”
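As a sketch of steps 1–2 of this recipe, the code below (assuming a hypothetical request log of success flags and latencies) turns a user expectation into two measurable properties: an error rate and a tail latency that can be compared against a target.

```python
import math


def error_rate(requests: list[tuple[bool, float]]) -> float:
    """Fraction of requests that did not succeed."""
    failures = sum(1 for ok, _ in requests if not ok)
    return failures / len(requests)


def percentile_latency(requests: list[tuple[bool, float]], pct: float) -> float:
    """Latency (ms) at the given percentile, e.g. pct=0.99 for p99."""
    latencies = sorted(latency for _, latency in requests)
    index = min(len(latencies) - 1, math.ceil(pct * len(latencies)) - 1)
    return latencies[index]


if __name__ == "__main__":
    # (succeeded, latency_ms) for one window of traffic -- illustrative data only.
    window = [(True, 80.0)] * 97 + [(True, 900.0), (False, 40.0), (False, 1200.0)]
    print(f"error rate: {error_rate(window):.1%}")                 # 2.0%
    print(f"p99 latency: {percentile_latency(window, 0.99):.0f} ms")  # 900 ms
```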
Common misconceptions and how to fix them
- Misconception: “If the app doesn’t crash, it’s reliable.”
Fix: Include wrong results, slow responses, and partial failures as reliability failures.
- Misconception: “Noncritical apps don’t need reliability.”
Fix: Reliability still affects trust, retention, support load, brand damage, and hidden criticality (photos, identity, receipts, memories).
- Misconception: “Reliability = 100% uptime.”
Fix: Reliability is broader: correctness + availability + performance + recoverability.
Faults vs Failures: The Causal Chain You Must Keep Straight
- What it is. A fault is a defect or abnormal condition (bug, misconfiguration, disk error, bad dependency response). A failure is when the system’s delivered service deviates from what users expect (errors, incorrect results, timeouts, outage). Faults are causes; failures are observed outcomes.
A useful mental model: “latent → activated → propagated”
- Latent fault: Exists but hasn’t mattered yet (a bug in rarely used code).
- Activation/trigger: A condition makes the fault matter (leap second, rare input, traffic spike).
- Propagation: The fault spreads through dependencies or shared resources.
- Failure: Users see errors, slowness, or wrong data.
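A toy sketch of this chain (the scenario is invented for illustration): a latent fault sits in a rarely exercised branch, a rare input activates it, and without isolation it propagates into a failure the user sees.

```python
def average_discount(order_totals: list[float]) -> float:
    # Latent fault: this code assumes at least one order exists.
    # The bug sits here harmlessly as long as every caller passes data.
    return sum(order_totals) / len(order_totals)


def render_dashboard(orders_by_region: dict[str, list[float]]) -> dict[str, str]:
    rendered = {}
    for region, totals in orders_by_region.items():
        # Activation: a newly launched region with zero orders triggers the
        # latent ZeroDivisionError.
        # Propagation: without isolation, the exception aborts the whole loop,
        # so every region's panel fails, not just the new one.
        rendered[region] = f"{average_discount(totals):.2f}"
    return rendered


if __name__ == "__main__":
    try:
        render_dashboard({"us": [10.0, 20.0], "new-region": []})
    except ZeroDivisionError as exc:
        # Failure: the user sees a broken dashboard for all regions.
        print(f"user-visible failure: {exc}")
```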
Step-by-step reasoning recipe (fault vs failure questions)
- Ask: is this describing a condition inside the system (fault) or user-visible deviation (failure)?
- If it’s internal, ask whether it’s systematic (reproducible) or random (stochastic, like an isolated hardware fault).
- Map the described event to the chain: fault → trigger → failure.
Common misconceptions and how to fix them
- Misconception: “A fault is the same as a failure.”
Fix: A fault can exist with no failure until triggered; a failure can occur without identifying the fault immediately.
- Misconception: “If we add redundancy, faults disappear.”
Fix: Redundancy doesn’t remove faults; it reduces the chance a fault becomes a user-visible failure.
Systematic vs Random Faults, Correlation, and “Common Cause”
- What it is. Some faults are uncorrelated (one disk fails; others are independent), while others are systematic/correlated (same bug hits every node under the same condition). Correlation is the enemy of naive redundancy: many copies don’t help if they fail together. “Common cause” refers to shared factors that create correlation (same kernel version, same rack power, same dependency, same deploy).
Uncorrelated hardware faults (often modeled as independent)
- Think: individual component wear-out, random bit flips, isolated device failure.
- Independence is an approximation that makes reliability math workable—but it breaks when shared conditions exist.
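A small worked example shows why that approximation is so attractive. Assuming each replica fails independently with the same probability in a given window (the numbers are illustrative), the joint failure probability shrinks geometrically with each copy.

```python
def p_all_fail_independent(p_single: float, replicas: int) -> float:
    """P(every replica fails) if failures are truly independent: p ** n."""
    return p_single ** replicas


if __name__ == "__main__":
    p = 0.01  # assumed per-replica failure probability in some window
    for n in (1, 2, 3):
        print(f"{n} replica(s): P(total outage) = {p_all_fail_independent(p, n):.0e}")
    # 1 replica: 1e-02, 2 replicas: 1e-04, 3 replicas: 1e-06 --
    # each added copy multiplies the protection, *if* independence holds.
```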
Systematic faults (reproducible, often software/config)
- Same input + same code path = same failure on every machine.
- Examples include timekeeping edge cases, deterministic bugs, bad config pushed everywhere, incompatible protocol changes.
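To see why redundancy buys nothing here, the toy sketch below (an invented parsing bug, not taken from the guide) runs the same code path on several simulated replicas; the same input defeats all of them at once.

```python
from datetime import datetime


def parse_expiry(raw: str) -> datetime:
    # Systematic fault: the parser only accepts ISO dates, but a partner
    # system starts sending slash-separated dates one day.
    return datetime.strptime(raw, "%Y-%m-%d")


def run_on_replicas(raw: str, replica_count: int) -> list[str]:
    results = []
    for replica in range(replica_count):
        try:
            results.append(f"replica {replica}: {parse_expiry(raw).date()}")
        except ValueError:
            results.append(f"replica {replica}: FAILED")
    return results


if __name__ == "__main__":
    # Every replica runs the same code path on the same input, so every
    # replica fails together -- more copies change nothing.
    print(run_on_replicas("2025/01/31", replica_count=3))
```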
Common-cause correlation (the subtle middle)
- Not perfectly systematic, but not independent either: shared environment increases joint failure probability.
- Examples: same availability zone, shared power feed, shared network switch, shared image/AMI, shared certificate authority, shared dependency rate limit.
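Extending the earlier independence sketch with an assumed common-cause term shows how shared conditions put a floor under the joint failure probability, no matter how many replicas you add (all numbers are illustrative).

```python
def p_total_outage(p_single: float, replicas: int, p_common_cause: float) -> float:
    """Joint failure probability with a shared trigger mixed in.

    With probability p_common_cause, a shared condition (same bad deploy,
    same power feed, same dependency outage) takes out every replica at once;
    otherwise replicas fail independently.
    """
    return p_common_cause + (1 - p_common_cause) * (p_single ** replicas)


if __name__ == "__main__":
    for n in (1, 2, 3, 10):
        print(f"{n:>2} replicas: {p_total_outage(0.01, n, p_common_cause=0.001):.2e}")
    # Beyond a couple of replicas the answer is pinned near 1e-03: adding
    # copies cannot beat the shared failure mode; only diversity and
    # isolation reduce it.
```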
Step-by-step reasoning recipe (correlation questions)
- Identify whether the cause is shared across many instances.
- If shared, assume correlated failure risk (redundancy is weaker).
- Decide mitigation: diversity (different versions/regions), isolation (fault domains), staged rollout, circuit breakers, bulkheads.
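The sketch below illustrates one of those mitigations, a staged rollout gate. Here observed_error_rate is a stand-in for a real canary measurement, and the stages and error budget are assumptions, not a recommendation.

```python
def observed_error_rate(percent: int) -> float:
    """Stand-in for a real measurement of the updated (canary) population."""
    # Assumed observations: the defect only shows up once enough traffic hits it.
    return {1: 0.002, 5: 0.004, 25: 0.031}.get(percent, 0.031)


def staged_rollout(stages=(1, 5, 25, 100), max_error_rate=0.01) -> bool:
    """Widen the blast radius only while the canary population stays healthy."""
    for percent in stages:
        rate = observed_error_rate(percent)
        if rate > max_error_rate:
            print(f"halt at {percent}%: error rate {rate:.1%} exceeds the budget")
            return False
        print(f"stage {percent}%: healthy (error rate {rate:.1%})")
    return True


if __name__ == "__main__":
    if not staged_rollout():  # halts at 25% in this example
        print("rolling back to the previous version")
```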
Common misconceptions and how to fix them
- Misconception: “With 200 servers, failures should scale linearly and predictably.”
Fix: Large fleets often reveal non-independence: shared dependencies, shared software, shared time events.
- Misconception: “Redundancy always solves reliability.”
Fix: Redundancy helps random faults; systematic faults require prevention, detection, and safe rollout.
Fault Tolerance, Redundancy, and Isolation: Reducing Blast Radius
- What it is. Fault tolerance is the ability to keep providing acceptable service even when parts fail. In practice, you don’t “tolerate faults” so much as tolerate failures of components by detecting issues, routing around them, and limiting impact. The core idea is blast-radius control: keep a fault local so it can’t take everything down.
Component-level redundancy vs machine/service-level tolerance
- Component-level redundancy: Duplicate a component inside a machine or subsystem (e.g., redundant disks, redundant power supplies). Helps when the machine stays up but a part fails.
- Machine/service-level tolerance: Multiple nodes, load balancing, health checks, failover. Helps when an entire instance disappears or needs rebooting (patching, kernel panic, cloud preemption).
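A minimal sketch of machine/service-level tolerance, with hypothetical replica objects standing in for real health probes: traffic is routed only to replicas that pass their check, so one lost instance does not become a user-visible failure.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    healthy: bool  # in reality this would come from a periodic health probe

    def handle(self, request: str) -> str:
        return f"{self.name} served {request!r}"


def route(request: str, replicas: list[Replica]) -> str:
    """Send the request to the first healthy replica; fail only if none remain."""
    for replica in replicas:
        if replica.healthy:
            return replica.handle(request)
    raise RuntimeError("total outage: no healthy replicas left")


if __name__ == "__main__":
    fleet = [
        Replica("node-a", healthy=False),  # e.g. rebooting for a kernel patch
        Replica("node-b", healthy=True),
    ]
    print(route("GET /profile", fleet))  # node-b absorbs the work
```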
Isolation patterns (prevent “noisy neighbor” failures)
- Resource isolation: quotas/limits (CPU, memory), container cgroups, per-tenant limits.
- Bulkheads: partition resources so one failing area can’t sink the whole ship.
- Backpressure: slow producers when consumers are overloaded.
- Circuit breakers/timeouts: stop waiting forever on a degraded dependency.
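To make the last pattern concrete, here is a minimal, illustrative circuit breaker (names and thresholds are assumptions, not a production recipe): after repeated failures, including timeouts, it fails fast for a cool-down period instead of letting every request wait on the degraded dependency.

```python
import time


class CircuitBreaker:
    """Tiny illustrative breaker: open after N failures, retry after a cool-down."""

    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failure_count = 0
        self.opened_at = None

    def call(self, dependency, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                # Fail fast: do not queue more work behind a degraded dependency.
                raise RuntimeError("circuit open: dependency presumed unhealthy")
            self.opened_at = None  # half-open: let one probe call through
        try:
            result = dependency(*args)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failure_count = 0
        return result


def flaky_dependency(payload: str) -> str:
    raise TimeoutError(f"dependency timed out handling {payload!r}")


if __name__ == "__main__":
    breaker = CircuitBreaker(max_failures=2, reset_after_s=30.0)
    for attempt in range(4):
        try:
            breaker.call(flaky_dependency, "order-42")
        except Exception as exc:
            # After two timeouts the breaker opens and later calls fail fast.
            print(f"attempt {attempt}: {type(exc).__name__}: {exc}")
```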
Step-by-step reasoning recipe (design choice questions)
- Determine the failure mode: component dies, machine disappears, dependency slows/corrupts, resource leak.
- Choose the scope of redundancy: inside the box vs across boxes vs across regions.
- Add detection (health checks, SLO-based alerts) and routing (failover/load balancer).
- Add isolation to prevent cascading (limits, timeouts, queues).
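One way to add that isolation is a bulkhead. The sketch below uses a bounded semaphore per dependency (an assumed design, not the only one), so a slow dependency can exhaust only its own slots rather than every worker in the process.

```python
import threading


class Bulkhead:
    """Cap concurrent calls to one dependency so it cannot absorb every worker."""

    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args):
        # Non-blocking acquire: if this partition is full, shed load immediately
        # instead of letting callers pile up behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError(f"bulkhead {self.name!r} full: request rejected")
        try:
            return fn(*args)
        finally:
            self._slots.release()


if __name__ == "__main__":
    recommendations = Bulkhead("recommendations", max_concurrent=2)
    checkout = Bulkhead("checkout", max_concurrent=8)
    # Even if the recommendations service hangs and its two slots stay busy,
    # checkout traffic still has all eight of its own slots available.
    print(recommendations.call(lambda user: f"recs for {user}", "alice"))
    print(checkout.call(lambda order: f"charged {order}", "order-42"))
```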
Common misconceptions and how to fix them
- Misconception: “Fault-tolerant means it never fails.”
Fix: It means it can fail without catastrophic user impact within defined assumptions.
- Misconception: “Single-server + backups is equivalent to multi-node.”
Fix: Backups help data recovery, not continuous availability; reboot/patching causes downtime without redundancy.
Safe Change Management: Rollouts, Rollbacks, and Human Error as a Primary Risk
- What it is. Many outages come from changes: config edits, deploys, migrations, permission tweaks, operational actions. Treating change as a primary risk means validating before you apply, rolling out gradually, and keeping rollback cheap.
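As a minimal illustration under assumed hooks (validate and health_check are hypothetical), the sketch below applies a config change only after validation and rolls it back automatically if health regresses afterward.

```python
def apply_config_change(current: dict, proposed: dict, validate, health_check) -> dict:
    """Apply a config change only if it validates; roll back if health regresses.

    `validate` and `health_check` are assumed hooks: validate() rejects
    obviously bad values before they ship; health_check() observes the system
    after the change (error rate, latency) and returns True if it stays healthy.
    """
    validate(proposed)            # catch human error before it becomes an outage
    previous = dict(current)      # keep a known-good copy to roll back to
    current.update(proposed)
    if not health_check():
        current.clear()
        current.update(previous)  # automated rollback instead of live debugging
        raise RuntimeError("change rolled back: health regressed after apply")
    return current


if __name__ == "__main__":
    def validate(cfg):
        if cfg.get("timeout_ms", 1) <= 0:
            raise ValueError("timeout_ms must be positive")

    live = {"timeout_ms": 2000, "retries": 2}
    apply_config_change(live, {"retries": 3}, validate, health_check=lambda: True)
    print(live)  # {'timeout_ms': 2000, 'retries': 3}
```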