Real Incident Walkthroughs

Advanced · 17 min read

Debugging Is a Skill, Not a Talent

Junior engineers stare at code hoping the bug reveals itself. Senior engineers follow a process — and they follow it even when they're stressed, tired, and getting paged at 2am. This chapter walks through four real-world production incidents and shows you the exact investigation steps, the dead ends, and the moment the root cause clicks.

Mental Model

Every debugging session follows the same loop: Observe → Hypothesize → Test → Narrow. You observe a symptom (slow page, growing memory, visual glitch). You form a hypothesis about the cause. You test it with a specific DevTools action. If confirmed, you narrow further. If disproven, you form a new hypothesis. The skill is in forming good hypotheses quickly and testing them with the right tool.

Incident 1: Sudden INP Regression

Symptoms

Friday deploy. Monday morning, the INP monitoring dashboard is on fire — 75th percentile jumped from 120ms to 480ms. Users report that clicking buttons feels "laggy." The regression affects all pages, not just one route. Classic "what did we ship?" scenario.

Investigation

Step 1: Reproduce locally. Open the app in Chrome with CPU throttling at 4x. Click a few buttons. The lag is obvious — there's a visible delay between click and response.

Step 2: Record a Performance trace. Click Record, click a button, wait for the response, click Stop. Zoom into the click event on the Main thread.

Step 3: Read the flame graph. The click event fires a Task that's 380ms long — marked with the red long-task corner. Inside the task:

┌──────────── Task (380ms) ────────────────────┐
│ ┌─ Event: click (375ms) ───────────────────┐  │
│ │ ┌─ handleClick (370ms) ────────────────┐ │  │
│ │ │ ┌─ validateForm (8ms)─┐              │ │  │
│ │ │ ├─ trackAnalytics (355ms) ──────────┐│ │  │
│ │ │ │ ┌─ JSON.stringify (340ms) ───────┐││ │  │
│ │ │ │ └──────────────────────────────┘ ││ │  │
│ │ │ └─────────────────────────────────┘│ │  │
│ │ └───────────────────────────────────┘ │  │
│ └──────────────────────────────────────┘   │
└────────────────────────────────────────────┘

Step 4: Identify the bottleneck. trackAnalytics takes 355ms, and inside it, JSON.stringify takes 340ms. That's 89% of the entire interaction time spent serializing data for analytics.
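The "follow the widest bar to the deepest leaf" reading can be sketched as a small tree walk. This is only an illustration: the tree literal mirrors the durations in the flame graph above, and the helper names (`selfTime`, `heaviestBySelfTime`) are made up for this example rather than part of any DevTools API.

```javascript
// Flame-graph nodes from the trace above: total time in ms plus child calls.
const trace = {
  name: 'Task', total: 380, children: [{
    name: 'Event: click', total: 375, children: [{
      name: 'handleClick', total: 370, children: [
        { name: 'validateForm', total: 8, children: [] },
        { name: 'trackAnalytics', total: 355, children: [
          { name: 'JSON.stringify', total: 340, children: [] },
        ] },
      ],
    }],
  }],
};

// Self time = a node's total time minus the time spent in its children.
function selfTime(node) {
  const childTotal = node.children.reduce((sum, child) => sum + child.total, 0);
  return node.total - childTotal;
}

// Walk the whole tree and return the node with the largest self time.
function heaviestBySelfTime(node) {
  let best = { name: node.name, self: selfTime(node) };
  for (const child of node.children) {
    const candidate = heaviestBySelfTime(child);
    if (candidate.self > best.self) best = candidate;
  }
  return best;
}

const hotspot = heaviestBySelfTime(trace);
const share = Math.round((hotspot.self / trace.total) * 100);
console.log(`${hotspot.name}: ${hotspot.self}ms self time, ${share}% of the task`);
// prints "JSON.stringify: 340ms self time, 89% of the task"
```

Every wrapper frame (Task, click, handleClick) has only a few milliseconds of self time, which is why the wide bars at the top of the graph are not the place to optimize.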

Step 5: Check the code. The Friday deploy added a new analytics event that serializes the entire Redux store on every click:

// The offending code — added in Friday's deploy
function trackAnalytics(event) {
  const payload = {
    event,
    timestamp: Date.now(),
    state: JSON.stringify(store.getState()), // Serializing entire store!
  };
  navigator.sendBeacon('/analytics', JSON.stringify(payload));
}

The Redux store contains thousands of entities (products, users, cached API responses). Serializing it on every click takes 340ms.
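To get a feel for how the cost scales, you can time the serialization directly. The sketch below is an assumption-heavy stand-in: `buildFakeState` invents a store shape with 50,000 product entities, which is not the real store from the incident, and the timings will vary by machine. It only shows the gap between serializing everything and serializing a two-field slice.

```javascript
// Hypothetical stand-in for a large Redux store: thousands of cached entities.
function buildFakeState(entityCount) {
  const products = {};
  for (let i = 0; i < entityCount; i++) {
    products[i] = { id: i, name: `Product ${i}`, tags: ['a', 'b', 'c'], price: i * 1.5 };
  }
  return { router: { pathname: '/checkout' }, user: { id: 42 }, products };
}

// Time one JSON.stringify call and report the size of the result.
function measureSerialization(value) {
  const start = performance.now();
  const json = JSON.stringify(value);
  const ms = performance.now() - start;
  return { ms, chars: json.length };
}

const state = buildFakeState(50000);
const full = measureSerialization(state);
const slice = measureSerialization({ route: state.router.pathname, userId: state.user.id });

console.log(`full store: ${full.chars} chars in ${full.ms.toFixed(1)}ms`);
console.log(`slice:      ${slice.chars} chars in ${slice.ms.toFixed(1)}ms`);
```

Run it once more with CPU throttling in mind: whatever the full-store number is on a development machine, a mid-range phone will be several times slower.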

Root Cause

The analytics tracking function serializes the entire application state on every user interaction. The store grew large enough that JSON.stringify became a performance cliff.

Fix

Only serialize the relevant slice of state needed for the analytics event:

function trackAnalytics(event) {
  const state = store.getState();
  const payload = {
    event,
    timestamp: Date.now(),
    route: state.router.pathname,
    userId: state.user.id,
    // Only the fields analytics actually needs
  };
  navigator.sendBeacon('/analytics', JSON.stringify(payload));
}

Post-fix INP: 95ms. Back to normal.

Quiz

In this incident, what was the key signal in the flame graph that pointed to the root cause?


Incident 2: Memory Leak Growing Over Hours

Symptoms

Customer support escalates a ticket: the admin dashboard becomes "frozen" after being open for 4-6 hours. The only fix is a hard refresh. No errors in the Console — it just silently degrades. Chrome Task Manager reveals the tab is eating 2.1 GB of memory (baseline is ~150 MB). That's a 14x increase.

Investigation

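Step 1: Confirm the leak. Open Chrome's Task Manager (Shift+Esc) and watch the "JavaScript Memory" column while navigating between dashboard pages (right-click the table header to enable the column if it is hidden). Memory increases with each navigation and never decreases, even after forcing GC with the Collect garbage button in the Memory panel.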

Step 2: Three-snapshot technique. Snapshot 1 (baseline: 150 MB). Navigate to the analytics page, then back to the dashboard. Snapshot 2 (165 MB). Repeat the navigation. Snapshot 3 (180 MB).
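The reasoning behind taking three snapshots rather than two can be written down as arithmetic. A minimal sketch, using the snapshot sizes above and treating the 2.1 GB from the symptoms as 2100 MB; the function names and the 2 MB tolerance are assumptions chosen for illustration.

```javascript
// Heap sizes in MB from the three snapshots described above.
const snapshotsMb = [150, 165, 180];

// Growth between consecutive snapshots. One action (a navigation round trip)
// happens between each pair.
function growthPerAction(sizes) {
  return sizes.slice(1).map((size, i) => size - sizes[i]);
}

// A one-time allocation (cache warm-up, lazily loaded module) grows once and
// then stops. A per-action leak keeps growing by a similar amount each time,
// so at least two growth measurements are needed to tell them apart.
function looksLikePerActionLeak(sizes, toleranceMb = 2) {
  const growth = growthPerAction(sizes);
  return growth.length >= 2 &&
    growth.every((g) => g > 0 && Math.abs(g - growth[0]) <= toleranceMb);
}

// Rough projection: how many repeats until the heap reaches a given size?
function actionsUntil(targetMb, sizes) {
  const growth = growthPerAction(sizes);
  const average = growth.reduce((sum, g) => sum + g, 0) / growth.length;
  return Math.ceil((targetMb - sizes[0]) / average);
}

console.log(growthPerAction(snapshotsMb));        // [ 15, 15 ]
console.log(looksLikePerActionLeak(snapshotsMb)); // true
console.log(actionsUntil(2100, snapshotsMb));     // 130
```

At 15 MB per round trip, roughly 130 navigations would take the tab from 150 MB to 2.1 GB, which is plausible for a dashboard left open for 4-6 hours.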

Step 3: Compare Snapshot 3 vs Snapshot 2. Sort by # Delta. Results:

Constructor               # New   # Deleted   # Delta
Object                    3,891   42          +3,849
Array                     1,247   0           +1,247
Detached HTMLDivElement   312     0           +312
IntersectionObserver      1       0           +1
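The # Delta column is simply new objects minus deleted ones. The sketch below reproduces that calculation on the rows above and pulls out the detached DOM nodes, which are usually the most actionable rows. The data structure and helper names are invented for this example; DevTools does not export snapshots in this shape.

```javascript
// Rows from the comparison view above: objects allocated and freed between
// Snapshot 2 and Snapshot 3, grouped by constructor.
const rows = [
  { ctor: 'Array', added: 1247, deleted: 0 },
  { ctor: 'Object', added: 3891, deleted: 42 },
  { ctor: 'Detached HTMLDivElement', added: 312, deleted: 0 },
  { ctor: 'IntersectionObserver', added: 1, deleted: 0 },
];

// # Delta is the net change; sort so the biggest net growth comes first.
function sortByDelta(rows) {
  return rows
    .map((row) => ({ ...row, delta: row.added - row.deleted }))
    .sort((a, b) => b.delta - a.delta);
}

// Detached DOM nodes with positive net growth were removed from the page but
// are still reachable from JavaScript, so they are the first rows to inspect.
function detachedSuspects(rows) {
  return sortByDelta(rows).filter((row) => row.ctor.startsWith('Detached') && row.delta > 0);
}

for (const row of sortByDelta(rows)) {
  console.log(`${row.ctor.padEnd(26)} +${row.delta}`);
}
console.log(detachedSuspects(rows).map((row) => row.ctor)); // [ 'Detached HTMLDivElement' ]
```

Generic constructors such as Object and Array top the list by count, but they rarely tell you where to look; the detached elements and the single extra IntersectionObserver carry the signal.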

Step 4: Inspect the Detached DOM nodes. Click on Detached HTMLDivElement. Expand one instance. The Retainer chain:

Detached HTMLDivElement @456789
  ← entries[0].target in IntersectionObserverEntry
    ← callback scope in IntersectionObserver @123456
      ← observers in Set @789012
        ← (GC root: IntersectionObserver registry)

Step 5: Find the leaking code. The analytics page creates an IntersectionObserver to track which charts are visible. But the observer is never disconnected when the page unmounts:

// Analytics page component
useEffect(() => {
  const observer = new IntersectionObserver((entries) => {
    entries.forEach(entry => {
      if (entry.isIntersecting) {
        trackChartView(entry.target.dataset.chartId);
      }
    });
  });

  document.querySelectorAll('.chart').forEach(el => observer.observe(el));

  // Missing: return () => observer.disconnect();
}, []);

Root Cause

The IntersectionObserver is never disconnected. Each navigation to the analytics page creates a new observer that holds references to all observed DOM elements. The DOM elements are removed from the document but retained by the observer — creating detached DOM trees that grow with every navigation.

Fix

useEffect(() => {
  const observer = new IntersectionObserver((entries) => {
    entries.forEach(entry => {
      if (entry.isIntersecting) {
        trackChartView(entry.target.dataset.chartId);
      }
    });
  });

  document.querySelectorAll('.chart').forEach(el => observer.observe(el));

  return () => observer.disconnect(); // Cleanup!
}, []);

Post-fix: memory stabilizes at ~160 MB regardless of how many times you navigate.
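The same setup-and-cleanup contract can be expressed without React, which also makes it easy to check in a unit test. This is a sketch under stated assumptions: `observeCharts` is a hypothetical helper, and `FakeObserver` is a test double written for this example so the code runs outside a browser. In a real page you would pass the browser's `IntersectionObserver` as the constructor.

```javascript
// Set up the observer and return a cleanup function. `ObserverCtor` is
// injectable (an assumption of this sketch) so the logic can run in a test.
function observeCharts(elements, onChartView, ObserverCtor) {
  const observer = new ObserverCtor((entries) => {
    for (const entry of entries) {
      if (entry.isIntersecting) onChartView(entry.target.dataset.chartId);
    }
  });
  elements.forEach((el) => observer.observe(el));
  return () => observer.disconnect(); // the caller must run this on unmount
}

// Test double that records what it is observing. Not a real browser API.
const created = [];
class FakeObserver {
  constructor(callback) {
    this.callback = callback;
    this.targets = new Set();
    created.push(this);
  }
  observe(el) { this.targets.add(el); }
  disconnect() { this.targets.clear(); }
}

// Plain objects standing in for chart elements.
const charts = [{ dataset: { chartId: 'revenue' } }, { dataset: { chartId: 'signups' } }];
const seen = [];
const cleanup = observeCharts(charts, (id) => seen.push(id), FakeObserver);

// Simulate the browser reporting that the first chart became visible.
created[0].callback([{ isIntersecting: true, target: charts[0] }]);
console.log(seen);                    // [ 'revenue' ]
console.log(created[0].targets.size); // 2 while mounted
cleanup();
console.log(created[0].targets.size); // 0 after cleanup
```

A test like this would have caught the original bug: with the cleanup missing, there is nothing to call on unmount and the observed set never empties.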

Quiz

In this incident, what would have happened if you only took two snapshots instead of three?


Incident 3: Layout Shift Causing CLS Failure

Symptoms

Lighthouse reports CLS of 0.42 (target: < 0.1). That's 4x over budget. Users on slower connections report that content "jumps around" while the page loads, and the marketing team is frustrated because the hero section appears to "bounce." Three different problems conspiring to make one terrible experience.

Investigation

Step 1: Enable Layout Shift Regions. In DevTools → Rendering panel, check "Layout Shift Regions." Reload the page on "Fast 3G" throttling. Blue rectangles flash over the hero section — the text and CTA button shift down when something loads above them.

Step 2: Record a Performance trace on reload with "Fast 3G" throttling. In the Layout Shifts track (labeled Experience in older Chrome versions), look for the "Layout Shift" markers. Three layout shifts appear:

Shift 1 (0.15) at 1.2s: hero text moves down
Shift 2 (0.18) at 2.1s: product cards move down
Shift 3 (0.09) at 3.4s: footer content shifts
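The three scores add up to the 0.42 that Lighthouse reported. One caveat worth knowing: the current CLS definition does not simply sum every shift. It groups shifts into session windows (shifts less than 1 second apart, with a window capped at 5 seconds) and reports the largest window. The sketch below computes both numbers for the shifts in this trace; the 1s and 5s thresholds come from the published metric definition, not from this incident, and the helper names are illustrative.

```javascript
// Layout shifts from the trace above: time in seconds and shift score.
const shifts = [
  { time: 1.2, score: 0.15 },
  { time: 2.1, score: 0.18 },
  { time: 3.4, score: 0.09 },
];

// Round to two decimals to hide floating-point noise.
const round2 = (n) => Math.round(n * 100) / 100;

// Simple total of every shift during the load.
function totalShift(shifts) {
  return round2(shifts.reduce((sum, s) => sum + s.score, 0));
}

// Session windows: consecutive shifts less than maxGap apart are grouped, a
// window lasts at most maxWindow, and the score is the largest window total.
function largestSessionWindow(shifts, maxGap = 1, maxWindow = 5) {
  let best = 0;
  let current = 0;
  let windowStart = -Infinity;
  let previous = -Infinity;
  for (const s of shifts) {
    const startsNewWindow =
      s.time - previous >= maxGap || s.time - windowStart >= maxWindow;
    if (startsNewWindow) {
      current = 0;
      windowStart = s.time;
    }
    current += s.score;
    previous = s.time;
    best = Math.max(best, current);
  }
  return round2(best);
}

console.log(totalShift(shifts));           // 0.42
console.log(largestSessionWindow(shifts)); // 0.33
```

Either way the page fails the 0.1 target, and the first two shifts, which land in the same window, are the ones to fix first.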

Step 3: Click the first Layout Shift marker. DevTools highlights the shifted elements and shows the shift source. The hero text shifts because a web font loads and changes the text metrics.

Step 4: Click the second Layout Shift marker. Product cards shift because an ad banner above them loads asynchronously — it has no reserved height, so it pushes content down when it appears.

Step 5: Click the third Layout Shift marker. Footer content shifts because images in the product grid don't have explicit width/height attributes. As each image loads, it pushes subsequent content down.

Root Cause

Three independent CLS sources: (1) Font swap causing text reflow, (2) dynamically injected ad banner without reserved space, (3) images without dimensions.

Fixes

/* Fix 1: Font loading — use font-display: optional to prevent layout shift */
@font-face {
  font-family: 'Brand';
  src: url('/fonts/brand.woff2') format('woff2');
  font-display: optional; /* Uses fallback if font hasn't loaded — zero shift */
}

/* Fix 2: Reserve ad banner space */
.ad-banner-slot {
  min-height: 90px; /* Match the ad creative height */
  contain: layout;  /* Isolates the slot's internal layout; a creative taller than 90px still pushes content down */
}

<!-- Fix 3: Explicit dimensions on images -->
<img src="product.jpg" width="400" height="300" alt="Product" />
<!-- Browser reserves 4:3 aspect ratio space before image loads -->

Post-fix CLS: 0.03. Well under the 0.1 target.
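Fix 3 is easy to regress, so it can help to audit templates for images that lack dimensions. The following is a naive sketch, not a production linter: it uses regular expressions on an HTML string rather than a real parser, `banner.jpg` is a made-up example, and `reservedHeight` assumes the image is scaled by CSS with its aspect ratio preserved.

```javascript
// Return the src of every <img> tag that lacks a width or height attribute.
// Regex-based and deliberately naive: fine for a quick audit of a template
// string, not a replacement for a real HTML parser.
function imagesMissingDimensions(html) {
  const tags = html.match(/<img\b[^>]*>/gi) || [];
  return tags
    .filter((tag) => !/\swidth\s*=/i.test(tag) || !/\sheight\s*=/i.test(tag))
    .map((tag) => (tag.match(/\ssrc\s*=\s*["']([^"']*)["']/i) || [])[1] || '(no src)');
}

// Height the browser can reserve before the image loads, derived from the
// aspect ratio of the width/height attributes and the rendered width.
function reservedHeight(attrWidth, attrHeight, renderedWidth) {
  return renderedWidth * (attrHeight / attrWidth);
}

const markup = `
  <img src="product.jpg" width="400" height="300" alt="Product" />
  <img src="banner.jpg" alt="Banner" />
`;

console.log(imagesMissingDimensions(markup)); // [ 'banner.jpg' ]
console.log(reservedHeight(400, 300, 320));   // 240
```

The second function shows why the attributes work even in a responsive layout: a 400x300 image rendered 320px wide still gets a 240px-tall box reserved up front.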

Quiz

Why is font-display: optional better than font-display: swap for preventing CLS?


Incident 4: Slow Page Load — Blocking Chain

Symptoms

A product detail page takes 6.2 seconds to load on mobile (target: 2.5s). Lighthouse performance score: 34 — brutal. The page looks blank for the first 4 seconds. Users are bouncing before they ever see content.

Investigation

Step 1: Network waterfall with throttling. Record with "Fast 3G" and cache disabled. The waterfall reveals a catastrophic sequential chain:

0s     1s     2s     3s     4s     5s     6s
├──HTML (600ms)──┤
                 ├──main.css (800ms)──┤
                                      ├──app.js (1200ms, sync)──┤
                                                                ├──vendor.js (900ms, sync)──┤
                                                                                             ├──data fetch (700ms)──┤
                                                                                                                    ├──render──┤

Every resource is sequential. Nothing loads in parallel. The chain takes 3.5 seconds before the data fetch can even start, and 4.2 seconds before rendering begins.
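The cost of the chain is the longest path through the request dependency graph. The sketch below models the waterfall above and a hypothetical flattened version with the same durations. It is a simplification: it ignores bandwidth contention between parallel downloads on a throttled connection, so real-world gains will be smaller, and the graph shape for the flattened case is an assumption for illustration.

```javascript
// Requests from the waterfall above: duration in ms and what each waits for.
const chained = {
  html:   { ms: 600,  after: [] },
  css:    { ms: 800,  after: ['html'] },
  app:    { ms: 1200, after: ['css'] },
  vendor: { ms: 900,  after: ['app'] },
  data:   { ms: 700,  after: ['vendor'] },
};

// Finish time of a request = its duration plus the latest finish time among
// the requests it waits for (longest path through the dependency graph).
function finishTime(graph, name, memo = {}) {
  if (memo[name] === undefined) {
    const { ms, after } = graph[name];
    const start = Math.max(0, ...after.map((dep) => finishTime(graph, dep, memo)));
    memo[name] = start + ms;
  }
  return memo[name];
}

// Hypothetical flattened version: same durations, but both scripts start as
// soon as the HTML arrives, and the data fetch waits only for the scripts.
const flattened = {
  ...chained,
  app:    { ms: 1200, after: ['html'] },
  vendor: { ms: 900,  after: ['html'] },
  data:   { ms: 700,  after: ['app', 'vendor'] },
};

console.log(finishTime(chained, 'vendor'));  // 3500: the data fetch can start
console.log(finishTime(chained, 'data'));    // 4200: rendering can start
console.log(finishTime(flattened, 'data'));  // 2500
```

Nothing got smaller or faster in the flattened graph; removing the artificial dependencies alone takes 1.7 seconds off the path to the first render in this model.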

Step 2: Analyze each request.

HTML: 600ms. TTFB is 480ms (slow server, no CDN).
main.css: 800ms. 420 KB uncompressed (no code splitting, includes unused styles).
app.js: 1200ms. 1.8 MB bundle, loaded synchronously in <head>.
vendor.js: 900ms. 1.2 MB, injected by app.js at runtime, so the browser discovers it only after app.js has downloaded and executed.

Step 3: Check the Initiator column. vendor.js is loaded by a document.createElement('script') inside app.js — a dynamic script injection that creates an artificial dependency chain.

Step 4: Check priorities. The LCP image (product photo) has "Low" priority — the browser deprioritizes it behind all the blocking JavaScript.

Root Cause

Four compounding issues: (1) no CDN for the HTML document, (2) unoptimized CSS bundle, (3) synchronous JS blocking rendering, (4) chained script loading via dynamic injection.

Fixes

<!-- Before: everything blocks everything -->
<head>
  <link rel="stylesheet" href="/main.css" />
  <script src="/app.js"></script>
</head>

<!-- After: parallelized, non-blocking -->
<head>
  <!-- Preconnect to CDN -->
  <link rel="preconnect" href="https://cdn.example.com" />

  <!-- Critical CSS inlined -->
  <style>/* above-fold styles only — 8KB */</style>

  <!-- Full CSS loaded asynchronously -->
  <link rel="stylesheet" href="/main.css" media="print" onload="this.media='all'" />

  <!-- Scripts deferred — don't block parsing -->
  <script src="/app.js" defer></script>
  <script src="/vendor.js" defer></script>

  <!-- Preload LCP image -->
  <link rel="preload" as="image" href="/product-hero.webp" fetchpriority="high" />
</head>

Additional fixes: serve the HTML from a CDN edge (TTFB drops from 480ms to 60ms), enable Brotli compression (CSS drops from 420 KB to 78 KB), and code-split vendor.js so that only the code needed for the initial render loads up front.

Post-fix load time: 1.8 seconds. LCP: 1.4 seconds.

Quiz

In this incident, which single fix would have the largest impact on time-to-first-paint?


Key Rules

1. Every debugging session follows Observe → Hypothesize → Test → Narrow. Form specific hypotheses and test them with the right tool.
2. In flame graphs, follow the widest bars to the deepest leaf. That's where Self Time concentrates and where optimization has the most leverage.
3. The three-snapshot technique isolates per-action leaks from one-time allocations. Always force GC between snapshots.
4. Layout Shift markers in the Performance timeline point to the exact elements and the exact moment of each shift.
5. Network waterfall blocking chains are among the most common causes of slow page loads. Flatten chains with preload, defer, and parallel loading.
6. Always reproduce with realistic conditions: CPU throttling at 4x, network throttling at Fast 3G, cache disabled.

Interview Question

Q: You're paged for a production incident: users report the page is "slow." You have no other information. Walk me through your triage process.

A strong answer: (1) Clarify "slow" — is it slow to load (Network/LCP), slow to interact (Performance/INP), or slow over time (Memory/leak)? (2) For slow loading: open Network panel with Fast 3G throttling, read the waterfall, find the critical blocking chain, check TTFB and priorities. (3) For slow interactions: record a Performance trace with CPU throttling, find long tasks in the flame graph, identify the heaviest leaf functions. (4) For degradation over time: check Task Manager for growing memory, use three-snapshot comparison to find leaked objects. (5) In each case: fix the root cause, verify with the same tool, deploy, and monitor the dashboard to confirm the fix holds in production.