Blog & Guide Responses

Apr 26, 2022 · 04:00 PM

[Deep Dive #3] Resilience Drills - discussion

Started by Priya Liao

blog · resilience · routing · error surfaces · evidence · drills

Has anyone made 'resilience drills' part of their normal dev workflow, not just a one-off incident thing?

I like the framing of drills as explicit routing events: you can make failures repeatable by scripting navigation intent, cache posture, and error surfaces, then rendering evidence so the app explains itself under stress.

In practice, what do you actually drill (network flake, stale cache, auth expiry, route remount)? Do you render error evidence (error lane, retry posture, fallback chosen) into the DOM during drills? Where do you store drill posture (route shell vs local screen docs vs a vault)? And how do you keep the drill from becoming a one-off script that rots?

Comments (14)


Jasper Nguyen

Apr 26, 2022 · 04:07 PM

Are drills just end-to-end tests, or something you run manually too?

I'm not sure how to justify time for this.

Priya Liao

Apr 26, 2022 · 04:12 PM

Both. We started manual (quarterly) to learn what evidence we were missing, then we codified the steady-state drills as route-flow tests.

The docs are right that without render-visible evidence you mostly "feel" failures instead of observing them.

Jasper Nguyen

Apr 26, 2022 · 04:16 PM

That helps. We can probably sell quarterly manual drills before we try to automate anything.

Inez Park

Apr 26, 2022 · 04:20 PM

On a larger team, the durable artifact is the drill contract (inputs + expected evidence), not the script.

We treat drills as a route surface: you should be able to run them by navigating and toggling a handful of posture keys.
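
A minimal sketch of what a drill contract could look like as data, assuming evidence is rendered as plain `key=value` strings. `DrillContract`, `checkContract`, and the example values are hypothetical names for illustration, not part of any library mentioned in this thread.

```typescript
// Hypothetical shape: a drill contract pairs posture inputs with the evidence
// strings the UI is expected to render once that posture is applied.
type DrillContract = {
  name: string;
  posture: Record<string, string>; // inputs, e.g. { drill: 'network-flake' }
  expectedEvidence: string[];      // outputs the UI should render
};

// Returns the expected evidence entries that were NOT found in what the UI rendered.
// An empty result means the contract held.
function checkContract(contract: DrillContract, renderedEvidence: string[]): string[] {
  const seen = new Set(renderedEvidence);
  return contract.expectedEvidence.filter((entry) => !seen.has(entry));
}

// Example contract (values are illustrative).
const networkFlake: DrillContract = {
  name: 'orders under network flake',
  posture: { drill: 'network-flake' },
  expectedEvidence: ['drill=network-flake', 'fallback=skeleton'],
};
```

Because the contract is only inputs and expected outputs, the same object can back a manual drill checklist and an automated assertion.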

Rina Kobayashi

Apr 26, 2022 · 04:28 PM

Drills are how you catch "retry storms" caused by fetch helpers that run during render.

If the UI renders a retry signal (for example, retry=backoff:200ms), accidental infinite retries show up quickly.
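
One way to produce that kind of retry signal, sketched as a pure function so it is easy to snapshot. `retryPosture` and the exact evidence format beyond `retry=backoff:200ms` are assumptions for illustration.

```typescript
// Given how many attempts have failed so far (1-based), decide whether to retry,
// how long to wait, and what evidence string to render.
function retryPosture(failedAttempt: number, maxAttempts: number, baseDelayMs = 200) {
  if (failedAttempt >= maxAttempts) {
    // A hard cap is what turns an accidental infinite retry into a visible "exhausted" state.
    return { retry: false, delayMs: 0, evidence: `retry=exhausted:${maxAttempts}` };
  }
  // Exponential backoff: 200ms, 400ms, 800ms, ... with the default base delay.
  const delayMs = baseDelayMs * 2 ** (failedAttempt - 1);
  return {
    retry: true,
    delayMs,
    evidence: `retry=backoff:${delayMs}ms attempt=${failedAttempt}/${maxAttempts}`,
  };
}
```

If the attempt counter in the rendered evidence keeps climbing, or the backoff value never grows, the retry loop is not honoring the cap.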

Sofia Valdez

Apr 26, 2022 · 04:37 PM

I like encoding drill posture as a union of lanes (error lane, fetch lane, cache lane).

It also stops the team from inventing random boolean flags during incidents.

Harper Iqbal

Apr 26, 2022 · 04:41 PM

Same. If you keep lanes as named strings, you can render them as evidence without more work.

And you can snapshot drills: lane=timeout should always show the same fallback UI.
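
A rough sketch of lanes as named strings, assuming three lanes with made-up values. The lane names, `renderEvidence`, and the fallback mapping are all hypothetical; the point is only that string lanes render and snapshot without extra work.

```typescript
// Hypothetical lane values; a real app would define its own.
type ErrorLane = 'none' | 'timeout' | 'offline';
type FetchLane = 'live' | 'flake';
type CacheLane = 'fresh' | 'stale';
type Posture = { errorLane: ErrorLane; fetchLane: FetchLane; cacheLane: CacheLane };

// Keys are sorted so the rendered string is stable and safe to snapshot.
function renderEvidence(posture: Posture): string {
  return (Object.keys(posture) as (keyof Posture)[])
    .sort()
    .map((key) => `${key}=${posture[key]}`)
    .join(' ');
}

// Illustrative mapping from error lane to fallback UI. Record<ErrorLane, ...>
// makes the compiler flag any lane that has no fallback assigned.
const fallbackForErrorLane: Record<ErrorLane, string> = {
  none: 'content',
  timeout: 'skeleton',
  offline: 'offline-banner',
};
```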

Camille Duarte

Apr 26, 2022 · 04:49 PM

My only pushback: people can over-drill and end up designing for rehearsed failures.

We got more value from making the UI explain itself (evidence + posture) than from writing a giant drill script library.

Amina Farouk

Apr 26, 2022 · 04:56 PM

We hit this in prod: auth expiry drills are worth it. We had a token-expiry bug that only happened after a tab sat idle.

Once we added an "auth lane" evidence key and a deterministic "expire now" toggle, the bug became a 2-minute repro instead of a guessing game.
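
A sketch of what a deterministic "expire now" toggle might look like, assuming the session exposes its state as an auth lane. `createSession` and the evidence format are hypothetical; the clock is injected so idle time can be simulated without waiting.

```typescript
type AuthLane = 'valid' | 'expired';

// `now` defaults to the real clock but can be replaced in drills and tests.
function createSession(expiresAtMs: number, now: () => number = Date.now) {
  let forcedExpired = false;

  // The lane is derived on every read, so an idle tab reports 'expired'
  // as soon as the clock passes the expiry time.
  const lane = (): AuthLane => (forcedExpired || now() >= expiresAtMs ? 'expired' : 'valid');

  return {
    // Drill toggle: expire immediately, regardless of the clock.
    expireNow: () => { forcedExpired = true; },
    lane,
    // Evidence records whether the expiry was natural or forced by a drill.
    evidence: () => `auth=${lane()}${forcedExpired ? ' source=drill' : ''}`,
  };
}
```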

Theo Bennett

Apr 26, 2022 · 05:02 PM

This is also where having route state as the source of truth helps.

You can force navigation intent to re-check auth and render the reason it redirected.
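
As a sketch of "render the reason it redirected": a guard that returns the reason together with the decision. `guardNavigation`, the `/login` target, and the reason format are assumptions for illustration, not an API from the framework discussed here.

```typescript
type GuardDecision =
  | { action: 'allow' }
  | { action: 'redirect'; to: string; reason: string };

// Re-checks auth for each navigation intent. The reason travels with the
// redirect so the destination screen can render it as evidence.
function guardNavigation(
  intent: { path: string; requiresAuth: boolean },
  authLane: 'valid' | 'expired',
): GuardDecision {
  if (intent.requiresAuth && authLane === 'expired') {
    return { action: 'redirect', to: '/login', reason: `auth=expired from=${intent.path}` };
  }
  return { action: 'allow' };
}
```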

Lena Fischer

Apr 26, 2022 · 05:10 PM

Minimal drill toggle pattern (we keep it in the shell):

tsx
const shell = useGlobalRoute('app', { drill: 'off' });
shell.write('drill', 'network-flake');

Maya Henson

Apr 26, 2022 · 05:21 PM

Question: where do you store "what fallback was chosen"? In the error boundary? In route state?

The post suggests rendering it as evidence, which feels super useful.

Benji Rios

Apr 26, 2022 · 05:33 PM

We store fallback choice in the route doc, keyed by surface: fallbackBySurface["orders"] = "skeleton".

It sounds odd until you realize it makes incidents explainable and test snapshots meaningful.
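
A small sketch of that keyed-by-surface idea, assuming the route doc is a plain object. `RouteDoc`, `recordFallback`, and the fallback names other than "skeleton" are hypothetical.

```typescript
type Fallback = 'none' | 'skeleton' | 'error-toast' | 'cached';
type RouteDoc = { fallbackBySurface: Record<string, Fallback> };

// Returns a new doc rather than mutating, so each change is a distinct snapshot.
function recordFallback(doc: RouteDoc, surface: string, fallback: Fallback): RouteDoc {
  return { ...doc, fallbackBySurface: { ...doc.fallbackBySurface, [surface]: fallback } };
}

// Sorted by surface name so the output is stable in test snapshots.
function fallbackEvidence(doc: RouteDoc): string[] {
  return Object.keys(doc.fallbackBySurface)
    .sort()
    .map((surface) => `fallback[${surface}]=${doc.fallbackBySurface[surface]}`);
}
```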

Soren Malik

Apr 26, 2022 · 05:47 PM

My gut check: can someone follow the evidence and predict what the UI will do next?

If they need to read code to answer that, the drill is mostly theater.

Dmitri Kline

Apr 26, 2022 · 06:02 PM

We got drills to stop rotting by making them first-class UI, not scripts. The drill posture lives in the shell and the app renders it like any other contract key.

tsx
type DrillLane = 'off' | 'network-flake' | 'auth-expired' | 'stale-cache' | 'timeout';

function DrillControls() {
  const shell = useGlobalRoute('app', { drill: 'off' as DrillLane });
  const drill = shell.read('drill');

  return (
    <div>
      <span>drill={drill}</span>
      <button onClick={() => shell.write('drill', 'network-flake')}>Network flake</button>
      <button onClick={() => shell.write('drill', 'auth-expired')}>Auth expired</button>
      <button onClick={() => shell.write('drill', 'off')}>Off</button>
    </div>
  );
}

function useDrillPosture() {
  const shell = useGlobalRoute('app', { drill: 'off' as DrillLane });
  const drill = shell.read('drill');
  return {
    drill,
    shouldFlakeNetwork: drill === 'network-flake',
    shouldExpireAuth: drill === 'auth-expired',
  };
}

Then the rest of the app reads it the same way it reads routing posture, and evidence shows which lane is active. No separate scripts to keep alive.

Hana Saito

Apr 26, 2022 · 06:09 PM

The most valuable drill for us was forced remounts. It exposed all the places we were relying on hidden module state.
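
A framework-free simulation of why forced remounts surface hidden module state. The "screens" here are plain factory functions standing in for components, and every name is hypothetical.

```typescript
// Hidden module state: it lives outside any instance, so it survives every remount.
let moduleLevelRequestCount = 0;

function mountLeakyScreen() {
  return { load: () => ++moduleLevelRequestCount };
}

function mountCleanScreen() {
  let requestCount = 0; // recreated on every mount
  return { load: () => ++requestCount };
}

// Forced remount drill: mount, use once, discard, mount again, and compare
// what the first interaction returns each time.
function remountDrill(mount: () => { load: () => number }) {
  const first = mount().load();
  const afterRemount = mount().load();
  return { first, afterRemount, leaked: first !== afterRemount };
}
```

A screen that behaves differently on its second mount than on its first is depending on something the remount did not reset.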

Mina Calder

Feb 14, 2023 · 05:26 PM

We made drills stop rotting by shipping them as a query-string feature in production (yep). The trick is: the drill posture is part of the route surface, so it can be recorded, shared, and reproduced just by copying a URL.

We also intentionally keep the failure model "imprecise" so teams don't overfit to one error case.

tsx
type DrillSurface = { drill: 'off' | 'flake' | 'timeout' | 'stale'; evidence: string[] };

function useDrillSurface() {
  const route = useRouteState({ drill: 'off', evidence: [] } as DrillSurface);
  const jump = useRouteJump();

  const drill = route.read('drill');
  const setDrill = (next: DrillSurface['drill']) => {
    route.write('drill', next);
    jump.replace({ query: { drill: next } });
    route.write('evidence', [`drill=${next}`, `at=${Date.now()}`, 'source=query']);
  };

  return { drill, setDrill, evidence: route.read('evidence') };
}

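// Wraps fetch so the active drill lane decides whether a request fails, hangs, or passes through.
function flakeFetch(drill: DrillSurface['drill']) {
  return async (input: RequestInfo, init?: RequestInit) => {
    // 'flake': reject roughly 40% of requests.
    if (drill === 'flake' && Math.random() > 0.6) throw new Error('Drill flake');
    // 'timeout': never settle, so the caller's own timeout handling gets exercised.
    if (drill === 'timeout') return new Promise<Response>(() => {});
    // 'off' and 'stale' pass through to the real fetch.
    return fetch(input, init);
  };
}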

We render evidence in the shell so bug reports can say 'drill=flake source=query' and everyone instantly knows the posture.
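
A sketch of reading and writing the drill posture through the URL itself, using only the standard `URL` API. `drillFromUrl`, `urlWithDrill`, the `source=` values other than `query`, and the example host are assumptions for illustration.

```typescript
const DRILL_LANES = ['off', 'flake', 'timeout', 'stale'] as const;
type DrillLane = (typeof DRILL_LANES)[number];

// Unknown or missing values fall back to 'off', so a mistyped URL cannot enable a drill.
function drillFromUrl(url: string): { drill: DrillLane; evidence: string[] } {
  const raw = new URL(url).searchParams.get('drill');
  const known = DRILL_LANES.some((lane) => lane === raw);
  const drill: DrillLane = known ? (raw as DrillLane) : 'off';
  // Evidence distinguishes "no param", "valid param", and "param present but rejected".
  const source = raw === null ? 'default' : known ? 'query' : 'query-rejected';
  return { drill, evidence: [`drill=${drill}`, `source=${source}`] };
}

// Produces a shareable URL with the drill posture set, preserving other params.
function urlWithDrill(url: string, drill: DrillLane): string {
  const next = new URL(url);
  next.searchParams.set('drill', drill);
  return next.toString();
}
```

Validating against a fixed list matters more once drills ship in production, since the query string is user-controlled input.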

Theo Bennett

Feb 14, 2023 · 05:41 PM

Shipping drills is the only way they stay honest.

If the drill code only runs in CI, it slowly becomes a different app than production.

We also store the random seed in route state so failures are "replayable" enough to be useful, but still chaotic enough to catch missing evidence.
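
A sketch of seed-driven flakiness, assuming the seed is just a number kept alongside the drill posture. It uses mulberry32, a small public-domain seeded PRNG, in place of `Math.random()`; `createFlakePlan` and the evidence format are hypothetical.

```typescript
// mulberry32: the same seed always yields the same sequence of numbers in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Each call to shouldFail() consumes one number from the sequence, so the
// Nth request fails or succeeds identically on every replay of the same seed.
function createFlakePlan(seed: number, failureRate: number) {
  const next = mulberry32(seed);
  return {
    shouldFail: () => next() < failureRate,
    evidence: `drill=flake seed=${seed} rate=${failureRate}`,
  };
}
```

Replays only line up if requests are issued in the same order, so this makes failures repeatable for sequential flows more than for racing ones.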

Rina Kobayashi

May 06, 2024 · 04:54 PM

Our most effective drill ended up being a *scripted* navigation sequence, but we store the script as data so it doesn't become a separate toolchain.

We keep the drill script in a route key and eval it into a runnable function. The function is allowed to do anything as long as it writes evidence about each step.

tsx
type DrillStep = { name: string; run: (api: any) => Promise<void> | void };

function useDrillScript() {
  const shell = useGlobalRoute('app', {
    drillScript: `([
      { name: 'go:orders', run: (a) => a.jump('/orders') },
      { name: 'set:cache=stale', run: (a) => a.shell.write('cacheLane', 'stale') },
      { name: 'flake:network', run: (a) => a.shell.write('drill', 'flake') },
    ])`,
    drillEvidence: [] as string[],
  });

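  // Indirect eval runs the stored string in global scope. Anything that can write
  // `drillScript` can run arbitrary code, so only load it from a trusted source.
  const steps = (0, eval)(shell.read('drillScript')) as DrillStep[];
  const log = (msg: string) => shell.write('drillEvidence', [...shell.read('drillEvidence'), msg]);

  const run = async () => {
    for (const step of steps) {
      log(`step=${step.name}`);
      // location.assign starts a full page load: later steps may not complete, and
      // in-memory state is lost unless the shell persists it across the reload.
      await step.run({ shell, jump: (path: string) => location.assign(path) });
    }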
  };

  return { run, evidence: shell.read('drillEvidence') };
}

Because it lives in the shell, you can update the script during an incident and instantly reproduce the same sequence on any machine.
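
For comparison, a sketch of a different technique from the eval approach above: storing the script as JSON data and running it through a fixed set of allowed operations. This keeps "script as data" but means the route key never carries executable code. All names here (`runDrillScript`, the `op` values, `DrillApi`) are hypothetical.

```typescript
// Steps are JSON-serializable data; only these two operations are understood.
type DrillStep =
  | { op: 'jump'; path: string }
  | { op: 'write'; key: string; value: string };

type DrillApi = {
  jump: (path: string) => void;
  write: (key: string, value: string) => void;
};

// Validates one parsed JSON value; anything unrecognized is rejected, not executed.
function parseStep(raw: unknown): DrillStep | null {
  if (typeof raw !== 'object' || raw === null) return null;
  const s = raw as Record<string, unknown>;
  if (s.op === 'jump' && typeof s.path === 'string') return { op: 'jump', path: s.path };
  if (s.op === 'write' && typeof s.key === 'string' && typeof s.value === 'string') {
    return { op: 'write', key: s.key, value: s.value };
  }
  return null;
}

// Runs each step in order and returns one evidence entry per step.
function runDrillScript(json: string, api: DrillApi): string[] {
  const parsed: unknown = JSON.parse(json);
  if (!Array.isArray(parsed)) throw new Error('drill script must be a JSON array');
  const evidence: string[] = [];
  for (const raw of parsed) {
    const step = parseStep(raw);
    if (step === null) {
      evidence.push(`step=rejected:${JSON.stringify(raw)}`);
    } else if (step.op === 'jump') {
      evidence.push(`step=jump:${step.path}`);
      api.jump(step.path);
    } else {
      evidence.push(`step=write:${step.key}=${step.value}`);
      api.write(step.key, step.value);
    }
  }
  return evidence;
}
```

The trade-off is flexibility: a step can no longer "do anything", but every step that runs is one the app explicitly supports.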

Priya Liao

Sep 18, 2025 · 06:22 PM

Follow-up from the OP: our drills became sustainable once we treated them like *routes*, not tests.

We keep a /drills entry with a file router map, but the drill selection is stored as component-router motion evidence (so you can see it even when the file route is the same).

We still run some in CI, but the key win is that the same drill posture keys exist in production (off by default) so we can reproduce weird behavior without deploying a "special build".

Jasper Nguyen

Jan 12, 2026 · 08:03 PM

One pattern that surprised me: we store the *chosen fallback* as a first-class drill output, not a boundary implementation detail.

During a drill, the boundary writes fallbackChosen into the route surface so other panels can reflect it (and tests can assert it) without reaching into component internals.

tsx
function DrillBoundary({ children }: { children: React.ReactNode }) {
  const route = useRouteState({ fallbackChosen: 'none' });
  return (
    <AwaitBoundary
      onReject={(err) => {
        route.write('fallbackChosen', 'error:toast');
        return <div role="alert">Temporary error</div>;
      }}
    >
      {children}
    </AwaitBoundary>
  );
}

Once that key existed, drill results stopped being subjective. Everyone could see which fallback lane actually happened.