Reconnection and Offline Queueing

Advanced · 17 min read

The Connection Is a Lie

Here's a truth that will save you weeks of debugging: you never truly know if you're connected. You know if you were connected (the last message went through) and you know when you're definitely disconnected (the socket closed). But the gap between "connected" and "aware of disconnection" can be anywhere from milliseconds to minutes.

TCP keepalive, where it is enabled at all, commonly defaults to 2 hours of idle time before the first probe. Some NAT routers and proxies silently drop idle connections after as little as 30-60 seconds. Mobile networks switch between towers, briefly dropping packets. Your user walks into an elevator. Your user's laptop goes to sleep for 3 seconds.

Building real-time features that don't account for this is building on a fantasy.

Let's start with the biggest trap:

// DON'T trust this alone
if (navigator.onLine) {
  // "Online" just means the network interface is up
  // It does NOT mean you can reach your server
}

navigator.onLine returns false when the network adapter is off (airplane mode, no WiFi). That's about all it reliably tells you. It returns true when you're connected to WiFi — even if that WiFi has no internet, even if your server is down, even if a captive portal is blocking you.

The online and offline events are similarly unreliable. They fire on network interface changes, not on actual reachability.

Mental Model

Think of navigator.onLine like checking if your car is running. The engine running doesn't mean the road ahead is clear. You need to actually drive (send a request) to find out if you can get to your destination.

Real Connection Detection

The only reliable way to know you're connected: try to send something and see if it works.

class ConnectionMonitor {
  private state: 'online' | 'offline' | 'unknown' = 'unknown';
  private listeners = new Set<(state: 'online' | 'offline') => void>();
  private checkInterval: ReturnType<typeof setInterval> | null = null;

  constructor(
    private healthUrl: string,
    private intervalMs = 10_000
  ) {}

  start(): void {
    window.addEventListener('online', () => this.check());
    window.addEventListener('offline', () => this.setState('offline'));

    this.check();
    this.checkInterval = setInterval(() => this.check(), this.intervalMs);

    document.addEventListener('visibilitychange', () => {
      if (document.visibilityState === 'visible') {
        this.check();
      }
    });
  }

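  private async check(): Promise<void> {
    const controller = new AbortController();
    // Abort the probe if the server has not answered within 5 seconds
    const timeout = setTimeout(() => controller.abort(), 5000);
    try {
      const response = await fetch(this.healthUrl, {
        method: 'HEAD',
        cache: 'no-store',
        signal: controller.signal,
      });
      // fetch resolves for any HTTP status, so a 503 would otherwise count as "online"
      this.setState(response.ok ? 'online' : 'offline');
    } catch {
      // Network error or timeout
      this.setState('offline');
    } finally {
      // Clear the timer on both success and failure
      clearTimeout(timeout);
    }
  }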

  private setState(next: 'online' | 'offline'): void {
    if (this.state === next) return;
    this.state = next;
    this.listeners.forEach((fn) => fn(next));
  }

  onChange(fn: (state: 'online' | 'offline') => void): () => void {
    this.listeners.add(fn);
    return () => this.listeners.delete(fn);
  }

  stop(): void {
    if (this.checkInterval) clearInterval(this.checkInterval);
  }
}
Quiz
Why is navigator.onLine unreliable for determining server reachability?

Exponential Backoff with Jitter

When a connection drops, the naive approach is to reconnect immediately. When 10,000 clients lose connection simultaneously (server restart, deploy, network blip), they all reconnect at the same instant — creating a thundering herd that can take down the server before it even finishes starting up.

Exponential backoff with jitter solves this:

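function calculateBackoff(
  attempt: number,
  baseMs = 1000,
  maxMs = 30000
): number {
  // Exponential growth, capped at maxMs
  const exponential = Math.min(baseMs * Math.pow(2, attempt), maxMs);
  // Equal-style jitter: keep at least half the delay, randomize the rest
  // (for full jitter, use Math.random() * exponential instead)
  const jitter = exponential * (0.5 + Math.random() * 0.5);
  return jitter;
}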

// Attempt 0: 500-1000ms
// Attempt 1: 1000-2000ms
// Attempt 2: 2000-4000ms
// Attempt 3: 4000-8000ms
// Attempt 4: 8000-16000ms
// Attempt 5+: 15000-30000ms (capped)

There are three jitter strategies:

| Strategy | Formula | When to use |
| --- | --- | --- |
| Full jitter | random(0, base * 2^attempt) | Best for most cases, maximum spread |
| Equal jitter | base * 2^attempt / 2 + random(0, base * 2^attempt / 2) | When you want a guaranteed minimum delay |
| Decorrelated jitter | random(base, prev_delay * 3) | When you want delay correlated to previous attempt |

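AWS's published analysis of these strategies found that full jitter needed the fewest total calls for a group of competing clients, with a completion time close to decorrelated jitter and well ahead of equal jitter. It has the widest spread, which means the least contention.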
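To make the three formulas concrete, here is a minimal sketch of each one. The function names, the `Rand` type, and the injectable random source are illustrative assumptions, not part of any library. The cap at `maxMs` mirrors what `calculateBackoff` does.

```typescript
// Injectable random source so the functions are easy to test; defaults to Math.random.
type Rand = () => number;

// Full jitter: anywhere between 0 and the capped exponential delay.
function fullJitter(
  attempt: number,
  baseMs = 1000,
  maxMs = 30_000,
  rand: Rand = Math.random
): number {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return rand() * ceiling;
}

// Equal jitter: half the capped delay is fixed, the other half is random.
function equalJitter(
  attempt: number,
  baseMs = 1000,
  maxMs = 30_000,
  rand: Rand = Math.random
): number {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return ceiling / 2 + rand() * (ceiling / 2);
}

// Decorrelated jitter: derived from the previous delay, not the attempt count.
function decorrelatedJitter(
  prevDelayMs: number,
  baseMs = 1000,
  maxMs = 30_000,
  rand: Rand = Math.random
): number {
  const upper = Math.max(baseMs, prevDelayMs * 3);
  return Math.min(maxMs, baseMs + rand() * (upper - baseMs));
}
```

Note that `calculateBackoff` above keeps at least half of the delay, so it corresponds to the equal jitter row. Swapping in `fullJitter` gives the wider spread at the cost of occasionally retrying almost immediately.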

Quiz
After 5 failed reconnection attempts with base delay 1000ms and max 30000ms, what is the backoff range with full jitter?

The Offline Message Queue

When the connection drops, messages shouldn't be lost. They should be queued locally and replayed when the connection resumes. For short disconnections, an in-memory queue works. For longer offline periods (minutes, hours), you need persistent storage.
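For the short-disconnection case, a bounded in-memory queue can be as small as the sketch below. `MemoryQueue` and its drop-oldest policy are assumptions made for illustration. Depending on the feature, rejecting new messages when full may be the better choice.

```typescript
// Hypothetical bounded in-memory queue for short disconnections.
// Contents are lost on page reload, which is why longer outages need IndexedDB.
class MemoryQueue<T> {
  private items: T[] = [];

  constructor(private maxSize = 1000) {}

  // Adds an item; returns the oldest item if it had to be evicted to make room.
  enqueue(item: T): T | undefined {
    this.items.push(item);
    return this.items.length > this.maxSize ? this.items.shift() : undefined;
  }

  // Removes and returns everything in FIFO order.
  drainAll(): T[] {
    const out = this.items;
    this.items = [];
    return out;
  }

  get size(): number {
    return this.items.length;
  }
}
```

The cap matters: without it, a tab left open on a dead connection keeps accumulating messages until it runs out of memory.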

interface QueuedMessage {
  id: string;
  type: string;
  payload: unknown;
  timestamp: number;
  attempts: number;
}

class OfflineQueue {
  private db: IDBDatabase | null = null;
  private readonly storeName = 'message-queue';

  async open(): Promise<void> {
    return new Promise((resolve, reject) => {
      const request = indexedDB.open('offline-queue', 1);
      request.onupgradeneeded = () => {
        const db = request.result;
        if (!db.objectStoreNames.contains(this.storeName)) {
          const store = db.createObjectStore(this.storeName, { keyPath: 'id' });
          store.createIndex('timestamp', 'timestamp');
        }
      };
      request.onsuccess = () => {
        this.db = request.result;
        resolve();
      };
      request.onerror = () => reject(request.error);
    });
  }

  async enqueue(message: Omit<QueuedMessage, 'attempts'>): Promise<void> {
    if (!this.db) throw new Error('Queue not initialized');

    return new Promise((resolve, reject) => {
      const tx = this.db!.transaction(this.storeName, 'readwrite');
      tx.objectStore(this.storeName).put({ ...message, attempts: 0 });
      tx.oncomplete = () => resolve();
      tx.onerror = () => reject(tx.error);
    });
  }

  async drain(): Promise<QueuedMessage[]> {
    if (!this.db) throw new Error('Queue not initialized');

    return new Promise((resolve, reject) => {
      const tx = this.db!.transaction(this.storeName, 'readonly');
      const index = tx.objectStore(this.storeName).index('timestamp');
      const request = index.getAll();
      request.onsuccess = () => resolve(request.result);
      request.onerror = () => reject(request.error);
    });
  }

  async remove(id: string): Promise<void> {
    if (!this.db) throw new Error('Queue not initialized');

    return new Promise((resolve, reject) => {
      const tx = this.db!.transaction(this.storeName, 'readwrite');
      tx.objectStore(this.storeName).delete(id);
      tx.oncomplete = () => resolve();
      tx.onerror = () => reject(tx.error);
    });
  }

  async clear(): Promise<void> {
    if (!this.db) throw new Error('Queue not initialized');

    return new Promise((resolve, reject) => {
      const tx = this.db!.transaction(this.storeName, 'readwrite');
      tx.objectStore(this.storeName).clear();
      tx.oncomplete = () => resolve();
      tx.onerror = () => reject(tx.error);
    });
  }
}
Common Trap

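IndexedDB operations are asynchronous but NOT microtask-based. Results arrive as DOM events (success, error, complete) dispatched as regular tasks, so wrapping them in a Promise works, but the callback runs later than any already-queued microtask. Don't assume IndexedDB callbacks fire before the next microtask checkpoint. The reverse also bites: a transaction auto-commits once it has no pending requests, so awaiting unrelated async work (a fetch, a timer) in the middle of a transaction and then issuing another request on it throws a TransactionInactiveError.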

Queue Replay on Reconnection

The tricky part isn't queuing messages — it's replaying them correctly. You need to handle:

  1. Ordering: Replay in the order they were queued (timestamps)
  2. Deduplication: The server might have received some messages before the disconnect
  3. Staleness: A queued message from 2 hours ago might no longer be valid
  4. Backpressure: Don't flood the server with 500 queued messages at once
class QueueReplayer {
  constructor(
    private queue: OfflineQueue,
    private maxAge: number = 3600_000,
    private batchSize: number = 10,
    private batchDelay: number = 100
  ) {}

  async replay(
    send: (msg: QueuedMessage) => Promise<boolean>
  ): Promise<{ sent: number; dropped: number; failed: number }> {
    const messages = await this.queue.drain();
    const now = Date.now();
    let sent = 0;
    let dropped = 0;
    let failed = 0;

    for (let i = 0; i < messages.length; i += this.batchSize) {
      const batch = messages.slice(i, i + this.batchSize);

      for (const msg of batch) {
        if (now - msg.timestamp > this.maxAge) {
          await this.queue.remove(msg.id);
          dropped++;
          continue;
        }

        try {
          const ack = await send(msg);
          if (ack) {
            await this.queue.remove(msg.id);
            sent++;
          } else {
            failed++;
          }
        } catch {
          failed++;
        }
      }

      if (i + this.batchSize < messages.length) {
        await new Promise((r) => setTimeout(r, this.batchDelay));
      }
    }

    return { sent, dropped, failed };
  }
}
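The replayer handles ordering, staleness, and backpressure, but deduplication has to happen on the receiving side, because only the receiver knows what it already processed. A minimal sketch, assuming every message carries a unique `key` and the receiver can hold recent keys in memory (`SeenKeys` and `handleIncoming` are hypothetical names):

```typescript
// Hypothetical receiving-side dedup: remembers recently seen message keys.
class SeenKeys {
  private seen = new Map<string, number>(); // key -> time first seen (ms)

  constructor(private ttlMs = 3_600_000) {}

  // Returns true the first time a key is seen, false for duplicates within the TTL.
  markIfNew(key: string, now = Date.now()): boolean {
    this.evictExpired(now);
    if (this.seen.has(key)) return false;
    this.seen.set(key, now);
    return true;
  }

  // Linear scan; fine for a sketch, too slow for a very large key set.
  private evictExpired(now: number): void {
    this.seen.forEach((firstSeen, key) => {
      if (now - firstSeen > this.ttlMs) this.seen.delete(key);
    });
  }
}

type Incoming = { key: string; type: string; payload: unknown };

// Applies a message at most once per key; duplicates are reported but not applied.
function handleIncoming(
  seen: SeenKeys,
  msg: Incoming,
  apply: (msg: Incoming) => void,
  now = Date.now()
): 'applied' | 'duplicate' {
  if (!seen.markIfNew(msg.key, now)) return 'duplicate';
  apply(msg);
  return 'applied';
}
```

The dedup TTL should be at least as long as the client's replay `maxAge`. Otherwise a message can be replayed after the receiver has already forgotten its key.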
Quiz
Why should queued messages have a maximum age (TTL) for replay?

Building a Resilient Connection Manager

Let's bring it all together — connection monitoring, exponential backoff, offline queuing, and replay:

type ManagerState = 'connected' | 'connecting' | 'reconnecting' | 'offline';
type StateListener = (state: ManagerState) => void;

class ResilientConnectionManager {
  private ws: WebSocket | null = null;
  private state: ManagerState = 'offline';
  private reconnectAttempt = 0;
  private reconnectTimer: ReturnType<typeof setTimeout> | null = null;
  private heartbeatTimer: ReturnType<typeof setInterval> | null = null;
  private lastPongAt = 0;
  private queue: OfflineQueue;
  private stateListeners = new Set<StateListener>();
  private messageHandlers = new Map<string, Set<(data: unknown) => void>>();

  constructor(
    private url: string,
    private opts = {
      maxReconnectAttempts: 15,
      baseDelay: 1000,
      maxDelay: 30000,
      heartbeatInterval: 25000,
      heartbeatTimeout: 10000,
      maxQueueAge: 300_000,
    }
  ) {
    this.queue = new OfflineQueue();
  }

  async start(): Promise<void> {
    await this.queue.open();

    document.addEventListener('visibilitychange', () => {
      if (document.visibilityState === 'visible' && this.state === 'offline') {
        this.reconnectAttempt = 0;
        this.connect();
      }
    });

    window.addEventListener('online', () => {
      if (this.state === 'offline') {
        this.reconnectAttempt = 0;
        this.connect();
      }
    });

    this.connect();
  }

  private connect(): void {
    if (this.ws?.readyState === WebSocket.OPEN) return;

    this.setState(this.reconnectAttempt > 0 ? 'reconnecting' : 'connecting');

    try {
      this.ws = new WebSocket(this.url);
    } catch {
      this.scheduleReconnect();
      return;
    }

    this.ws.addEventListener('open', async () => {
      this.setState('connected');
      this.reconnectAttempt = 0;
      this.startHeartbeat();

      const replayer = new QueueReplayer(
        this.queue,
        this.opts.maxQueueAge
      );
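      await replayer.replay(async (msg) => {
        // The socket may have closed again while earlier messages were replayed.
        const ws = this.ws;
        if (!ws || ws.readyState !== WebSocket.OPEN) return false;
        ws.send(JSON.stringify({ key: msg.id, type: msg.type, payload: msg.payload }));
        // true only means the frame was handed to an open socket, not that the server processed it
        return true;
      });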
    });

    this.ws.addEventListener('message', (event) => {
      const msg = JSON.parse(event.data);
      if (msg.type === '__pong') {
        this.lastPongAt = Date.now();
        return;
      }
      const handlers = this.messageHandlers.get(msg.type);
      handlers?.forEach((fn) => fn(msg.payload));
    });

    this.ws.addEventListener('close', (event) => {
      this.stopHeartbeat();
      if (event.code !== 1000) {
        this.scheduleReconnect();
      } else {
        this.setState('offline');
      }
    });

    this.ws.addEventListener('error', () => {
      this.ws?.close();
    });
  }

  async send(type: string, payload: unknown): Promise<void> {
    const id = `${Date.now()}-${crypto.randomUUID()}`;
    const message = { id, type, payload, timestamp: Date.now() };

    if (this.ws?.readyState === WebSocket.OPEN) {
      this.ws.send(JSON.stringify({ key: id, type, payload }));
    } else {
      await this.queue.enqueue(message);
    }
  }

  on(type: string, handler: (data: unknown) => void): () => void {
    if (!this.messageHandlers.has(type)) {
      this.messageHandlers.set(type, new Set());
    }
    this.messageHandlers.get(type)!.add(handler);
    return () => this.messageHandlers.get(type)?.delete(handler);
  }

  onStateChange(fn: StateListener): () => void {
    this.stateListeners.add(fn);
    return () => this.stateListeners.delete(fn);
  }

  private scheduleReconnect(): void {
    if (this.reconnectAttempt >= this.opts.maxReconnectAttempts) {
      this.setState('offline');
      return;
    }
    const delay = calculateBackoff(
      this.reconnectAttempt,
      this.opts.baseDelay,
      this.opts.maxDelay
    );
    this.reconnectAttempt++;
    this.setState('reconnecting');
    this.reconnectTimer = setTimeout(() => this.connect(), delay);
  }

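  private startHeartbeat(): void {
    this.lastPongAt = Date.now();
    this.heartbeatTimer = setInterval(() => {
      if (this.ws?.readyState !== WebSocket.OPEN) return;

      // A healthy connection answers each ping, so the last pong is at most
      // one interval old (plus latency). Comparing against heartbeatTimeout
      // alone would trip on the very first tick, because a full interval
      // has already passed since lastPongAt was set.
      const silence = Date.now() - this.lastPongAt;
      if (silence > this.opts.heartbeatInterval + this.opts.heartbeatTimeout) {
        this.ws.close(4000, 'Heartbeat timeout');
        return;
      }

      this.ws.send(JSON.stringify({ type: '__ping', ts: Date.now() }));
    }, this.opts.heartbeatInterval);
  }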

  private stopHeartbeat(): void {
    if (this.heartbeatTimer) {
      clearInterval(this.heartbeatTimer);
      this.heartbeatTimer = null;
    }
  }

  private setState(next: ManagerState): void {
    if (this.state === next) return;
    this.state = next;
    this.stateListeners.forEach((fn) => fn(next));
  }

  disconnect(): void {
    if (this.reconnectTimer) clearTimeout(this.reconnectTimer);
    this.stopHeartbeat();
    this.ws?.close(1000, 'Client disconnect');
    this.setState('offline');
  }
}
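One gap in the manager above: the replay callback reports success as soon as a frame is handed to the socket, so a message can be removed from the queue without the server ever seeing it. A possible fix is to wait for an explicit acknowledgement. The sketch below assumes the server replies with a frame like `{ type: '__ack', key }`. That protocol and the `AckTracker` class are hypothetical, not something the code above defines.

```typescript
// Hypothetical ack tracker: resolves true when the server acknowledges a key,
// false if no ack arrives within the timeout or the socket closes first.
class AckTracker {
  private pending = new Map<string, (acked: boolean) => void>();

  constructor(private timeoutMs = 5000) {}

  // Call right after sending; await the result before removing from the queue.
  waitFor(key: string): Promise<boolean> {
    return new Promise((resolve) => {
      const timer = setTimeout(() => {
        this.pending.delete(key);
        resolve(false);
      }, this.timeoutMs);

      this.pending.set(key, (acked) => {
        clearTimeout(timer);
        this.pending.delete(key);
        resolve(acked);
      });
    });
  }

  // Call from the message handler when an (assumed) `__ack` frame arrives.
  acknowledge(key: string): void {
    this.pending.get(key)?.(true);
  }

  // Call on socket close so in-flight sends fail fast instead of waiting out the timer.
  failAll(): void {
    // Copy first: each settle callback deletes its own entry from the map.
    Array.from(this.pending.values()).forEach((settle) => settle(false));
  }
}
```

With this in place, the replay callback would send the frame and then `return tracker.waitFor(msg.id)`, so the replayer only deletes a message once the server has confirmed it. Unacknowledged messages stay in the queue and are retried on the next reconnect, which is exactly why the receiver needs deduplication.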
| What developers do | What they should do | Why |
| --- | --- | --- |
| Reconnecting immediately without backoff | Use exponential backoff with full jitter | Immediate reconnection creates a thundering herd when many clients disconnect simultaneously (server restart, deploy). This can prevent the server from recovering. |
| Trusting navigator.onLine to determine connectivity | Use actual request-based health checks to your server | navigator.onLine only reflects the network interface state. You can be 'online' on a WiFi with no internet, behind a captive portal, or when your server is down. |
| Using localStorage for the offline message queue | Use IndexedDB for persistent offline queuing | localStorage is synchronous (blocks the main thread), limited to 5-10MB, and only stores strings. IndexedDB is async, supports structured data, and has much larger storage limits. |
| Replaying all queued messages instantly on reconnect | Batch replay with delays and TTL-based expiration | Flooding the server with hundreds of queued messages immediately after reconnection can overwhelm it. Batching with delays gives the server time to process, and TTL discards stale messages. |
Interview Question

System Design: Offline-Capable Chat Application

Design a chat app that works offline. Users should be able to compose messages while offline and see them marked as "pending." When connectivity returns, messages should be sent in order. Handle: a message that has been "pending" for 3 hours (the channel might have been deleted), conflicts between pending and received messages, and the UX for showing the user what is synced versus pending. Discuss IndexedDB schema design for the queue.