Idempotency and retries for webhook handlers

At-least-once delivery, deduplication keys, fast ACK + async workers, HMAC verification, and DLQs—with Stripe and GitHub references.

HTTP webhooks are almost always delivered at least once: the sender may retry after a network failure, after your handler times out, or after any non-2xx response. Your system must stay correct when the same logical event arrives twice, or out of order relative to other events.

1. The idempotency contract

Idempotency means: processing the same message multiple times has the same effect as processing it once. For HTTP handlers, that usually means:

  1. Extract a stable event identifier (event_id, delivery_id, or a hash of payload + type + timestamp window).
  2. Persist “seen” before irreversible side effects (charge, shipment, permanent DB mutation).
  3. Return 2xx quickly so the sender stops retrying (when the work is done or safely queued).

Stripe documents idempotent API requests via the Idempotency-Key header (see "Idempotent requests" in the Stripe API documentation). Webhook delivery deduplication is separate: use Stripe's event.id as your natural key.

GitHub sends an X-GitHub-Delivery header with each delivery, which works well as the key for dedupe rows.
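
As a rough illustration of step 1, the sketch below pulls a dedupe key out of an incoming request for the two providers mentioned above. The function name dedupe_key and the provider labels "stripe" and "github" are hypothetical; only the Stripe event id field and the X-GitHub-Delivery header come from the text.

```python
import json
from typing import Mapping, Tuple


def dedupe_key(provider: str, headers: Mapping[str, str], body: bytes) -> Tuple[str, str]:
    """Return (provider, event_id) suitable for a dedupe table (hypothetical helper)."""
    if provider == "stripe":
        # Stripe event objects carry their identifier in the top-level "id" field.
        return provider, json.loads(body)["id"]
    if provider == "github":
        # HTTP header names are case-insensitive, so normalise before the lookup.
        lowered = {name.lower(): value for name, value in headers.items()}
        return provider, lowered["x-github-delivery"]
    raise ValueError(f"unknown provider: {provider}")
```

Keeping the provider in the key avoids collisions if two senders ever happen to use the same identifier string.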

2. Pattern: insert-first deduplication

Sketch for a relational store:

-- Pseudocode schema
CREATE TABLE processed_webhook_events (
  provider TEXT NOT NULL,
  event_id TEXT NOT NULL,
  received_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  PRIMARY KEY (provider, event_id)
);

Handler flow:

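BEGIN;
INSERT INTO processed_webhook_events (provider, event_id)
  VALUES ('stripe', $1)
  ON CONFLICT (provider, event_id) DO NOTHING;
-- Check the affected-row count of the INSERT:
--   0 rows: duplicate delivery; COMMIT and return 200 without further work.
--   1 row:  first delivery; COMMIT, then run business logic (or enqueue).
COMMIT;
-- Caveat: if the business logic fails after this COMMIT, a redelivery will be
-- treated as a duplicate. Where the work is database-only, run it inside the
-- same transaction so a failure also rolls back the dedupe row.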

Use your database’s exact semantics (ON CONFLICT, INSERT ... IF NOT EXISTS) to avoid race conditions between two concurrent deliveries of the same event.
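
The same flow can be sketched in application code. The example below uses Python's built-in sqlite3 module as a stand-in for the relational store, so the column types differ slightly from the schema above; claim_event is a hypothetical helper name. It assumes an SQLite version that supports ON CONFLICT ... DO NOTHING (3.24 or later).

```python
import sqlite3


def claim_event(conn: sqlite3.Connection, provider: str, event_id: str) -> bool:
    """Return True if this call recorded the event first, False for a duplicate."""
    with conn:  # commits on success, rolls back if an exception escapes
        cur = conn.execute(
            "INSERT INTO processed_webhook_events (provider, event_id) "
            "VALUES (?, ?) ON CONFLICT (provider, event_id) DO NOTHING",
            (provider, event_id),
        )
        # rowcount is 1 when the row was inserted and 0 when the conflict was ignored.
        return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE processed_webhook_events ("
    " provider TEXT NOT NULL,"
    " event_id TEXT NOT NULL,"
    " received_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,"
    " PRIMARY KEY (provider, event_id))"
)
first = claim_event(conn, "stripe", "evt_123")   # first delivery
second = claim_event(conn, "stripe", "evt_123")  # redelivery of the same event
```

Because the primary key enforces uniqueness, the database decides which of two concurrent deliveries wins; the application only inspects the row count.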

3. Fast ACK vs heavy work

For expensive work, split the pipeline:

  1. Handler: verify signature, parse JSON, enqueue minimal payload (event id + references) to SQS / Azure Service Bus / RabbitMQ.
  2. Worker: idempotent processing with its own retries and dead-letter queue after N failures.

On AWS, one option is Lambda with an SQS event source mapping and a DLQ on the queue. Splitting the pipeline this way keeps HTTP timeouts from killing long jobs.
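
A minimal sketch of the split, using an in-process queue.Queue as a stand-in for a real broker. The names handle_webhook, drain_once, and the message fields are hypothetical, and signature verification (section 4) is omitted to keep the example short.

```python
import json
import queue

work_queue: "queue.Queue[dict]" = queue.Queue()  # stand-in for SQS / Service Bus / RabbitMQ
processed: list = []                             # records what the worker handled


def handle_webhook(body: bytes) -> int:
    """HTTP side: parse, enqueue a minimal message, and return a status code."""
    try:
        event = json.loads(body)
        message = {"event_id": event["id"], "type": event["type"]}
    except (ValueError, KeyError, TypeError):
        return 400  # malformed payload; a retry would fail the same way
    work_queue.put(message)  # only the id and references, not the full payload
    return 202  # any 2xx tells the sender to stop retrying


def drain_once() -> None:
    """Worker side: process everything currently queued."""
    while True:
        try:
            message = work_queue.get_nowait()
        except queue.Empty:
            return
        processed.append(message["event_id"])  # placeholder for idempotent work
```

The handler does no business logic at all, so its latency is bounded by parsing and one enqueue call.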

4. Signature verification and replay windows

Verify the HMAC (or provider-specific signature) before any heavy logic. Examples: Stripe's Stripe-Signature header and GitHub's X-Hub-Signature-256 header.

Reject stale events when the spec allows (e.g. timestamp outside tolerance) to limit replay attacks.
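
The sketch below shows the general shape of a timestamped HMAC check. The signing scheme (HMAC-SHA256 over "<timestamp>.<body>"), the function names, and the 300-second tolerance are assumptions for illustration, not any provider's exact format; follow your provider's documentation for header parsing and the real signed payload.

```python
import hashlib
import hmac
import time
from typing import Optional

TOLERANCE_SECONDS = 300  # assumed replay window; use your provider's documented value


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """Hex HMAC-SHA256 over "<timestamp>.<body>" (hypothetical scheme)."""
    signed_payload = str(timestamp).encode() + b"." + body
    return hmac.new(secret, signed_payload, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp: int, body: bytes, signature: str,
           now: Optional[float] = None) -> bool:
    """Return True only for a fresh request whose signature matches."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > TOLERANCE_SECONDS:
        return False  # stale or future-dated: treat as a possible replay
    expected = sign(secret, timestamp, body)
    # compare_digest takes time independent of where the inputs differ.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

Verification must run over the raw request bytes; re-serialising parsed JSON can change whitespace or key order and break the comparison.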

5. Retries: who retries whom?

Layer                                    | Typical behavior
Sender                                   | Retries on connection errors or 5xx; may not retry on 4xx (your bug)
Your HTTP client calling third parties   | Use backoff + idempotency keys for your outbound calls
Queue consumer                           | Exponential backoff; DLQ after max receives

Do not infinitely retry poison messages—move them to DLQ and alert.
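
The queue-consumer row can be sketched as follows. Managed brokers usually implement the receive count and dead-lettering for you; this in-process version only illustrates the control flow, and MAX_RECEIVES, BASE_DELAY_SECONDS, and the function names are assumptions.

```python
from typing import Callable, List

MAX_RECEIVES = 3          # assumed limit; brokers usually make this configurable
BASE_DELAY_SECONDS = 1.0  # assumed base for the exponential backoff


def backoff_delay(attempt: int) -> float:
    """Delay after failed attempt number `attempt` (1-based): 1s, 2s, 4s, ..."""
    return BASE_DELAY_SECONDS * 2 ** (attempt - 1)


def consume(message: dict, process: Callable[[dict], None],
            dead_letters: List[dict], sleep: Callable[[float], None]) -> bool:
    """Try a message up to MAX_RECEIVES times; dead-letter it if all attempts fail."""
    for attempt in range(1, MAX_RECEIVES + 1):
        try:
            process(message)
            return True
        except Exception:
            if attempt < MAX_RECEIVES:
                sleep(backoff_delay(attempt))  # wait before the next attempt
    dead_letters.append(message)  # an alerting hook would go here
    return False
```

Passing sleep in as a parameter keeps the function testable; production code would pass time.sleep or rely on the broker's visibility timeout instead.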

6. Ordering

Assume no global order unless the provider guarantees it for a narrow stream. Design for per-aggregate ordering (e.g. per customer_id partition key in a queue) if you need serialization.
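
One way to get per-aggregate ordering is to route every event for the same customer to the same partition, so a single consumer sees that customer's events in sequence. The helper below is a hypothetical sketch of such routing; many brokers offer a native equivalent (for example a message group or session key), which is usually preferable to hand-rolled hashing.

```python
import hashlib


def partition_for(customer_id: str, partitions: int) -> int:
    """Map a customer to a stable partition index in [0, partitions)."""
    # hashlib is used instead of the built-in hash(), which is salted per
    # process for strings and would route inconsistently across workers.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions
```

Events for different customers may still interleave, which matches the assumption of no global order.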
