Idempotency and retries for webhook handlers
At-least-once delivery, deduplication keys, fast ACK + async workers, HMAC verification, and DLQs—with Stripe and GitHub references.
HTTP webhooks are almost always at-least-once: networks retry, your handler may time out, and the sender may redeliver on non-2xx responses. Your system must be correct when the same logical event arrives twice or out of order relative to other events.
1. The idempotency contract
Idempotency means: processing the same message multiple times has the same effect as processing it once. For HTTP handlers, that usually means:
- Extract a stable event identifier (`event_id`, `delivery_id`, or a hash of payload + type + timestamp window).
- Persist "seen" before irreversible side effects (charge, shipment, permanent DB mutation).
- Return `2xx` quickly so the sender stops retrying (when the work is done or safely queued).
Stripe documents idempotent API requests via `Idempotency-Key` headers (see Idempotent requests). Webhook delivery deduplication is separate: use Stripe's `event.id` as your natural key.
GitHub sends `X-GitHub-Delivery` per delivery, which is ideal for dedupe rows.
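Extracting the dedupe key per provider might look like the sketch below; the fallback hash branch is an assumption for providers that send no delivery id, and `dedupe_key` is an illustrative name:

```python
import hashlib
import json

def dedupe_key(provider: str, headers: dict, payload: dict) -> str:
    """Derive a stable event identifier for deduplication."""
    if provider == "stripe":
        return payload["id"]                 # Stripe event id, e.g. "evt_..."
    if provider == "github":
        return headers["X-GitHub-Delivery"]  # unique per delivery
    # Fallback (assumption): hash the canonicalized body. Add a coarse
    # timestamp window if the provider may legitimately resend identical payloads.
    body = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(body).hexdigest()
```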
2. Pattern: insert-first deduplication
Sketch for a relational store:
```sql
-- Pseudocode schema
CREATE TABLE processed_webhook_events (
  provider    TEXT NOT NULL,
  event_id    TEXT NOT NULL,
  received_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  PRIMARY KEY (provider, event_id)
);
```
Handler flow:
```sql
BEGIN;
INSERT INTO processed_webhook_events (provider, event_id)
VALUES ('stripe', $1)
ON CONFLICT DO NOTHING;
-- rowcount = 0 → duplicate → COMMIT and return 200
COMMIT;
-- rowcount = 1 → first delivery → proceed with business logic (or enqueue)
```
Use your database's exact semantics (`ON CONFLICT`, `INSERT ... IF NOT EXISTS`) to avoid race conditions between two concurrent deliveries of the same event.
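As a runnable sketch of the insert-first flow, here is the same idea against SQLite, whose `INSERT OR IGNORE` plays the role of Postgres's `ON CONFLICT DO NOTHING`; the function name and in-memory database are illustrative:

```python
import sqlite3

# In-memory store for the sketch; a real system uses a durable database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE processed_webhook_events ("
    "  provider TEXT NOT NULL,"
    "  event_id TEXT NOT NULL,"
    "  PRIMARY KEY (provider, event_id))"
)

def is_first_delivery(conn: sqlite3.Connection, provider: str, event_id: str) -> bool:
    """Insert-first dedupe: True only for the first delivery of an event.

    A duplicate key inserts zero rows, so rowcount distinguishes
    "first delivery" from "already seen" atomically.
    """
    cur = conn.execute(
        "INSERT OR IGNORE INTO processed_webhook_events (provider, event_id)"
        " VALUES (?, ?)",
        (provider, event_id),
    )
    conn.commit()
    return cur.rowcount == 1
```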
3. Fast ACK vs heavy work
For expensive work, split the pipeline:
- Handler: verify signature, parse JSON, enqueue minimal payload (event id + references) to SQS / Azure Service Bus / RabbitMQ.
- Worker: idempotent processing with its own retries and dead-letter queue after N failures.
On AWS, Lambda with an SQS event source mapping plus a DLQ covers both halves. The split keeps HTTP timeouts from killing long-running jobs.
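The handler/worker split can be sketched with an in-process queue standing in for SQS or Service Bus; names are illustrative, and a real deployment would enqueue to a durable broker rather than process-local memory:

```python
import json
import queue
import threading

work_queue: "queue.Queue" = queue.Queue()
processed: list = []

def handle_webhook(event_id: str, event_type: str) -> int:
    """HTTP handler: signature verification and parsing happen upstream;
    enqueue a minimal payload (ids and references only) and ACK at once."""
    work_queue.put(json.dumps({"event_id": event_id, "type": event_type}))
    return 200  # sender stops retrying; heavy work happens on the worker

def worker_loop() -> None:
    """Worker: drain jobs and process them idempotently (sketch)."""
    while True:
        msg = work_queue.get()
        if msg is None:  # shutdown sentinel
            break
        processed.append(json.loads(msg)["event_id"])
        work_queue.task_done()

threading.Thread(target=worker_loop, daemon=True).start()
```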
4. Signature verification and replay windows
Verify HMAC (or provider-specific signing) before any heavy logic. Examples:
- Stripe: `Stripe-Signature` header with a tolerance for clock skew (the `t` timestamp in the signed payload).
- GitHub: Validating webhook deliveries with `X-Hub-Signature-256`.
Reject stale events when the spec allows (e.g. timestamp outside tolerance) to limit replay attacks.
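A minimal verification sketch using Python's stdlib `hmac`: the GitHub scheme is `sha256=` prefixing an HMAC-SHA256 over the raw request body, and `within_tolerance` is an illustrative helper for Stripe-style timestamp checks:

```python
import hashlib
import hmac
import time

def verify_github_signature(secret: bytes, body: bytes, header: str) -> bool:
    """Validate X-Hub-Signature-256: 'sha256=' + HMAC-SHA256(secret, body).
    compare_digest avoids leaking the match position via timing."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, header)

def within_tolerance(event_ts: int, tolerance_s: int = 300, now: float = None) -> bool:
    """Reject events whose signed timestamp falls outside the replay window."""
    now = time.time() if now is None else now
    return abs(now - event_ts) <= tolerance_s
```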
5. Retries: who retries whom?
| Layer | Typical behavior |
|---|---|
| Sender | Retries on connection errors or 5xx; typically does not retry on 4xx, so a bug that returns 4xx silently drops events |
| Your HTTP client calling third parties | Use backoff + idempotency keys for your outbound calls |
| Queue consumer | Exponential backoff; DLQ after max receives |
Do not infinitely retry poison messages—move them to DLQ and alert.
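The consumer row above can be sketched as full-jitter exponential backoff with a receive cap; `process_with_retries` and the list-backed DLQ are illustrative stand-ins for a broker's redrive policy:

```python
import random

def backoff_seconds(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter backoff: random delay in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def process_with_retries(msg, handler, dlq, max_receives: int = 5) -> bool:
    """Invoke handler up to max_receives times; park poison messages in dlq."""
    for attempt in range(max_receives):
        try:
            handler(msg)
            return True
        except Exception:
            _delay = backoff_seconds(attempt)  # a real consumer sleeps/requeues here
    dlq.append(msg)  # poison message: park it and alert, never loop forever
    return False
```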
6. Ordering
Assume no global order unless the provider guarantees it for a narrow stream. Design for per-aggregate ordering (e.g. per customer_id partition key in a queue) if you need serialization.
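Per-aggregate ordering usually reduces to a stable partition key (an SQS FIFO `MessageGroupId`, a Kafka message key). A hash-based sketch, where the partition count and function name are illustrative:

```python
import hashlib

NUM_PARTITIONS = 8  # illustrative; match your queue/stream shard count

def partition_for(customer_id: str) -> int:
    """Stable partition assignment: the same customer always lands on the
    same partition, so that customer's events are consumed in order."""
    digest = hashlib.sha256(customer_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS
```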
References
- MDN: HTTP 429 / Retry-After — respect backoff hints when you are the client
- Stripe: Webhooks best practices
- GitHub: Webhook events and payloads
Related: spreadsheet to serverless, Azure OpenAI production patterns.