
Shopify webhook reliability checklist for production apps

12 production-grade practices for handling Shopify webhooks: HMAC verification, idempotency, retries, reconciliation, alerting, audit trails, and replay.

A 12-item checklist for making Shopify webhook handling production-grade. Every item below has burned someone in the last year.

The checklist

1. Verify HMAC on every request

Use constant-time comparison (ActiveSupport::SecurityUtils.secure_compare in Ruby, crypto.timingSafeEqual in Node). Naive string equality leaks signature bytes via timing attacks. Reject anything without a valid signature with HTTP 401, before any other processing.
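A minimal sketch in plain Ruby (stdlib only, no Rails), assuming you already have the raw body and the X-Shopify-Hmac-Sha256 header value in hand:

```ruby
require "openssl"
require "base64"

# Verify the X-Shopify-Hmac-Sha256 header against the raw request body.
# Returns true only on a byte-exact, constant-time match.
def valid_shopify_hmac?(secret, raw_body, header_value)
  digest   = OpenSSL::HMAC.digest("sha256", secret, raw_body)
  expected = Base64.strict_encode64(digest)
  # Length guard first: fixed_length_secure_compare requires equal sizes.
  return false unless expected.bytesize == header_value.bytesize
  OpenSSL.fixed_length_secure_compare(expected, header_value)
end
```

`OpenSSL.fixed_length_secure_compare` ships with Ruby 3.0+; on Rails, `ActiveSupport::SecurityUtils.secure_compare` does the same job.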

2. Read the raw request body before parsing

HMAC is over the raw bytes Shopify sent. If your framework JSON-parses the body before you get to it, the bytes you re-serialize for verification won't match. Capture the raw body in middleware (Ruby: a Rack middleware that reads env["rack.input"] and stashes it). This is the #1 cause of "HMAC verification randomly fails."
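A sketch of that middleware (the `shopify.raw_body` env key is our own naming, not a Rack convention):

```ruby
require "stringio"

# Rack middleware that captures the raw request body before any framework
# parsing, then rewinds the stream so downstream readers see it untouched.
class RawBodyCapture
  def initialize(app)
    @app = app
  end

  def call(env)
    raw = env["rack.input"].read
    env["rack.input"] = StringIO.new(raw)   # restore a readable stream
    env["shopify.raw_body"] = raw           # verify HMAC against these bytes
    @app.call(env)
  end
end
```

Downstream, verify the HMAC against `env["shopify.raw_body"]`, never against re-serialized JSON.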

3. Respond 200 in under 5 seconds

Shopify times out at 5s and counts the timeout as a failure. Persist the raw body, return 200 immediately, then process async via background jobs. Anything slower burns Shopify's retry attempts for nothing.
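The shape of that handler, sketched with in-memory stand-ins (`STORE` for a database table, `QUEUE` for a background-job system such as Sidekiq; both are assumptions):

```ruby
STORE = []   # stand-in for a durable raw_events table
QUEUE = []   # stand-in for a background-job queue

# Ack fast, work later: persist first, enqueue second, return 200 well
# under Shopify's 5-second timeout. No business logic runs in-request.
def handle_webhook(event_id, raw_body)
  STORE << { event_id: event_id, raw_body: raw_body }
  QUEUE << event_id
  [200, "ok"]
end
```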

4. Idempotency: dedup on X-Shopify-Event-Id

Shopify retries on its own. You'll see the same event multiple times. X-Shopify-Event-Id is the unique identifier Shopify documents for deduplication — the same value persists across retries of the same event. (Don't use X-Shopify-Webhook-Id for this; that header identifies the webhook subscription, not the event.) Store the event ID in a table with a unique index, ignore on collision. Without this, a retry that arrives after you already processed creates duplicate orders/charges/whatever.
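The dedup check, sketched with a Set and a Mutex standing in for a database table with a unique index on the event id (insert, ignore on collision):

```ruby
require "set"

SEEN      = Set.new   # stand-in for a table uniquely indexed on event_id
SEEN_LOCK = Mutex.new

# Returns true the first time an event id is seen, false on any retry.
def first_delivery?(event_id)
  SEEN_LOCK.synchronize { !SEEN.add?(event_id).nil? }
end
```

In production the atomicity comes from the database's unique index, not an in-process lock, so it holds across multiple app servers.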

5. Add your own retry layer

If your async processing fails, retry with exponential backoff for at least 24 hours. Sidekiq's defaults (25 retries over roughly 20 days) are fine; some teams tune to 12 retries over 7 days. The point is to outlast routine outages on your side without losing the event.
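For a feel for the schedule, here is Sidekiq's default backoff formula with the random jitter term omitted for determinism (a sketch, not Sidekiq's exact code):

```ruby
# Delay before retry number `count` (0-based), in seconds:
# quartic growth plus a 15-second floor.
def retry_delays(max_retries)
  (0...max_retries).map { |count| (count ** 4) + 15 }
end
```

Summing 25 of these delays lands around 20 days, which is why the defaults comfortably outlast a weekend outage.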

6. DLQ for events that exhaust retries

Don't silently drop events that fail every retry. Move them to a dead-letter queue with the original payload, attempt history, and last error. A human needs to look at these — make them visible.
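A minimal sketch of what a DLQ record should carry (`DEAD_LETTERS` stands in for a dead-letter table; an assumption):

```ruby
require "time"

DEAD_LETTERS = []  # stand-in for a dead_letter_events table

# Record everything a human needs to triage: the original payload,
# the attempt history, the last error, and when it gave up.
def dead_letter!(event_id, raw_body, attempts, last_error)
  DEAD_LETTERS << {
    event_id:   event_id,
    raw_body:   raw_body,
    attempts:   attempts,
    last_error: last_error.to_s,
    failed_at:  Time.now.utc.iso8601
  }
end
```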

7. Reconcile against the Admin API

Hourly, fetch the latest orders/products from Shopify and compare against what you've ingested. Synthesize webhooks for any gaps. This is the only defense against Shopify-side delivery failures and against your endpoint being briefly down for the entire 4-hour Shopify retry window.
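The core of the reconciliation job is a set difference. The two ID lists are assumptions: in production one comes from paging the Admin API, the other from your own database.

```ruby
# IDs Shopify has that we never ingested.
def missing_order_ids(shopify_ids, ingested_ids)
  shopify_ids - ingested_ids
end

# Synthesize a webhook-shaped event for each gap and feed it through
# the same async pipeline real deliveries use.
def reconcile(shopify_ids, ingested_ids)
  missing_order_ids(shopify_ids, ingested_ids).map do |id|
    { topic: "orders/create", synthesized: true, order_id: id }
  end
end
```

Marking synthesized events lets the audit trail distinguish them from real deliveries later.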

8. Monitor subscription health

Daily, GET the webhook subscriptions list via Admin API. Alert if any expected subscription is missing — once 8 retry attempts fail within the 4-hour window, Shopify removes the subscription. Shopify does send a "your webhook is failing" warning email to your Partner emergency developer email, but most teams' alerts and Partner-account email go to different inboxes. Treat the API check as your source of truth and re-register if anything is missing.
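The check itself is a set difference against the topics you expect; the topic list below is illustrative, and fetching the subscribed topics from the Admin API is left as an assumption:

```ruby
# Topics this app cannot function without (example values).
EXPECTED_TOPICS = %w[orders/create orders/updated app/uninstalled]

# Anything returned here should trigger an alert and a re-registration.
def missing_subscriptions(subscribed_topics)
  EXPECTED_TOPICS - subscribed_topics
end
```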

9. Plan for secret rotation

You will rotate your webhook signing secret eventually. Build a grace window during which both the old and new secrets are accepted; otherwise, every event in flight during rotation gets rejected. An hour is usually enough; for GDPR webhooks (which may arrive much later), allow 72 hours.
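A sketch of grace-window verification: try every currently-active secret, newest first. During rotation the list is `[new, old]`; afterwards it shrinks back to one.

```ruby
require "openssl"
require "base64"

# Accept a signature made with any active secret (constant-time per secret).
def hmac_matches_any?(secrets, raw_body, header_value)
  secrets.any? do |secret|
    expected = Base64.strict_encode64(
      OpenSSL::HMAC.digest("sha256", secret, raw_body)
    )
    expected.bytesize == header_value.bytesize &&
      OpenSSL.fixed_length_secure_compare(expected, header_value)
  end
end
```

Expire the old secret on a timer, not manually, so the window can't be forgotten open.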

10. Audit-log every action

Every replay, every secret rotation, every DLQ resolution should be logged with who/what/when. When a customer asks "why was this order processed twice," you need to be able to answer with timestamps.
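The entry itself is trivial; the discipline is calling it everywhere. A sketch (`AUDIT_LOG` stands in for an append-only table):

```ruby
require "time"

AUDIT_LOG = []  # stand-in for an append-only audit table

# who / what / when, for replays, rotations, and DLQ resolutions alike.
def audit!(actor, action, subject)
  AUDIT_LOG << { actor: actor, action: action, subject: subject,
                 at: Time.now.utc.iso8601 }
end
```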

11. Respect Admin API rate limits in your reconciliation

Shopify's GraphQL Admin API uses leaky-bucket rate limiting. Bulk operations are your friend for large reconciliation runs because they don't count against the standard rate limit. Use them for the hourly reconciliation pulls, and regular GraphQL queries for low-frequency spot checks.
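A sketch of the request payload for a bulk reconciliation pull. The `bulkOperationRunQuery` mutation is real Admin API; the inner query's fields here are illustrative and should be checked against the current docs.

```ruby
require "json"

# Build the JSON body for a bulkOperationRunQuery request.
def bulk_orders_payload
  inner = "{ orders { edges { node { id createdAt } } } }"
  query = <<~GRAPHQL
    mutation {
      bulkOperationRunQuery(query: """#{inner}""") {
        bulkOperation { id status }
        userErrors { field message }
      }
    }
  GRAPHQL
  { query: query }.to_json
end
```

You then poll the returned operation until it completes and download the result file, rather than paging through the live API.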

12. Alert on the right things

Don't alert on every failed delivery — too noisy. Do alert on:

  • DLQ size growing unexpectedly (e.g., >0 for >1 hour)
  • Sustained increase in average attempt count (downstream is flapping)
  • HMAC rejection rate >0.1% (a secret or sender is misconfigured)
  • Reconciliation finding gaps (Shopify-side issue, or your subscription is down)
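The thresholds above as alert predicates (the numbers are this checklist's suggestions, not universal constants):

```ruby
# DLQ non-empty for more than an hour.
def dlq_alert?(dlq_size, minutes_nonempty)
  dlq_size > 0 && minutes_nonempty > 60
end

# More than 0.1% of deliveries failing HMAC verification.
def hmac_rejection_alert?(rejected_count, total_count)
  total_count > 0 && rejected_count.fdiv(total_count) > 0.001
end

# Any gap found by reconciliation is worth a page.
def reconciliation_alert?(gap_count)
  gap_count.positive?
end
```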

How much of this can you actually build?

Honestly: you can build all of it. None of these items is novel. But each one is a distraction from your actual product. The forwarding pipeline is 200 lines; the operational tooling around it (DLQ UI, replay flow, audit log, alerts, reconciliation runs, secret rotation) is 5,000+ lines and 3+ months of engineering attention.

For most Shopify apps, building this from scratch isn't the right call — you can outsource the entire reliability layer to a tool like HookRescue for less than the cost of an engineer-week per year. The decision factor is whether webhook reliability is a core competency for your business or a tax you'd rather pay.

Production-grade in 3 minutes

HookRescue implements every item in this checklist out of the box. The setup is one URL change in your Shopify webhook subscription. Your existing handler stays exactly the same. We re-sign with your secret so HMAC verification continues to work; we forward synchronously and your handler returns 200 to us; we handle retries, DLQ, reconciliation, alerts, audit log, secret rotation grace, all of it.

Free during private beta — no credit card.
