Timestamps don't solve the issue, and neither do "thin payloads": the receiver has no idea how long to wait before assuming the order is settled, and a problem on the sender side could cause logic errors for all of your clients.
Most of these problems are solved if the receiver doesn't process the webhook immediately, but instead queues it internally. You don't have issues with the queue being stalled due to one bad webhook, because there is no event-specific processing happening on the receiver (other than perhaps ignoring some events). The queue can still be stalled if there is a wider problem, but as soon as the problem is resolved, the system can catch up on those queued webhooks, and synchronization integrity is maintained.
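A minimal sketch of this receive-and-queue pattern (function names and the in-memory queue are illustrative; a real receiver would ack from its HTTP handler and use a durable store rather than a process-local queue):

```python
import json
import queue

# Stand-in for a durable queue; in production this would be a
# persistent store (database table, Redis stream, SQS, ...).
inbox: "queue.Queue[dict]" = queue.Queue()

def handle_webhook(raw_body: bytes) -> int:
    """Receiver endpoint: validate minimally, enqueue, ack immediately.

    No event-specific processing happens here, so one bad event
    cannot stall acknowledgement of the events behind it.
    """
    try:
        event = json.loads(raw_body)
    except json.JSONDecodeError:
        return 400  # reject malformed payloads outright
    inbox.put(event)
    return 200  # ack fast; the real work happens in the worker below

def drain_once() -> list:
    """Worker: process queued events in arrival order, skipping unknown types."""
    processed = []
    while not inbox.empty():
        event = inbox.get()
        if event.get("type") not in {"customer.created", "card.created"}:
            continue  # ignoring some events is the only per-event logic here
        processed.append(event["type"])
    return processed
```

If the worker stalls on a wider problem, the queue simply accumulates; once the problem is fixed, `drain_once` catches up in order.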
Having said all that, if I were to design a new system I would go with a pull-based system instead. In this system, the client would request a range (start time, max count) of events via an HTTP request, and the response would include the "end time" that can be used in the next query. A "webhook" would contain an empty payload, and would simply indicate that the queue had become non-empty - this could be omitted entirely if realtime updates are not required, instead having the client poll.
The advantages of this approach are that it's easy for consumers to "replay" a set of events if they accidentally lose them, and it's also a lot more efficient, since many events can be sent per request (we gain some of this benefit at the moment by supporting "batch" webhooks containing multiple events, but it requires opt-in from the client.) Additionally, it allows webhooks to be versioned more easily, since you can have versioned endpoints for fetching events, and it also allows you to have an arbitrary number of consumers of the same set of events with no additional complexity.
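The client side of that pull loop can be sketched as follows, assuming a hypothetical `/events?start=...&max_count=...` endpoint whose response carries an `end_time` cursor (all names here are illustrative, not a real API):

```python
from typing import Callable

def poll_events(fetch: Callable[[float, int], dict],
                start: float, max_count: int = 100) -> tuple:
    """Pull one page of events and return (events, next_start).

    `fetch(start, max_count)` stands in for an HTTP GET against the
    hypothetical /events endpoint; its "end_time" field is the cursor
    to pass as `start` on the next call. Replay after data loss is
    trivial: just rewind the cursor.
    """
    page = fetch(start, max_count)
    return page["events"], page["end_time"]

# A fake server response for illustration.
_log = [{"id": i, "time": float(i)} for i in range(5)]

def fake_fetch(start: float, max_count: int) -> dict:
    batch = [e for e in _log if e["time"] >= start][:max_count]
    end = batch[-1]["time"] + 1 if batch else start
    return {"events": batch, "end_time": end}
```

The empty-payload webhook from the comment above would only trigger an immediate `poll_events` call instead of waiting for the next polling interval.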
You obviously CAN guarantee ordering, it's just that you can't guarantee it as the sender alone; you need cooperation from the receiver. Additionally, putting them in a receive queue on the receiver doesn't solve the issue unless the receiver takes extra care to also read from the queue in strict (non-overlapping) order, which is rarely the case, and even then it has significant throughput implications. So it really is all on the receiver. This piece was written from the context of the sender.
Timestamps definitely don't solve the issue, I explicitly said to use a centralized sequence number if you must (not a great idea in most cases). Thin payloads: the idea behind that is essentially to use the webhooks as a "please update" kind of notification and then you get the most recent data from the server. Essentially what you called a "pull system", it's a combination of both a push (webhook) to know when to pull, and the pull to get the data. This also doesn't work as nicely in many scenarios (because oftentimes, receivers want the data immediately without having to fetch), but it's good in others.
Please take a look at the content of the article (rather than just the title), I've addressed most of it there too.
On your customer and card example, the issue is not message delivery order but processing order, or more precisely prerequisite satisfaction.
My first thought looking at it was to just store the data from any of the hooks coming in, check the prerequisites each time, and only process the whole thing once everything needed has arrived.
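That prerequisite-satisfaction idea can be sketched like this (event type names and the prerequisite table are made up for illustration):

```python
# Store every incoming hook, check prerequisites each time, and only
# act once everything an event depends on has been processed.
PREREQS = {"card.created": {"customer.created"}}

seen: dict = {}       # event type -> payload
processed: list = []  # event types acted on, in prerequisite order

def on_webhook(event: dict) -> None:
    seen[event["type"]] = event
    _try_process()

def _try_process() -> None:
    # Keep sweeping until no more events become processable.
    progress = True
    while progress:
        progress = False
        for etype in list(seen):
            if etype in processed:
                continue
            if PREREQS.get(etype, set()) <= set(processed):
                processed.append(etype)  # all prerequisites satisfied
                progress = True
```

Delivery order stops mattering: a `card.created` that arrives first simply waits in `seen` until its `customer.created` shows up.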
Trying to dictate order from the sender without any cooperation from the receiver seems like a fool's errand, as in any real world scenario where it really matters, the receiver will also want a way to check it actually received everything in order.
What happens if two transactions commit out of order? tx1 with a lower timestamp commits after tx2 with a higher timestamp has committed - and your client just saw tx2's timestamp.
Or what if you have ≥$maxCount events changed at the same exact timestamp?
If two transactions are non-causal, it doesn't matter which order the events arrive in the queue, but once the message is in the queue, the order is fixed.
> Or what if you have ≥$maxCount events changed at the same exact timestamp?
Use a sufficiently precise timestamp that this doesn't happen, or add a counter in the low bits. The only reason to use a timestamp rather than a simple incrementing counter is to make it more convenient for recipients to re-request historical events (e.g. I want to replay all events since yesterday) and to make debugging easier, since with a counter it's a bit meaningless.
The timestamp is not meaningful for the actual event, its only purpose is to specify where this event sits in the total order.
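A sketch of the "counter in the low bits" idea, assuming a single centralized allocator (which is what makes the order total; clock regression is not handled here):

```python
import threading
import time

class SequenceAllocator:
    """Allocate totally-ordered IDs: millisecond timestamp in the high
    bits, per-millisecond counter in the low 20 bits. The timestamp's
    only job is to make "replay since yesterday" queries convenient;
    the counter breaks ties when many events share one millisecond.
    """
    COUNTER_BITS = 20

    def __init__(self):
        self._lock = threading.Lock()
        self._last_ms = -1
        self._count = 0

    def next_id(self, now_ms=None) -> int:
        with self._lock:
            ms = int(time.time() * 1000) if now_ms is None else now_ms
            if ms == self._last_ms:
                self._count += 1  # same millisecond: bump the tiebreaker
            else:
                self._last_ms, self._count = ms, 0
            return (ms << self.COUNTER_BITS) | self._count
```

Recipients can still slice by time (`id >> 20`) for replay, while the full ID gives an unambiguous position in the total order.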
I used /events to apply writes from Stripe to a local database for this reason in the tdog CLI:
While I largely agree with you, I'm hesitant to say that it is always preferable to use an /events endpoint. There are three reasons:
1. This requires the client to essentially implement an event-sourced architecture. There are many advantages to such architectures, but they are more complicated and can be tricky to implement.
2. It's important to consider the direction of coupling in systems, and how that affects your ability to evolve the architecture of the whole system.
3. Polling is generally going to involve a higher amount of network traffic, and will have to be weighed against the latency requirements for processing an event.
Stripe just so happens to offer both?
Senders could guarantee ordering by only sending webhook n+1 after the HTTP request for webhook n completes, rather than sending them concurrently or in arbitrary order. For efficiency, perhaps only guarantee ordering for hooks related to each resource rather than all of a customer's hooks.
Or, include a monotonic counter in the webhook so the recipient can tell when it would apply an old state on top of a new one.
What the recipient does when they receive the webhook is up to them (delays, parallelism, etc.), but at least they'd know the correct event order.
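The recipient's side of the monotonic-counter idea can be sketched as follows (field names like `resource_id` and `seq` are illustrative):

```python
# Keep the highest counter applied per resource, and refuse to write
# an older state on top of a newer one.
last_applied: dict = {}   # resource id -> highest counter applied
state: dict = {}          # resource id -> current state

def apply_webhook(event: dict) -> bool:
    rid, seq = event["resource_id"], event["seq"]
    if seq <= last_applied.get(rid, -1):
        return False  # stale: a newer state was already applied
    last_applied[rid] = seq
    state[rid] = event["data"]
    return True
```

Whether the recipient drops stale events or parks them for later reconciliation is their call; the counter is what makes staleness detectable at all.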
The author raises a good point about what to do in the face of errors, but I'd vastly prefer to handle special behavior upon recipient error (stall, dead letter queue) to the current Stripe reality of "things come in out of order, and we don't give you the info needed to reassemble the order on your end".
A counter makes it slightly better, because then you can reconstruct the order without the artificial limit above. It's still not great, though it is indeed much better!
That's non-trivial engineering to foist upon every recipient of your webhooks.
I like the idea of the /events pull-based endpoint, which keeps engineering much simpler for the recipient: https://blog.sequin.io/events-not-webhooks/
Polling /events solves some problems but introduces others. A mix of push (webhooks) and pull (/events) can also work, which is what I was referring to with the "thin clients", though it's not a great experience for many use-cases and it requires state (many webhook recipients are stateless - e.g. Zapier or Slack).
Not webhook-specific, but I spent a couple of hours today figuring out that some of our calls to internal services look like they open, are sent, and are processing, but the target server doesn't even see the request for a full 8s sometimes. The call itself was not the problem; the service just hadn't started until long after the data was all sent.
That's not sufficient. Intermediate proxies can reorder your requests however they wish for whatever reason they want, and then change behavior with no notice at any time. In the real world of HTTP you'll get duplicates, false positives and every other conceivable failure mode.
> or at least identify the order of those webhooks and provide ways to identify or discover that order
Sure, you might invent some protocol that incorporates a sequence number or uses some chaining mechanism.
The thing is this: if you find yourself engaging in such gymnastics, and you're any good as an engineer, it should occur to you that you're using the wrong medium - hopefully long before you obligate yourself to the task. "Webhooks" are a pretty fragile thing to use when your requirements involve stuff like "order." If it did fail to occur to you, then you're unlikely to get whatever sequencing mechanism you invent working properly either, because that's actually a hard problem that doesn't yield to the sort of muddlers unaware that "You Can't Guarantee Webhook Ordering."
P.S. Svix is great, super happy customer here :)
You can even keep your existing webhook code by providing a synchronous bridge to Kafka: "just send them in order, but wait for the 200 before sending the next one." Boom, now you're guaranteed the events are recorded and processed in order.
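The bridge's ordering logic can be sketched like this (an in-memory list stands in for an ordered Kafka partition, and `send` stands in for the HTTP POST; a real bridge would use a Kafka consumer and commit offsets):

```python
# "Wait for the 200 before sending the next one": deliver events from
# an ordered partition one at a time, and stall on failure rather
# than deliver out of order (head-of-line blocking is the trade-off).
def bridge(partition: list, send, max_attempts: int = 3) -> list:
    delivered = []
    for offset, event in enumerate(partition):
        for _attempt in range(max_attempts):
            if send(event) == 200:
                delivered.append(offset)  # "commit" the offset only on success
                break
        else:
            break  # give up on this event -> stop, so order is preserved
    return delivered
```

The `delivered` offsets are strictly increasing with no gaps, which is exactly the guarantee the webhook sender alone could not provide.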