If you want a specific question to answer, try this one: why does PTP need hardware timestamping to achieve high precision (where the network card itself assigns timestamps to packets, rather than having the kernel do it as part of TCP/IP processing)? If we use software timestamps, why can we only get microsecond precision at best? If you understand this, it goes a very long way toward understanding the core ideas behind precise clock sync.
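A quick way to feel the problem: even without any networking, just measuring how late the kernel wakes a process shows jitter at the microsecond scale and above, and a software timestamp is taken after all of that machinery (interrupt, driver, scheduler) has run. A toy Python sketch, all names mine:

```python
import time

# Sample the scheduling jitter visible to user space: sleep a fixed
# interval and see how far each wakeup overshoots the target time.
# A software packet timestamp is taken after similar kernel machinery
# runs, so this jitter is roughly the floor of its accuracy.
def wakeup_jitter_ns(interval_s=0.001, samples=200):
    errors = []
    for _ in range(samples):
        target = time.monotonic_ns() + int(interval_s * 1e9)
        time.sleep(interval_s)
        errors.append(time.monotonic_ns() - target)  # overshoot in ns
    return errors

errs = wakeup_jitter_ns()
print(f"median overshoot: {sorted(errs)[len(errs) // 2] / 1e3:.1f} µs")
```

Hardware timestamping sidesteps all of this because the NIC stamps the packet right at the PHY, before interrupts, the driver, and the scheduler ever get involved.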
Once you have a solid understanding of PTP, look into White Rabbit. They’re able to sync two clocks with sub-ns precision. In case that isn’t obvious, that is absolutely insane.
[1] So do a lot of people. For example audio engineers. Once, an audio engineer absolutely talked my ear off about ptp. I had no idea that audio people understood clock sync so well but they do!
Indeed. PTP (various, not-necessarily compatible, versions) is at the core of modern ethernet-based audio networking: Dante (proprietary, PTP: IEEE 1588 v1), AVB (IEEE standard, PTP: 802.1AS), AES67 (AES standard, PTP: IEEE 1588 v2). And now the scope of the AVB protocol stack has been expanded to TSN for industrial and automotive time sensitive network applications.
Sadly, they're generally just a bit too expensive for me to justify it as a toy.
I don't work in trading (though not for lack of trying on my end), so most of the stuff I work on has been a lot more about "logical clocks", which are cool in their own right, but I have always wondered how much more efficient we could be if we had nanosecond-level precision to guarantee that locks are almost always uncontested.
[1] I'm not talking about those clocks that radio to Colorado or Greenwich, I mean the relatively small ones that you can buy that run locally.
This is only true if you use wall clock time as part of your database’s consistency algorithm. Generally I think this is a huge mistake. It’s almost always much easier to swap to a logical clock - which doesn’t care about wall time. And then you don’t have to worry about ntp.
The basic idea is this: event A happened before event B iff A (or something that happened after A) was observed by the node that generated B before B was generated. As a result, you end up with a dag of events - kind of like git. Some events aren’t ordered relative to one another. (We say, they happened concurrently). If you ever need a global order for all events, you can deterministically pick an arbitrary order for concurrent events by comparing ids or something. And this will give you a total order that will be the same on all peers.
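One classic concrete realization of this is a Lamport clock, where the deterministic tiebreak for concurrent events is just comparing (counter, node id) tuples. A toy sketch (assuming Python; the class and names are mine):

```python
# Toy Lamport clock: each event gets a (counter, node_id) stamp.
# Causality bumps the counter; concurrent events get ordered
# arbitrarily-but-deterministically by tuple comparison, which yields
# the same total order on every peer.
class Node:
    def __init__(self, node_id):
        self.id = node_id
        self.counter = 0

    def local_event(self):
        self.counter += 1
        return (self.counter, self.id)

    def receive(self, stamp):
        # Observing a remote event: jump past its counter.
        self.counter = max(self.counter, stamp[0]) + 1
        return (self.counter, self.id)

a, b = Node("a"), Node("b")
e1 = a.local_event()   # (1, "a")
e2 = b.receive(e1)     # (2, "b") -- causally after e1
e3 = a.local_event()   # (2, "a") -- concurrent with e2
total_order = sorted([e1, e2, e3])
print(total_order)     # [(1, 'a'), (2, 'a'), (2, 'b')]
```

Note e2 and e3 are concurrent (neither observed the other), but every peer sorting the tuples lands on the same order.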
If you make database events work like this, time is a little more complex. (It’s a graph traversal rather than simple numbers). But as a result the system clock doesn’t matter. No need to worry about atomic clocks, skew, drift, monotonicity, and all of that junk. It massively simplifies your system design.
Also I still remember having fun with the "Determine the order of events by saving a tuple containing monotonic time and a strictly monotonically increasing integer as follows" part.
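For the curious, a minimal sketch of that tuple trick, assuming Python: the strictly increasing integer breaks ties whenever two events land on the same clock reading.

```python
import itertools
import time

# (monotonic time, strictly increasing integer): tuple comparison gives a
# total order even when two events read identical clock values.
_seq = itertools.count()

def event_stamp():
    return (time.monotonic_ns(), next(_seq))

s1, s2 = event_stamp(), event_stamp()
assert s1 < s2  # holds even if monotonic_ns() returned the same value twice
```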
My take on this is that second timing is close enough for this. And that all my internal systems need to agree on the time. So if I'm off by 200ms or some blather from the rest of the world, I'm not overly concerned. I am concerned, however, if a random internal system is not synced to my own ntp servers.
This doesn't mean I don't keep our servers synced, just that being off by some manner of ms doesn't bother me inordinately. And when it comes to timing of events, yes, auto-increment IDs or some such are easier to deal with.
This post is about more complicated synchronization for more demanding applications. And it's very good. I'm just marveling at how in my lifetime I went from "no clock is ever set right" to assuming most anything is within a second of true time.
Once you get to international phones, you'll have places where the phone does not include all timezones, and is specifically missing the actual local timezone, so automatic sync is typically disabled so the time can be set such that the displayed time matches local time... even if that means the system time is not correct.
I don't think civilian clock synchronization has been an issue for a long time now.
DCF77 and WWVB have been around for more than 50 years. You could use some cheap electronics and get well below millisecond accuracy. GPS has been fully operational for 30 years, but it needs a more expensive receiver.
I suspect you could even get below one-second accuracy using a watch with a hacking movement and listening to a radio broadcast of the time pips.
The first manufactured GPS clock I owned (as in: switch it on and time is shown on a dedicated display) was in a 2007 Honda.
But a firmware bug ruined that clock: https://didhondafixtheclocks.com/
And even after it began displaying the right time again, it had the wrong date. It was offset by years and years, which was OK-ish, but also by several months.
Having the date offset by months caused the HVAC to behave in strange incurable ways because it expected the sun to be in positions where it was not.
But NTP? NTP has never been fickle for me, even in the intermittently-connected dialup days I experienced ~30 years ago: If I can get to the network occasionally, then I can connect to a few NTP servers and keep a local clock reasonably-accurate.
NTP has been resolutely awesome for me.
* NTP pool server usage requires using DNS
* people have DNSSEC set up, which requires accurate time or resolution fails
So if your clock is off, you cannot lookup NTP pool servers via DNS, and therefore cannot set your clock.
This sheer stupidity has been discussed with package maintainers of major distros, with ntpsec, and the result is a mere shrug. Often, the answer is "but doesn't your device have a battery backed clock?", which is quite unhelpful. Many devices (routers, IOT devices, small boards, or older machines, etc) don't have a battery backed clock, or alternatively the battery may just have died.
Beyond that, the ntpsec codebase has a horrible bug where if DNS is not available when ntpsec starts, pool server addresses are never, ever retried. So if you have a complete power-fail in a datacentre rack, and your firewalls take a little longer to boot than your machines, you'll have to manually restart ntpsec to even get it to ever sync.
When discussing this bug the ntpsec lads were confused that DNS might not exist at times.
Long story short, make sure you aren't using DNS in any capacity, in NTP configs, and most especially in ntpsec configs.
One good source is just using the IPs provided by NIST. Pool servers may seem fine, but I'd trust IPs assigned to NIST to outlast any DNS name anyhow, e.g. for decades.
I worked on the NTP infra for a very large organization some time ago, and the scariest thing I found was just how bad some of the clocks on 'commodity hardware' were. But this just added a new parameter for triaging hardware for manufacturer replacement.
This is an ok article but it's just so very superficial. It goes too wide for such a deep subject matter.
In particular I don’t think the intuitions necessary to do distributed computing well would come to someone who snoozed through physics, who never took intro to computer engineering.
Yeah. I was a physics major and it really helped to have had my naive assumptions about time and clocks completely demolished early on by taking classes in special and general relativity. When I eventually found my way into tech, a lot of distributed systems concepts that are difficult for other people (clock sync, indeterminate ordering of events, consensus) came quite naturally because of all that early training.
I think it's no accident that distributed systems theory guru Leslie Lamport had written an unpublished book on General Relativity before he wrote the famous Time, Clocks and the Ordering of Events in a Distributed System paper and the Paxos paper. In the former in particular the analogy to special relativity is quite plain to see.
you buy the hardware, plug it all in, and it works
It's to the point timing server vendors I've spoken to have their own test labs where they have to validate network gear and then publish lists of recommended and tested configurations.
Even some older cards where you'd think the PTP issues would be solved still have weird driver quirks in Linux!
Many years later, in 2020, I ended up living in San Francisco, and I had the fortune to meet Leslie Lamport after I sent him a cold email. Lovely and smart guy. This is the text of the first part of that email, just for your curiosity:
Hey Leslie!
You have accompanied me for more than 20 years. I first met your name when studying Lamport timestamps.
And then on, and on, and on, up to a few minutes ago, when I realized that you are also behind the paper and the title of the “Byzantine Generals Problem”, renamed from the “Albanian” generals at the suggestion of Jack Goldberg. Who is he? [1]
[0]: https://en.wikipedia.org/wiki/Lamport_timestamp
[1]: Jack Goldberg (now retired) was a computer scientist and Lamport's manager at SRI.
That’s the radical developer simplicity promised by TrueTime mentioned in the article.
What TrueTime says is that clocks are synchronized within some delta just like NTP, but that delta is significantly smaller thanks to GPS time sync. That enables applications to have tighter bounds on waiting to see if a conflict may exist before committing which is why Spanner is fast. CockroachDB works similarly but given the logistical challenge of getting GPS receivers into data centers, they worked to achieve a smaller delta through better NTP-like timestamps and generally get fairly close performance.
https://programmingappliedai.substack.com/p/what-is-true-tim...
> Bounded Uncertainty: TrueTime provides a time interval, [earliest, latest], rather than a single timestamp. This interval represents the possible range of the current time with bounded uncertainty. The uncertainty is caused by clock drift, synchronization delays, and other factors in distributed systems.
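The payoff of that interval is the commit-wait rule: assign the transaction the `latest` edge of the interval, then wait until a fresh interval's `earliest` edge has passed it. At that point the timestamp is guaranteed to be in the past on every node. A toy sketch in Python, with a made-up 2ms uncertainty (real TrueTime's epsilon is far smaller, thanks to the GPS and atomic references):

```python
import time

# Toy commit-wait. Assumes a now() that returns [earliest, latest] with
# true time guaranteed inside; EPS_S is an invented 2 ms uncertainty.
EPS_S = 0.002

def tt_now():
    t = time.time()
    return (t - EPS_S, t + EPS_S)

def commit_wait():
    _, latest = tt_now()
    commit_ts = latest                 # assign the commit timestamp
    while tt_now()[0] < commit_ts:     # wait out the uncertainty window
        time.sleep(EPS_S / 4)
    return commit_ts                   # now guaranteed in the past everywhere

ts = commit_wait()
assert tt_now()[0] >= ts
```

The wait is roughly 2x epsilon, which is why shrinking the uncertainty interval translates directly into commit latency and throughput.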
In distributed training (LLMs), the bottleneck is no longer just disk I/O or CPU cycles—it’s the "straggler problem" during collective communication (like All-Reduce). When you’re running on 400Gbps+ RoCE (RDMA over Converged Ethernet) networks, the network "wire time" is often lower than the clock jitter on a standard Linux kernel.
If your clocks are skewed by even 2-3 milliseconds, your telemetry becomes essentially useless. It looks like packets are arriving before they were sent, or worse, your profiling tools can’t accurately pinpoint which GPU is stalling the rest of the 16,384-node fleet. We’ve reached a point where microsecond-accurate clocks aren't just a requirement for HFT firms; they're becoming the baseline for anyone trying to keep hundreds of millions of dollars of NVIDIA GPUs from idling while they wait for a collective sync.
Agree this is the best solution, I’d rather have a tiny failover period than risk serialization issues. Working with FDB has been such a joy because it’s serializable it takes away an entire class of error to consider, leading to simpler implementation.
The consequence of having multiple time domains is pretty painful when you need to reconcile logs or transaction histories across systems with different sync accuracy. Millisecond NTP logs and sub-microsecond PTP logs don’t line up cleanly, so correlating events end-to-end can become guesswork rather than deterministic ordering.
If you want reliable cross-system telemetry and audit trails, you'll need a single, high-accuracy time sync approach across your whole stack.
Back in the day, way back in the 80's, IBM replaced the VM with VMXA. VM could trap and emulate all the important instructions since they were privileged instructions except one, the STCK (store clock) instruction. So virtual machines couldn't set their virtual clocks so they were always in sync. VMXA used new hw features that let you set the virtual clock. You could specify an offset to the system clock. But some of IBM's biggest customers depended on all the virtual machines clocks always being in sync. So VMXA had to add an option to disallow setting the clock for specified virtual machines.
Except all of development knew how trivial it was to trap or modify the STCK's to produce a timestamp of you choosing. This was before it was common knowledge the client code should never be trusted. But nobody enlightened IBM corporate management. It was a serious career limiting move at IBM. It didn't matter if you were right. So I'm pretty sure some serious fortunes were made as a result.
So the question for HFT is; are they using and trusting client timestamps, or are the timestamps being generated on the market maker's servers? If the latter, how would the customer know?
This is not entirely correct. What has been agreed is to allow deviations of more than one second after 2035, so that clocks have to be adjusted less frequently (on the order of every 50-100 years is the intention). However, the allowable deviation, and how to adjust clocks when it is exceeded, has yet to be decided.
https://www.usenix.org/system/files/conference/nsdi18/nsdi18...
The authors’ work forms the basis of what the team at Clockwork.io is building, enabling accurate one-way delay measurements (rather than just RTT/2) that improve latency visibility and telemetry across CPU and GPU infrastructure.
[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/set-time...
The best approach, imho, is to abandon the concept of a global time. All timestamps are wrt a specific clock. That clock will skew at a rate that varies with time. You can, hopefully, rely on any particular clock being monotonic!
My mental model is that you form a connected graph of clocks and this allows you to convert arbitrary timestamps from any clock to any clock. This is a lossy conversion that has jitter and can change with time. The fewer stops the better.
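A sketch of that graph model, assuming Python: each edge carries an affine estimate (skew and offset, all numbers invented here), and converting a timestamp along a path just composes the maps. The jitter comes from those per-edge estimates being stale or noisy, which is why fewer hops is better.

```python
# Each directed edge (src, dst) carries an affine estimate:
#   t_dst ~= skew * t_src + offset
# Converting along a path composes the per-edge maps. All numbers invented.
edges = {
    ("gpu", "cpu"): (1.0000003, 12_500.0),   # (skew, offset in ns)
    ("cpu", "nic"): (0.9999998, -4_200.0),
}

def convert(t_ns, path):
    # Walk the path edge by edge, applying each affine conversion in turn.
    for src, dst in zip(path, path[1:]):
        skew, offset = edges[(src, dst)]
        t_ns = skew * t_ns + offset
    return t_ns

t_nic = convert(1_000_000_000.0, ["gpu", "cpu", "nic"])
```

In a real system each (skew, offset) pair would be re-estimated continuously, so the same source timestamp can convert to slightly different values over time, which is exactly the lossiness described above.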
I kinda don’t like PTP. Too complicated and requires specialized hardware.
This article only touches on one class of timesync. An entirely separate class is timesync within a device. Your phone is a highly distributed compute system with many chips each of which has their own independent clock source. It’s a pain in the ass.
You also have local timesync across devices such as wearables or robotics. Connecting to a PTP system with GPS and atomic clocks is not ideal (or necessary).
TicSync is cool and useful. https://sci-hub.se/10.1109/icra.2011.5980112
At this stage, it's difficult to find a half-decent-quality ethernet MAC that doesn't have PTP timestamping. It's not a particularly complicated protocol, either.
I needed to distribute PPS and 10MHz into a GNSS-denied environment, so last summer I designed a board to do this using 802.1AS gPTP with a uBlox LEA-M8T GNSS timing receiver, a 10MHz OCXO and an STM32F767 MCU. This took me about four weeks. Software is written in C, and the PTP implementation accounts for 1500 LOC.
?????
I run PTP on everything from RPI's to you name it, over fiber, ethernet, etc.
The main thing hardware gives is filtration of PTP packets or hardware timestamping.
Neither is actually required, though some software has decided to require it.
Additionally, something like 99% of sold gigabit or better chipsets since 2012 support it (I210 et al)
In my view the specialised hardware is just a way to get more accurate transmission and arrival timestamps. That's useful whether or not you use PTP.
> My mental model is that you form a connected graph of clocks and this allows you to convert arbitrary timestamps from any clock to any clock. This is a lossy conversion that has jitter and can change with time.
This sounds like the "peer to peer" equivalent to PTP. It would require every node to maintain state about its estimate (skew, slew, variance) of every other clock. I like the concept, but obviously it adds complexity to end-stations beyond what PTP requires (i.e. increases the hardware cost of embedded implementations). Such a system would also need to model the network topology, or control routing (as PTP does), because packets traversing different routes to the same host will experience different delay and jitter statistics.
> TicSync is cool
I hadn't seen this before, but I have implemented similar convex-hull based methods for clock recovery. I agree this is obviously a good approach. Thanks for sharing.
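For anyone curious what those bound methods look like in the simplest case: each round trip gives a one-sided bound on the clock offset (because one-way delays can't be negative), and intersecting the bounds across probes shrinks the interval. This is just the constant-offset special case; TicSync's convex hull additionally tracks drift over time. A toy sketch in Python, with invented numbers:

```python
# Bound intersection for a constant offset. Each round trip gives:
#   t2 - t1 = delay_fwd + offset   =>  offset <= t2 - t1
#   t4 - t3 = delay_back - offset  =>  offset >= t3 - t4
# (delays are nonnegative), so intersecting over probes tightens the bound.
def offset_bounds(probes):
    lo = max(t3 - t4 for (t1, t2, t3, t4) in probes)
    hi = min(t2 - t1 for (t1, t2, t3, t4) in probes)
    return lo, hi  # true offset lies in [lo, hi]

# Fake probes generated with a true offset of +5 and varying one-way delays.
probes = [(0, 7, 8, 5), (10, 16, 17, 13), (20, 28, 29, 25)]
lo, hi = offset_bounds(probes)
print(lo, hi)  # the true offset (5) is inside [4, 6]
```

The lowest-delay probes dominate the bound, which is why these methods keep improving as long as you occasionally get a lucky fast round trip.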
Well, it requires having the conversion function for each edge in the traversed path. And such function needs to exist only at the location(s) performing the conversion.
> obviously it adds complexity to end-stations beyond what PTP requires
If you have PTP and it works then stick with it. If you’re trying to timesync a network of wearable devices then you don’t have PTP stamping hardware.
> because packets traversing different routes
Fair callout. It’s probably a more useful model for less internty use cases. Of which there are many!
For example when trying to timesync a collection of different sensors on different devices/microcontrollers.
Roboticists like CanBus and Ethercat. But even that is kinda overkill imho. TicSync can get you tens of microseconds of precision in user space.
A regular pulse is emitted from a specialized high-precision device, possibly over a specialized high-precision network.
Enables picosecond accuracy (or at least sub-nano).
As a teacher I love the way Judah Levine explains things.
Hot take: I've seen this and enough other badly configured time sync settings that I want to ban system time from robotics systems - time from startup only! If you want to know what the real world time was for a piece of data after, write what your epoch is once you have a time sync, and add epoch+start time.
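A minimal sketch of that scheme, assuming Python: log only monotonic time, record the wall-clock epoch exactly once at first sync, and reconstruct wall time offline. The class and names are mine:

```python
import time

# "Time from startup only": timestamps in the log are monotonic, and the
# wall-clock epoch is written once, at the moment of the first trusted sync.
class MonoLogger:
    def __init__(self):
        self.epoch = None
        self.mono_at_sync = None

    def on_time_sync(self, wall_time_s):
        if self.epoch is None:            # record the epoch exactly once
            self.epoch = wall_time_s
            self.mono_at_sync = time.monotonic()

    def stamp(self):
        return time.monotonic()           # what actually goes in the log

    def to_wall(self, mono_ts):
        # Offline reconstruction: epoch + elapsed monotonic time.
        return self.epoch + (mono_ts - self.mono_at_sync)
```

The nice property is that a later NTP step or a misconfigured timezone can never reorder or corrupt the logged data, only the offline wall-clock reconstruction.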
But it doesn’t have to be the first requirement you relax.
But I just watched/listened to a Richard Feynman talk on the nature of time and clocks and the futility of "synchronizing" clocks. So I'm chuckling a bit. In the general sense, I mean. Yes yes, for practical purposes in the same reference frame on earth, it's difficult but there's hope. Now, in general ... synchronizing two clocks is ... meaningless?
Alice and Bob, in different reference frames, both witness events C and D occurring. Alice says C happened before D. Bob says D happened before C. They're both correct. (And good luck synchronizing your watches, Alice and Bob!)
I hate to break it to you, but you were fooled by an AI dupe. It also took me a while to realise this. It’s sad we live in this tiring world where we have to fact-check every single piece of content for authenticity. It’s just tiring. I’m sure many will reply that it doesn’t matter, which of course will be funny to consider given someone went to the trouble of voice-cloning Feynman to make a channel of content (copyrighted of course) while claiming “no disrespect intended”.
For starters, the spacetime interval between two events IS a Lorentz invariant quantity. That could probably be used to establish a universal order for timelike separations between events. I suspect that you could use a reference clock, like a pulsar or something to act as an event against which to measure the spacetime interval to other events, and use that for ordering. Any events separated by a light-like interval are essentially simultaneous to all observers under that measure.
The problem comes for events with a spacelike or lightlike separation. In that case, the spacetime interval is still invariant, but I’m not sure how you assign an order to them. Perhaps the same system works without modification, but I’m not sure.
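The invariance itself is easy to check numerically. A small Python sketch boosting an event separation by v = 0.6c, in units with c = 1:

```python
import math

# Verify that s^2 = (c*dt)^2 - dx^2 is unchanged by a Lorentz boost
# (c = 1, one spatial dimension).
def boost(t, x, v):
    g = 1.0 / math.sqrt(1.0 - v * v)   # Lorentz factor
    return g * (t - v * x), g * (x - v * t)

dt, dx = 3.0, 1.0                      # a timelike separation (s^2 > 0)
s2 = dt * dt - dx * dx
t_b, x_b = boost(dt, dx, 0.6)
s2_boosted = t_b * t_b - x_b * x_b
assert abs(s2 - s2_boosted) < 1e-9     # the interval is invariant
```

For this timelike pair every observer also agrees on the sign of dt, so the ordering is safe; flip to dt, dx = 1.0, 3.0 (spacelike) and the boosted dt changes sign, which is exactly the ordering problem above.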
In multicast IP mode, with multiple switches, it requires what anything running multicast between switches/etc would require (i.e. some form of IGMP snooping or multicast routing or .....)
In unicast IP mode, it requires nothing from your network.
Therefore, I have no idea what it means to "require support on the network".
I have used both ethernet and multicast PTP across a complete mishmash of brands and types and medias of switches, computers, etc, with no issues.
The only thing that "support" might improve is more accurate path delay data through transparent clocks. If both master and slave do accurate hardware timestamping already, and the path between them is constant, it is easily possible to get +-50 nanoseconds without any transparent clock support.
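For reference, the offset/delay math behind numbers like that is just the standard delay request-response calculation from four timestamps, under the assumption that the path delay is symmetric (which is exactly the assumption transparent clocks exist to repair). A small sketch in Python, with invented numbers:

```python
# Standard PTP delay request-response math, assuming symmetric path delay:
#   t1: master sends Sync        t2: slave receives it
#   t3: slave sends Delay_Req    t4: master receives it
def ptp_offset_delay(t1, t2, t3, t4):
    offset = ((t2 - t1) - (t4 - t3)) / 2   # slave clock minus master clock
    delay = ((t2 - t1) + (t4 - t3)) / 2    # estimated one-way path delay
    return offset, delay

# Invented timestamps (ns): true offset +100, true one-way delay 500.
off, d = ptp_offset_delay(t1=0, t2=600, t3=1000, t4=1400)
print(off, d)  # 100.0 500.0
```

Any asymmetry between the forward and return paths shows up directly as offset error, halved; that is what the transparent-clock corrections feed back into.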
Here are the stats from a random embedded device running PTP I just accessed a second ago:
Reference ID : 50545030 (PTP0)
Stratum : 1
Ref time (UTC) : Sun Dec 28 02:47:25 2025
System time : 0.000000029 seconds slow of NTP time
Last offset : -0.000000042 seconds
RMS offset : 0.000000034 seconds
Frequency : 8.110 ppm slow
Residual freq : -0.000 ppm
Skew : 0.003 ppm
So this embedded ARM device, which is not special in any way, is maintaining time within +-35ns of the grandmaster, and is currently within 30ns of GPS time. The card does not have an embedded hardware PTP clock, but it does do hardware timestamping and filtering.
This grandmaster is an RPI with an intel chipset on it and the PPS input pin being used to discipline the chipset's clock. It stays within +-2ns (usually +-1ns) of GPS time.
Obviously, holdover sucks, but not the point :)
This qualifies as better-than-NTP for sure, and this setup has no network support. No transparent clocks, etc. These machines have multiple media transitions involved (fiber->ethernet), etc.
The main thing transparent clock support provides in practice is dealing with highly variable delay. Either from mode of transport, number of packet processors in between your nodes, etc. Something that causes the delay to be hard to account for.
The ethernet packet processing in ethernet mode is being handled in hardware by the switches and basically all network cards. IP variants would probably be hardware assisted but not fully offloaded on all cards, and just ignored on switches (assuming they are not really routers in disguise).
The hardware timestamping is being done in the card (and the vast majority of ethernet cards have supported PTP hardware timestamping for >1 decade at this point), and works perfectly fine with deep CPU sleep states.
Some don't do hardware filtering, so they're essentially processing more packets than necessary, but .....
We shouldn’t impose a universal timeline just because some future operation might depend on some past one. Dependencies should be explicit and local: if two operations interact, they share a causal scope; if they don’t, they shouldn’t pay the cost of coordination.
> Here’s a video of me explaining this.
Do you need a video? Do we need a 42 minute video to explain this?
I generally agree with Feynman on this stuff. We let explanations be far more complex than they need to be for most things, and it makes the hunt for accidental complexity harder because everything looks almost as complex as the problems that need more study to divine what is actually going on there.
For Spanner to be useful they needed a high transaction rate and in a distributed system that requires very tight grace periods for First Writer Wins. Tighter than you can achieve with NTP or system clocks. That’s it. That’s why they invented a new clock.
Google puts it this way:
Under external consistency, the system behaves as if all transactions run sequentially, even though Spanner actually runs them across multiple servers (and possibly in multiple datacenters) for higher performance and availability.
But that’s a bit thick for people who don’t spend weeks or years thinking about distributed systems.