Summarized details here:
https://www.slideshare.net/brendangregg/performance-tuning-e...
[1] Long term issues like inter-clock drift and global synchronization are a rather different problem area, and the OS has tools to help there.
A pure TSC implementation will sacrifice accuracy (because it's not being trained by the HPET or corrected by NTP), performance (because it'll need to do a full syscall occasionally), or both.
If you're sophisticated like Netflix you can probably assure yourself it's no big deal, but it's a bad idea for others to blindly do the same thing. Look at the issue with Go's timeouts: Go used gettimeofday() rather than CLOCK_MONOTONIC because the authors assumed the clock-skewing behavior of Google's own servers. That assumption broke spectacularly for the many people not running on Google's infrastructure.
We have had a number of clock issues, and one of the first things I try is taking an instance and switching it back to xen for a few days, but those issues have not turned out to be the clocksource. Usually NTP.
AWS can comment more about the state (safety/risk) of these clocksources (given they have access to all the SW/HW internals).
Roughly 10 years ago, when I was the driver author for one of the first full-speed 10GbE NICs, we'd get complaints from customers who were sure our NIC could not do 10Gb/s, as iperf showed it was limited to 3Gb/s or less. I would ask them to re-try with netperf, and they'd see full bandwidth. I eventually figured out that the complaints were coming from customers running distros without the vdso stuff, and/or running other OSes which (at the time) didn't support it (Mac OS, FreeBSD). It turns out the difference was that iperf would call gettimeofday() around every socket write to measure bandwidth, while netperf would just issue gettimeofday() calls at the start and the end of the benchmark, so iperf was effectively gettimeofday-bound. Ugh.
haha, it's amazing how much software is written that basically does something ridiculous like "while (gettimeofday()) clock_gettime();".
Apart from gettimeofday() other "favourites" of mine that people are often blind to include apps that do lots of unnecessary stat-ing of files, as well as tiny read()/write()'s instead of buffering in userspace.
This is a big speed hit. Some programs can use gettimeofday extremely frequently - for example, many programs call timing functions when logging, performing sleeps, or even constantly during computations (e.g. to implement a poor-man's computation timeout).
The article suggests changing the time source to tsc as a workaround, but also warns that it could cause unwanted backwards time warps - making it dangerous to use in production. I'd be curious to hear from those who are using it in production how they avoided the "time warp" issue.
4.5x longer = 350% slower (i.e. 450% of the baseline time, which is 100% + 350%).
Just say the native calls take 22% (roughly 1/4.5) of the time they do on EC2, or that the EC2 calls take 450% of the time of their native counterparts.
"Faster" and "slower" combined with percentages are rife with confusion. Please don't use them.
This is what's usually considered the "root cause" of this problem, though. It's easy enough, if it's your own program, to wrap the OS time APIs to cache the evaluated timestamp for one event-loop iteration (or for a given length of real time, by checking against the TSC). Most modern interpreters/VM runtimes also do this.
1) first, by eliminating the need for a context switch for libc calls such as gettimeofday(), gethrtime(), etc. (there is no public/supported interface on Solaris for syscalls, so libc would be used)
2) by providing additional, specific interfaces with certain guarantees:
https://docs.oracle.com/cd/E53394_01/html/E54766/get-sec-fro...
This was accomplished by creating a shared page at system startup, which the kernel keeps updated with the current time. At process exec time that page is mapped into every process's address space.
Solaris' libc was of course updated to simply read directly from this memory page. This is more practical on Solaris because libc and the kernel are tightly integrated, and because system calls are not public interfaces, but it seems greatly preferable to the VDSO mechanism.
The difference is that on Solaris, since there is no public system call interface, there's also no need for a fallback. Every program is just faster, no matter how Solaris is virtualized, since every program is using libc.
There's also no need for an administrative interface to control clocksource; the best one is always used.
- comm_page (usr/src/uts/i86pc/ml/comm_page.s) is literally a page in kernel memory with specific variables that is mapped (usr/src/uts/intel/ia32/os/comm_page_util.c) as user|read-only (to be passed to userspace, kernel mapping is normal data, AFAICT)
- the mapped comm_page is inserted into the aux vector at AT_SUN_COMMPAGE (usr/src/uts/common/exec/elf/elf.c)
- libc scans auxv for this entry and stashes the pointer it contains (usr/src/lib/libc/port/threads/thr.c)
- When clock_gettime is called, it looks at the values in the COMMPAGE (structure is in usr/src/uts/i86pc/sys/comm_page.h, probing in usr/src/lib/commpage/common/cp_main.c) to determine if TSC can be used.
- If TSC is usable, libc uses the information there (a bunch of values) to use tsc to read time (monotonic or realtime)
Variables within comm_page are treated like normal variables and used/updated within the kernel's internal timekeeping.
Essentially, rather than having the kernel provide an entry point (so that, as in the Linux case, only the kernel needs to know what its internal data structures look like), here libc provides the code and reads the exported data structure from the kernel.
So it isn't reading the time from this memory page, it's using TSC. In the case of CLOCK_REALTIME, corrections that are applied to TSC are read from this memory page (comm_page).
This summary only applies to Illumos. The Solaris implementation diverged significantly around build 167 (2011) long after the last OpenSolaris build Illumos was based on (build 147). It changed again significantly in 2015.
In 2016, I believe, Circonus contributed an alternate implementation that does some of the same things as Solaris:
https://www.circonus.com/2016/09/time-but-faster/
With that said, you are correct that whether or not it will read from a memory page instead depends on which interfaces you are using (e.g. get_hrusec()) and other subtle details.
[1]: https://blog.packagecloud.io/eng/2016/04/05/the-definitive-g...
EDIT: B would, of course, take 100% longer than A, rather than be 100% faster.
- You have a stable hardware TSC (you can check this in /proc/cpuinfo on the host, but all reasonably recent hardware should support this).
- The host has the host-side bits of the KVM pvclock enabled.
As long as you meet those two conditions, KVM should support fast vDSO-based time calls.
Only if you're using Linux guests, and assuming vDSO, so not really. The headline first made me think of issues with the host/virtual hardware and some syscalls being much slower than normal across the board.
I expect there are many such patches that you could use to narrow down the version range of the host kernel. Once you have that information, you may be in a better position to exploit it, knowing which bugs are and are not patched.
blog ~ touch test.c
blog ~ nano test.c
blog ~ gcc -o test test.c
blog ~ strace -ce gettimeofday ./test
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
0.00 0.000000 0 100 gettimeofday
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                   100           total

There are patches floating around to support vDSO timing on Xen.
But isn't AWS moving away from Xen or are they just moving away from Xen PV?
Go to your staging environment, use `strace -f -c -p $PID -e trace=clock_gettime` (or don't use -p and just launch the binary directly), replay a bit of production traffic against it, and then interrupt it and check the summary.
HTTP servers typically return a date header, often internally dates are used to figure out expiration and caching, and logging almost always includes dates.
It's incredibly easy to check the numbers of syscalls with strace, so you really should be able to get an intuition fairly easily by just playing around in staging.
this will very likely be calling time related system calls, especially clock_gettime with CLOCK_MONOTONIC.
I ran the test program on a Hyper-V VM running CentOS 7 and got the same result: 100 calls to the gettimeofday syscall. Conversely, I tested a vSphere guest (also running CentOS 7), which didn't call gettimeofday at all.
Looks like it's how the Xen hypervisor works.
https://news.ycombinator.com/item?id=13697555
It seems very closely related, unless I am mistaken.
This is about the speed of execution of the mentioned syscalls, which will be called regardless of the TZ environment variable, and how vDSO changes that. However, by setting the TZ environment variable you can avoid an additional stat() call made while libc tries to determine whether /etc/localtime exists.
A worse issue is that the counters may not be synchronized between CPUs, which can bite when the process migrates between sockets.
But I wouldn't call that "dangerous", it's simply a feature of the clock source. If that's an issue for your program, you should use CLOCK_MONOTONIC anyway and not rely on gettimeofday() doing the right thing.
I've worked on quite a few systems and can't think of a time where an API for getting the time would have been called so often that it would affect performance.
Apache and nginx for example, both call gettimeofday() a lot.
Edit: Quick google searches indicate software like redis and memcached also call it quite often.
Or, instead, you could just not do that. Then you could go back to being productive, instead of wasting time tracking down unstable small tweaks for edge cases that you can barely notice after looping the same syscall 5 million times in a row.
When will people learn not to micro-optimize?