What's really wrong here is that they're apparently spawning processes like crazy. Do they spawn a new process for each API call? That's like running CGI programs under Apache, like it's 1995.
It's like terrible Chinese router firmware, without the C. Bonus points: every straightforward way of running a throwaway command on Linux involves fork().
I guess it's a good thing because it sets us up for another blog post once they learn of the latency gains to be had when you are not creating new processes on API requests. Hell, when someone starts looking into how this Git thing works, we might be in for a whole series.
Putting arbitrary input into a shell is dangerous, as missed escaping can result in control of the shell.
When you call exec yourself, however, you pass the individual arguments as a NULL-terminated list of strings (char*). There is no shell to abuse. Calling a process this way is about as safe as calling a function that takes strings as arguments. The function can still have vulnerabilities, but the act of calling it is safe.
Parsing text data in an ad-hoc format that is non-standardized, undocumented, and undefined is really bad for security.
Just spawning a process creates as many security problems as it solves.
If it were done right, it would look like Chrome's architecture, where untrusted, isolated processes do the dangerous work but communicate with a trusted process via a well-defined IPC protocol.
... and for RAM usage. Java applications all have a tendency to bloat the longer you keep them running.
As zegerjan wrote, Gitaly is a Go/Ruby hybrid.
The main Go process doesn't use libgit2 (for now) because we didn't want to have to deal with cgo. We already know how to deal with C extensions in Ruby, and we have a lot of existing Ruby application code that uses libgit2, so we still use it there. And that code works fine so I don't see us removing it.
In practice, spawning a Git process is sometimes faster than using libgit2, so why not do that? Also, for parts of our workload (handling Git push/pull operations), spawning a one-off process (git-upload-pack) is the most boring, tried-and-true approach.
The Go component doesn't have libgit2 bindings yet, although we're looking into adding them later. That, or maybe go-git[3]. For now, though, Gitaly is mainly focused on migrating all Git calls out of the Rails monolith; not introducing a new component now reduces the project's risk.
[1]: https://github.com/libgit2/rugged/
[2]: https://gitlab.slack.com/archives/C027N716H/p151695430400026...
[3]: https://github.com/src-d/go-git/
$ time seq 1000 | while read; do sleep 0 & done
real 0m0.185s
user 0m0.546s
sys 0m0.265s
That's less than 0.2 ms to start a process. Processes give you operational control (CPU, memory, permissions, isolation, monitoring) that other constructs simply cannot. Decades ago, when we had far slower computers, people did process-oriented development and forked as if it was fine (CGI, make, git).
Somehow, separate processes came to be avoided like the plague, when in reality, they are probably the smallest resource "waste" in 99% of systems.
First of all, you're only benchmarking the time it takes for fork(2) to return in the parent subshell, nothing else. The new processes don't exist yet at that point, and certainly haven't exec'd (which tends to be why you fork in the first place).
Second, you're not measuring the cost at all. The forked children will, at some point, start executing on other CPUs, which includes finishing configuration and running exec, which takes time. The cost is the total cycles it takes before the child is executing the intended code.
Forks are damn expensive, but whether they're too expensive depends on the use case and on the cost of expanding hardware.
Fork time scales with the virtual memory of the forking process, and you're forking from a fresh subshell that hardly has anything allocated. It's even mentioned in the linked post that their issue stemmed from this (specifically fork lock contention spiking as fork time increased).
Calling exec() or spawn() in Node is therefore not asynchronous and can block your event loop for hundreds of milliseconds or even seconds as RSS increases.
I never understood why so many people use fork() instead of POSIX spawn(). For example, OpenJDK (Java) also uses it as the default for starting a process, which leads to interesting results on an OS that does not overcommit memory, like Solaris: since the process briefly doubles its memory commitment with fork(), your process can die with an out-of-memory error.
The low-level syscall ABI is architecture-dependent.
Then, shock horror, they realize that running a throwaway command fork()s the main process. But now everyone is too angsty to change it, because someone out there might rely on the environment-copying behavior, even though they shouldn't.
for example, here's the caveats section from the macOS fork man page:
There are limits to what you can do in the child process. To be totally safe you should restrict yourself to only executing async-signal safe operations until such time as one of the exec functions is called. All APIs, including global data symbols, in any framework or library should be assumed to be unsafe after a fork() unless explicitly documented to be safe or async-signal safe. If you need to use these frameworks in the child process, you must exec. In this situation it is reasonable to exec yourself.
That spells defeat :)

Earlier in the game, copy-on-write had to be created for the same reasons.
Threads throw a wrench in things, but fork() existed for decades before threads. O_CLOEXEC etc. help, and lots of command-line utilities don't use threads.
fork() isn't the fastest way - but in many situations it's not a problem, it's just convenient. In that respect it's somewhat like using python when you could have used go.
That means the child and parent processes share memory (until exec() is performed).
Especially if the parent process is multi-threaded, this avoids a whole lot of copy-on-write page faults that fork() would otherwise incur when another thread touches memory in the window between fork() and the child's exec().
Code: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/uni...
> A bug in Go <1.9 was causing a 30x slowdown in our Gitaly service.
Fork is not my favorite syscall:
Is the migration path that tough ?
We are working on moving the git layer to Gitaly[0] which is written in Go (and is what this blog post is about). It was one of our major bottlenecks and we've seen a lot of benefit from having made the switch. It's not done yet, but a lot of the calls to git that the application makes are now done through Gitaly.
https://gitlab.com/gitlab-org/gitaly/blob/master/internal/se...
Yet apparently nobody either caught or investigated the latency spike after the previous deployment.
First of all, I was somewhat confused by that due to the availability of copy-on-write; I wouldn't have expected fork/exec time to scale up that way.
Second, I was surprised that there wasn't an attempt to explain the behavior difference between the two systems. Can someone familiar with either or both point towards an explanation for why that's the case? It seems very odd.