1) Tell the kernel it only has a limited set of cores to work with.
The way to fix Snort’s jitter issues is to change the Linux boot parameters. For example, set “maxcpus=2”. This will cause Linux to use only the first two CPUs of the system. Sure, it knows other CPU cores exist, it just will never by default schedule a thread to run on them.
2) Manually schedule your high priority process onto a reserved core.
Then what you do in your code is call the “pthread_setaffinity_np()” function to put your thread on one of the inactive CPUs (there is a Snort configuration option to do this per process). As long as you manually put only one thread per CPU, it will NEVER be interrupted by the Linux kernel.
3) Turn off interrupts to keep things as real time as possible.
You can still get hardware interrupts, though. Interrupt handlers are really short, so they probably won’t exceed your jitter budget, but if they do, you can tweak that as well. Go into “/proc/irq/<n>/smp_affinity” and steer each interrupt away from the CPUs running your Snort processing threads.
4) Profit?
At this point, I’m a little hazy at what precisely happens. What I think will happen is that your thread won’t be interrupted, not even for a clock cycle.
Can anyone remove the haziness? I'm more interested in this for benchmarking than performance, and wonder how it compares to other ways of increasing priority like "chrt". Is booting with a low "maxcpus" necessary, or can the same be done at runtime?
Affinity is definitely valuable, but I don't think you should need to disable interrupts for most applications. I'm not even sure it is generally possible, since interrupts are sometimes used to swap pages, notify the kernel that a page isn't mapped, or signal that a blocked resource is now available. The reason affinity is valuable is not because of kernel interactions; it's because of NUMA and CPU cache swapping. Affinity can prevent thread migration, which is expensive mainly because data also has to be migrated or else accessed in a less efficient manner. Likewise, make sure that if you dispatch an asynchronous call, the handler runs on the same core you sent the call from.
Finally, it's a common fallacy in these kinds of posts to act as if threads can't be used to do shared-little or shared-nothing-style multi-programming. They often aren't, but there's nothing that prevents it.
Correct. man 7 pthreads states:
In NPTL, thread synchronization primitives (mutexes, thread joining, and so on) are implemented using the Linux futex(2) system call.
A futex ("fast userspace mutex") is a locking primitive that only makes a system call when there is contention; you can google it.
It's also just better, because something lower priority can run whenever the thread with real-time priority suspends (e.g., waiting for disk or network I/O).
I don't know the process for making sure interrupts don't happen on a particular CPU in Linux, but if you do that, you can only be interrupted by SMIs (system management interrupts), which are done by the BIOS, not the OS. Some machines don't have SMIs, though. It's BIOS specific.
If you do have them, you can't necessarily tell by using the TSC (as the author wanted to do), because poorly-behaved BIOSes will attempt to hide that the SMI happened by fiddling with the TSC.
I haven't worked with chrt this way, but I have done the maxcpus approach mentioned in the article - with the explicit affinity combined with maxcpus you can be sure that a process/thread won't jump around to another free CPU if it goes out to disk and is then replaced with another process. From my cursory reading of chrt, I don't see anything about CPU affinity, so this is still a possibility, right? Am I missing a constraint in the Linux scheduler?
I suppose they're the same in that you can use both to basically tell the kernel to "not schedule anything that would interact with a given process in any way that would affect its execution timeline"
The author does gloss over the fact that if you have a hyperthreading-enabled processor, you want to make sure you either turn it off or avoid scheduling a pair of processes on the two sibling logical cores of the same physical core (a hyperthread pair), because you will see a slowdown: the two siblings actually share boatloads of execution state.
Arguably, this is already happening.
What the vast majority of users need in terms of speed is acute performance. A dedicated core for every webpage is not very useful because users don't load up a bunch of tabs at the same time and then flip between them many times per second, interacting with each one. Instead CPU usage tends to happen in short bursts directly following user interaction, so even if every tab has its own core, you won't usually see very much contention for CPU between different tabs. The same goes for most native programs. Users just don't multitask fast enough to make separate cores relevant for most programs.
Parallelism has two major issues:
First, not all applications need it. In many cases you just want to perform a series of operations on a single starting value and nothing else; if you are, for example, calculating a factorial and only need one factorial, there is little point in parallelising it.
Second, it is absurdly hard to write code for heavily parallelised hardware. Most coders will produce crap code that doesn't work, and no matter how good we get at making helper libraries, it is a totally different way of thinking.
Yes, for some things, like servers, where you can throw a user onto each core, it is nice... But for many other uses, even simple single-core parallelism, like SIMD, isn't very useful.
IPC has been steadily improving generation over generation, but it is a slow march. Chip frequency does not seem like it is going to go anywhere without some serious breakthroughs; processors are thermally limited, and while you can work on saving power there don't seem to be any 10x improvements in power coming that could let you crank up the clock. Scaling voltage down is great for reducing switching power, but it depends on smaller and smaller transistors, so leakage power has been steadily growing and eating into those gains.
Chips are up against a lot of walls- power consumption, heat dissipation, and so on. Chip makers have and are working on pushing forwards, but short of a new kind of transistor there do not appear to be any improvements by orders of magnitude on the horizon for single-core performance.
This is why parallelism is important. It is hard, and not every workload can be parallelised well, but there is simply no other known way to secure a 4x, 8x, 16x, etc boost in performance than 4x, 8x, 16x, etc parallelism. (Assuming your code isn't terrible, in which case fix your code!)
There are some possible ways forward. If we're sticking with transistors we might be able to switch to a material with better electrical properties, but that needs lots of research before it'll be higher performance than silicon.
There are also funky non-transistor based ideas for doing computation, like using DNA or nano-scale clockwork or ballistic electrons. I have no idea how feasible that stuff is.
In the shorter term, computer engineers are finding ways to turn those extra transistors into better single threaded performance by better prefetching, branch prediction, and re-ordering your instructions so that more than one are executed at a time even though you never thought about parallelism when writing it. That's why a modern computer core is much faster than an old Pentium 4, even though the clock speeds might be the same.
The problem with that is that using more transistors tends to provide at most an O(sqrt(n)) speedup, whereas adding more cores potentially provides an O(n) speedup.
Sorry for nitpicking, but calculating factorial can certainly be parallelized. Easiest way to do this is multiply every n-th number on each core and then multiply n results together.
You can overclock an old Core 2 higher than that, with a bit of luck and good gear. And it'll undoubtedly be cheaper than buying an IBM 'frame.
FWIW, overclocking records are currently above 8GHz on Vishera using LN2 (liquid nitrogen) cooling. Granted, LN2 can't actually be used day to day, but it gives you the limits of the chips. You can reach 5.5GHz on water, and 6+ on cascade or single stage (which do work as standard cooling solutions).
There had to be a better way to write that. I suppose more work per clock cycle and increased number of cores contributes the other x10 of raw performance. But then the author goes on to say they are stuck, which isn't true of performance, only clock rate. In any event, putting an "up is down" in your sentence should generally be avoided.
Edit: The >>>proscribed<<< method for resolving this is a “lock”, where… Sigh.
The article covers a lot of ground lightly. It talks about the new Haswell transactional memory instructions, the way Linux shards network counters, and a way to make Linux not use a core so you can schedule a process on it that will never be preempted.
It looks like Snort gave up on one way of doing multi-threading, but they could still go the way suggested in the OPs post.
[1] http://securitysauce.blogspot.com/2009/04/snort-30-beta-3-re...
It's one of those "everyone knows multi-threaded is better than single-threaded" things, but everyone's wrong.
(a) pthread_mutex_t and friends use futexes, which only call into the kernel when there actually is contention.
(b) it would be better to use chrt (change to real-time priority) than the maxcpus trick, because the former accomplishes the same thing, but allows the core to still be used if the high-priority thread suspends (e.g. to do disk or network I/O).
(c) Contrary to his claim about Snort, there is no reason to prefer a multiprocess design over a multithreaded design for a particular application. There is no savings in overhead or synchronization or anything like that by going with processes. In fact, using processes and then using memory mapping to share when you could use threads, is just making things harder for yourself for no reason.
(d) What I’m trying to show you here is that “multi-core” doesn’t automatically mean “multi-threaded”. Well, in computer science terminology, a thread is a schedulable entity, and a process is a schedulable entity with memory protection. So, he's wrong. Lots of developers talk about threads and processes as orthogonal things, though, so I can see why he made that claim.
(e) The overall theme of my talk was to impress upon the audience that in order to create scalable application, you need to move your code out of the operating system kernel. You need to code everything yourself instead of letting the kernel do the heavy lifting for you. That is horrible advice that is just going to lead to lots of bugs and wasted effort. It's premature optimization. Even most people using the Linux realtime preemption patch (PREEMPT_RT) do not have such strict requirements that they need to take this advice.
(f) Your basic design should be one thread per core and lock-free synchronization that never causes a thread to stop and wait. Might apply to certain very specific real-time (as in, embedded systems or HFT) scenarios, but in general, no, you're just wasting the core when that one thread doesn't need to use it. Prefer real-time priorities if you really need it.
(g) Specifically, I’ve tried to drill into you the idea that what people call “multi-threaded” coding is not the same as “multi-core”. Multi-threaded techniques, like mutexes, don’t scale on multi-core. Again, you can only use multiple cores in parallel if there are multiple threads. And multi-threaded techniques do scale. You definitely may want to use lock-free synchronization instead of mutexes in some specialized scenarios, though.
EDIT: OK, here is one other thing I forgot in the list above.
(h) There are multiple types of locks, like spinlocks, mutexes, critical sections, semaphores, and so on. Even among these classes there are many variations. Technically, mutexes and semaphores both are ways of protecting critical sections, and a spinlock is a way of implementing a lock (including, possibly, a mutex or semaphore lock). Again, this is to some degree the difference between developers with a shared lingo and computer scientists. But if you go by that kind of lingo, you're missing part of the picture.
I tried to benchmark the context switching of user-level threads vs kernel-level threads [0]. Specifically, libtask vs pthread. I assumed, pthread_yield would be very expensive, because it has to do a syscall into the kernel. Something like an order of magnitude or two. Interestingly, with two threads switching back and forth was only two times as slow as the user-level threads. Linux futexes seem to be really clever. Use them!
On the other hand, pthread_yield scales linearly, while libtask has a constant time yield. So for concurrency without parallelism [1] and lots of threads, the libtask library can be recommended.
[0] https://bitbucket.org/qznc/switchcost/src
[1] http://existentialtype.wordpress.com/2011/03/17/parallelism-...
(b) The entire point is that for really scalable network applications, you don't want the high-priority thread to suspend for things like disk IO. Really, you read all that and expected the thread to use blocking IO instead of asynchronous IO???
(c) The point wasn't that Snort's multi-process design is better, only that it's acceptable. It's a single-threaded design written in the 1990s that has lots of quirks that make it hard to convert to multi-threaded operation. The point is that you can still get it to multi-core without having to make it multi-threaded.
(d) How is "multi-core doesn't mean multi-threaded" wrong??? Snort is a multi-core app today and isn't multi-threaded.
(e) I keep repeating the claim because my code written in user mode scales better than the code of people who disagree. My code scales to 20 million connections, 20 Gbps, 20 million packets/second. What does your code scale to?
(f) "Real-time" is different than the "network" apps I'm talking about. The entire idea is "control plane" vs. "data plane" separation. Control operations that need real-time guarantees are very different than high-throughput data operations.
(g) Multi-threaded techniques that try to interleave multiple threads on a single core suck for high network throughput. Just ask Apache.
(d) How is "multi-core doesn't mean multi-threaded" wrong??? Snort is a multi-core app today and isn't multi-threaded.
This is, again, between whether you want to be "computer science correct," or use developer lingo that is not even necessarily consistent among all developer and OS subcultures. I think saying it the way you are saying it, instead of "You could use multiple processes to get parallelism, instead of multiple threads," is misleading.
What does your code scale to?
Well, I'm a real-time guy, so my thread scales down to microsecond overheads.
Did you change the text in your post that explains pthread_mutex_t? If not, I probably made a mistake to pick on that, because what you have there is pretty clear.
I'm not convinced that the article has it exactly wrong. There are obvious problems, like all the extra moving parts your code has. But what about runtimes that multiplex their threads (or what have you) on to one host thread + scheduler per core? It seems to fit the description, and it has been taken up by Go and Rust and probably others (what is Erlang's multicore story these days?) It also avoids the obvious problem, since the extra moving parts are in the runtime, not your application (much like they were in the kernel, not your application in the standard multi-threaded case.)
I still don't know if it saves you much. IIUC, the choice to multiplex is motivated by other reasons.
In general, if you're using any other programming paradigm (i.e., 90+% of all software), you don't need tons of cheap threads; your application probably only needs at most one per core. In that case, you aren't constantly context switching (except to legitimately multitask with other programs), and you aren't using up that much kernel memory to hold thread state (because you only have a few threads, not a ton). So, you should just use the kernel as it was intended.
My assumption in reading the article is that the author was very much talking about the 90+% case (he didn't really say what he was talking about in the article).
There actually is another case though, which I think is what the author is really getting at. What if you have a small number of threads that need to be extremely high performance, and they have extremely short critical sections (which is not necessarily the common case across most applications)? Then, you would not want your threads to constantly suspend in the kernel every time there is a tiny bit of contention. You'd rather have them spin for a few cycles and just wait until they can actually get the lock... or just use hardware features (like the compare and exchange instructions) to abstract away the issue.
To do either of those, yeah, you pretty much need to code it yourself or (better) get a third-party library. That said, pthreads does actually have a spinlock option that never suspends (unlike a futex-backed mutex, which can): pthread_spinlock_t, via pthread_spin_init()/pthread_spin_lock().
In fact, formally speaking, there are lock-free algorithms, which don't have a lock but can require an unbounded number of retries before a thread gets to access the data. And then you have wait-free algorithms, which guarantee access to the data within a bounded number of steps.
Cache conflicts happen between processes in the same way, and for the same reasons, and just as often, as they do between threads.
I must have missed something.
Multi-tasking is multiple processes, which mostly have nothing to do with each other, being switched in and out of the processor(s) by the OS; they do not share in-process memory or context. The programmer does nothing to make this happen, and normally has little to no say in it.
Multi-threading is a single process, where the threads carefully share context and memory, and they're all working roughly on the same thing; the programmer makes this happen explicitly, and usually fucks it up.
Is there any basis for this assertion, or is it just that his system has only 4 cores?
It's a general principle of why mobile phone CPUs aren't putting a lot more cores in their devices, and why they didn't go the Atom route of hyperthreading: code on cellphones fails to take advantage of the additional cores.
There is no reason why multi-threaded software can't scale near linearly for the right problems.
Networking is actually a "right" problem, and should be nearly embarrassingly parallel, since two cores can process two unrelated packets at the same time. But if you look at network stacks in open-source projects, you see a lot of fail. That's the point of my post: a reference to point to for why that spinlock in your networking code is a bad idea, and why it's probably better to replace it with an atomic operation or a lock-free alternative.
For the other problems, you're limited by Amdahl.
Why not just implement the pipeline entirely in one thread, and then replicate them (just like worker threads)?
What will happen is that the first worker thread will be executing stage 2, while the second worker thread is executing stage 1. The OS will automatically schedule them on different cores.
Am I missing something?
Even so-called "lock-free" synchronization has locks; they are just very short (around 30 clock cycles). Therefore, you still want to avoid contention if you can figure out a way to do it.
I didn't really go into enough detail in my example, but pulling packets off the network is a good example. You can have one thread do it, and therefore need no contention. Then you can setup multiple single-producer/single-consumer ring-buffers to forward those packets to worker threads to complete the processing of the packet. Thus, you essentially get rid of all the atomic/lock-free contention you would otherwise have.
Right, and that's what I'm suggesting. So, if you have a pipelined architecture, keep the pipeline inside worker threads, instead of across them, except when you need to distribute work to the workers (i.e., the first stage, where you do something like take packets from the network). I think we agree on all that. I was just curious if there was ever a reason to do it the other way, i.e., having a separate thread for each stage of the entire pipeline. It seemed like you were suggesting that was useful in some cases, but perhaps I'm reading into things too much.
Basically, what we're saying is that, first, there is a stage that cannot be accessed concurrently, and second, multiple threads may be waiting for that stage to complete before they progress.
So the downstream, waiting threads are either going to have to wait to acquire the lock, or they're going to have to wait for the producer to produce the data.
Either way, they're waiting.
If the critical section is really short and highly contended, you may not want threads waiting on the lock to suspend; you want them to spin, or to keep retrying (which is what happens in lock-free algorithms). OK, fine. Why isn't that solution just as good as having an extra thread to be the producer, and then waiting on the producer to produce more data?
Multi-threaded techniques, like mutexes, don’t scale on multi-core. Conversely, as Snort demonstrates, you can split a problem across multiple processes instead of threads, and still have multi-core code.
Synchronization is synchronization. There are inter-process synchronization primitives, including mutexes. And you can use lock-free synchronization in a single-process multi-threaded scenario.