1) Tell the kernel it only has a limited set of cores to work with.
The way to fix Snort’s jitter issues is to change the Linux boot parameters. For example, set “maxcpus=2”. This will cause Linux to use only the first two CPUs of the system. Sure, it knows other CPU cores exist, it just will never by default schedule a thread to run on them.
2) Manually schedule your high priority process onto a reserved core.
Then what you do in your code is call the “pthread_setaffinity_np()” function to put your thread on one of the inactive CPUs (there is a Snort configuration option to do this per process). As long as you manually put only one thread per CPU, it will NEVER be interrupted by the Linux kernel.
3) Turn off interrupts to keep things as real time as possible.
You can still get hardware interrupts, though. Interrupt handlers are really short, so they probably won’t exceed your jitter budget, but if they do, you can tweak that as well. Go into “/proc/irq/<n>/smp_affinity” and steer each interrupt away from the CPUs running your Snort processing threads.
4) Profit?
At this point, I’m a little hazy at what precisely happens. What I think will happen is that your thread won’t be interrupted, not even for a clock cycle.
Can anyone remove the haziness? I'm more interested in this for benchmarking than performance, and wonder how it compares to other ways of increasing priority like "chrt". Is booting with a low "maxcpus" necessary, or can the same be done at runtime?
Affinity is definitely valuable, but I don't think you should need to disable interrupts for most applications. I'm not even sure it is generally possible, since interrupts are sometimes used to swap pages, notify the kernel that a page isn't mapped, or signal that a blocked resource is now available. The reason affinity is valuable is not because of kernel interactions; it's because of NUMA and CPU cache swapping. Affinity can prevent thread migration, which is expensive mainly because data also has to be migrated or else accessed in a less efficient manner. Likewise, make sure that if you dispatch an asynchronous call, the handler runs on the same core you sent the call from.
Finally, it's a common fallacy in these kinds of posts to act as if threads can't be used to do shared-little or shared-nothing-style multi-programming. They often aren't, but there's nothing that prevents it.
Correct. man 7 pthreads states:
In NPTL, thread synchronization primitives (mutexes, thread joining, and so on) are implemented using the Linux futex(2) system call.
A futex ("fast userspace mutex") is a locking primitive that only makes a system call when there is contention; you can google it.
It's also just better, because something lower priority can run whenever the thread with real-time priority suspends (e.g., waiting for disk or network I/O).
I don't know the process for making sure interrupts don't happen on a particular CPU in Linux, but if you do that, you can only be interrupted by SMIs (system management interrupts), which are done by the BIOS, not the OS. Some machines don't have SMIs, though. It's BIOS specific.
If you do have them, you can't necessarily tell by using the TSC (as the author wanted to do), because poorly-behaved BIOSes will attempt to hide that the SMI happened by fiddling with the TSC.
I haven't worked with chrt this way, but I have done the maxcpus approach mentioned in the article - with the explicit affinity combined with maxcpus you can be sure that a process/thread won't jump around to another free CPU if it goes out to disk and is then replaced with another process. From my cursory reading of chrt, I don't see anything about CPU affinity, so this is still a possibility, right? Am I missing a constraint in the Linux scheduler?
I suppose they're the same in that you can use both to basically tell the kernel to "not schedule anything that would interact with a given process in any way that would affect its execution timeline"
The author does gloss over the fact that if you have a hyperthreading-enabled processor, you want to make sure you either turn it off or avoid scheduling a pair of processes on the two sibling logical cores of the same physical core (a hyperthread pair), because you will see a slowdown: the two siblings actually share boatloads of execution state.
Arguably, this is already happening.
What the vast majority of users need in terms of speed is acute performance. A dedicated core for every webpage is not very useful because users don't load up a bunch of tabs at the same time and then flip between them many times per second, interacting with each one. Instead CPU usage tends to happen in short bursts directly following user interaction, so even if every tab has its own core, you won't usually see very much contention for CPU between different tabs. The same goes for most native programs. Users just don't multitask fast enough to make separate cores relevant for most programs.
Parallelism has two major issues:
First, not all applications need it. In many cases you just want to perform a series of operations on a single starting value and nothing else; if you are, for example, calculating a factorial and only need one factorial, there is little point in parallelising it.
Second, it is absurdly hard to write code for heavily parallelised hardware. Most coders will produce crap code that doesn't work, and no matter how good we get at making helper libraries, it is a totally different way of thinking.
Yes, for some things, like servers, where you can throw a user onto each core, it is nice... But for many other uses, even simple single-core parallelism, like SIMD, isn't very useful.
IPC has been steadily improving generation over generation, but it is a slow march. Chip frequency does not seem like it is going to go anywhere without some serious breakthroughs; processors are thermally limited, and while you can work on saving power there don't seem to be any 10x improvements in power coming that could let you crank up the clock. Scaling voltage down is great for reducing switching power, but it depends on smaller and smaller transistors, so leakage power has been steadily growing and eating into those gains.
Chips are up against a lot of walls- power consumption, heat dissipation, and so on. Chip makers have and are working on pushing forwards, but short of a new kind of transistor there do not appear to be any improvements by orders of magnitude on the horizon for single-core performance.
This is why parallelism is important. It is hard, and not every workload can be parallelised well, but there is simply no other known way to secure a 4x, 8x, 16x, etc boost in performance than 4x, 8x, 16x, etc parallelism. (Assuming your code isn't terrible, in which case fix your code!)
There are some possible ways forward. If we're sticking with transistors we might be able to switch to a material with better electrical properties, but that needs lots of research before it'll be higher performance than silicon.
There are also funky non-transistor based ideas for doing computation, like using DNA or nano-scale clockwork or ballistic electrons. I have no idea how feasible that stuff is.
In the shorter term, computer engineers are finding ways to turn those extra transistors into better single threaded performance by better prefetching, branch prediction, and re-ordering your instructions so that more than one are executed at a time even though you never thought about parallelism when writing it. That's why a modern computer core is much faster than an old Pentium 4, even though the clock speeds might be the same.
The problem with that is that using more transistors tends to provide at most an O(sqrt(n)) speedup, whereas adding more cores potentially provides an O(n) speedup.
Sorry for nitpicking, but calculating factorial can certainly be parallelized. Easiest way to do this is multiply every n-th number on each core and then multiply n results together.
You can overclock an old Core 2 higher than that, with a bit of luck and good gear. And it'll undoubtedly be cheaper than buying an IBM 'frame.
FWIW, overclocking records are currently above 8GHz on Vishera using LN2 (liquid nitrogen) cooling. Granted, LN2 can't actually be used day to day, but it gives you the limits of the chips. You can reach 5.5GHz on water, and 6+ on cascade or single stage (which do work as standard cooling solutions).
There had to be a better way to write that. I suppose more work per clock cycle and increased number of cores contributes the other x10 of raw performance. But then the author goes on to say they are stuck, which isn't true of performance, only clock rate. In any event, putting an "up is down" in your sentence should generally be avoided.
Edit: The >>>proscribed<<< method for resolving this is a “lock”, where… Sigh.
The article covers a lot of ground lightly. It talks about the new Haswell transactional memory instructions, the way Linux shards network counters, and a way to make Linux not use a core so you can schedule a process on it that will never be preempted.
It looks like Snort gave up on one way of doing multi-threading, but they could still go the way suggested in the OPs post.
[1] http://securitysauce.blogspot.com/2009/04/snort-30-beta-3-re...
It's one of those "everyone knows multi-threaded is better than single-threaded" things, but everyone's wrong.
(a) pthread_mutex_t and friends use futexes, which only call into the kernel when there actually is contention.
(b) it would be better to use chrt (change to real-time priority) than the maxcpus trick, because the former accomplishes the same thing, but allows the core to still be used if the high-priority thread suspends (e.g. to do disk or network I/O).
(c) Contrary to his claim about Snort, there is no reason to prefer a multiprocess design over a multithreaded design for a particular application. There is no savings in overhead or synchronization or anything like that by going with processes. In fact, using processes and then using memory mapping to share when you could use threads, is just making things harder for yourself for no reason.
(d) What I’m trying to show you here is that “multi-core” doesn’t automatically mean “multi-threaded”. Well, in computer science terminology, a thread is a schedulable entity, and a process is a schedulable entity with memory protection. So, he's wrong. Lots of developers talk about threads and processes as orthogonal things, though, so I can see why he made that claim.
(e) The overall theme of my talk was to impress upon the audience that in order to create scalable application, you need to move your code out of the operating system kernel. You need to code everything yourself instead of letting the kernel do the heavy lifting for you. That is horrible advice that is just going to lead to lots of bugs and wasted effort. It's premature optimization. Even most people using the Linux realtime preemption patch (PREEMPT_RT) do not have such strict requirements that they need to take this advice.
(f) Your basic design should be one thread per core and lock-free synchronization that never causes a thread to stop and wait. Might apply to certain very specific real-time (as in, embedded systems or HFT) scenarios, but in general, no, you're just wasting the core when that one thread doesn't need to use it. Prefer real-time priorities if you really need it.
(g) Specifically, I’ve tried to drill into you the idea that what people call “multi-threaded” coding is not the same as “multi-core”. Multi-threaded techniques, like mutexes, don’t scale on multi-core. Again, you can only use multiple cores in parallel if there are multiple threads. And multi-threaded techniques do scale. You definitely may want to use lock-free synchronization instead of mutexes in some specialized scenarios, though.
EDIT: OK, here is one other thing I forgot in the list above.
(h) There are multiple types of locks, like spinlocks, mutexes, critical sections, semaphores, and so on. Even among these classes there are many variations. Technically, mutexes and semaphores both are ways of protecting critical sections, and a spinlock is a way of implementing a lock (including, possibly, a mutex or semaphore lock). Again, this is to some degree the difference between developers with a shared lingo and computer scientists. But if you go by that kind of lingo, you're missing part of the picture.
I tried to benchmark the context switching of user-level threads vs kernel-level threads [0]. Specifically, libtask vs pthread. I assumed, pthread_yield would be very expensive, because it has to do a syscall into the kernel. Something like an order of magnitude or two. Interestingly, with two threads switching back and forth was only two times as slow as the user-level threads. Linux futexes seem to be really clever. Use them!
On the other hand, pthread_yield scales linearly, while libtask has a constant time yield. So for concurrency without parallelism [1] and lots of threads, the libtask library can be recommended.
[0] https://bitbucket.org/qznc/switchcost/src
[1] http://existentialtype.wordpress.com/2011/03/17/parallelism-...
(b) The entire point is that for really scalable network applications, you don't want the high-priority thread to suspend for things like disk IO. Really, you read all that and expected the thread to use blocking IO instead of asynchronous IO???
(c) The point wasn't that Snort's multi-process design is better, only that it's acceptable. It's a single-threaded design written in the 1990s that has lots of quirks that make it hard to convert to multi-threaded operation. The point is that you can still get it to multi-core without having to make it multi-threaded.
(d) How is "multi-core doesn't mean multi-threaded" wrong??? Snort is a multi-core app today and isn't multi-threaded.
(e) I keep repeating the claim because my code written in user mode scales better than the code of people who disagree. My code scales to 20 million connections, 20 Gbps, 20 million packets/second. What does your code scale to?
(f) "Real-time" is different than the "network" apps I'm talking about. The entire idea is "control plane" vs. "data plane" separation. Control operations that need real-time guarantees are very different than high-throughput data operations.
(g) Multi-threaded techniques that try to interleave multiple threads on a single core suck for high network throughput. Just ask Apache.
(d) How is "multi-core doesn't mean multi-threaded" wrong??? Snort is a multi-core app today and isn't multi-threaded.
This is, again, between whether you want to be "computer science correct," or use developer lingo that is not even necessarily consistent among all developer and OS subcultures. I think saying it the way you are saying it, instead of "You could use multiple processes to get parallelism, instead of multiple threads," is misleading.
What does your code scale to?
Well, I'm a real-time guy, so my thread scales down to microsecond overheads.
Did you change the text in your post that explains pthread_mutex_t? If not, I probably made a mistake to pick on that, because what you have there is pretty clear.
I'm not convinced that the article has it exactly wrong. There are obvious problems, like all the extra moving parts your code has. But what about runtimes that multiplex their threads (or what have you) on to one host thread + scheduler per core? It seems to fit the description, and it has been taken up by Go and Rust and probably others (what is Erlang's multicore story these days?) It also avoids the obvious problem, since the extra moving parts are in the runtime, not your application (much like they were in the kernel, not your application in the standard multi-threaded case.)
I still don't know if it saves you much. IIUC, the choice to multiplex is motivated by other reasons.
In general, if you're using any other programming paradigm (i.e., 90+% of all software), you don't need tons of cheap threads; your application probably only needs at most one per core. In that case, you aren't constantly context switching (except to legitimately multitask with other programs), and you aren't using up that much kernel memory to hold thread state (because you only have a few threads, not a ton). So, you should just use the kernel as it was intended.
My assumption in reading the article is that the author was very much talking about the 90+% case (he didn't really say what he was talking about in the article).
There actually is another case though, which I think is what the author is really getting at. What if you have a small number of threads that need to be extremely high performance, and they have extremely short critical sections (which is not necessarily the common case across most applications)? Then, you would not want your threads to constantly suspend in the kernel every time there is a tiny bit of contention. You'd rather have them spin for a few cycles and just wait until they can actually get the lock... or just use hardware features (like the compare and exchange instructions) to abstract away the issue.
To do either of those, yeah, you pretty much need to code it yourself or (better) get a third-party library. That said, pthreads does actually have a spinlock option that never suspends (unlike a futex-backed mutex, which can): pthread_spinlock_t, via pthread_spin_init()/pthread_spin_lock().
In fact, formally speaking, there are lock-free algorithms, which don't have a lock but can require an unbounded number of retries before a thread gets to access the data. And then you have wait-free algorithms, which guarantee access to the data within a bounded number of steps.
Cache conflicts happen between processes in the same way, and for the same reasons, and just as often, as they do between threads.
I must have missed something.
Multi-tasking is multiple processes, which mostly have nothing to do with each other, being switched in and out of the processor(s) by the OS; they do not share in-process memory or context. The programmer does nothing to make this happen, and normally has little to no say in it.
Multi-threading is a single process, where the threads carefully share context and memory, and they're all working roughly on the same thing; the programmer makes this happen explicitly, and usually fucks it up.
Is there any basis for this assertion, or is it just that his system has only 4 cores?
It's a general principle of why mobile phone CPUs aren't putting a lot more cores in their devices, and why they didn't go the Atom route of hyperthreading: code on cellphones fails to take advantage of the additional cores.
There is no reason why multi-threaded software can't scale near linearly for the right problems.
Networking is actually a "right" problem, and should be nearly embarrassingly parallel, since two cores can process two unrelated packets at the same time. But if you look at network stacks in open-source projects, you see a lot of fail. That's the point of my post: a reference to point to for why that spinlock in your networking code is a bad idea, and why it's probably better to replace it with an atomic operation or a lock-free alternative.
For the other problems, you're limited by Amdahl.
Why not just implement the pipeline entirely in one thread, and then replicate them (just like worker threads)?
What will happen is that the first worker thread will be executing stage 2, while the second worker thread is executing stage 1. The OS will automatically schedule them on different cores.
Am I missing something?
Even so-called "lock-free" synchronization has locks; they are just very short (around 30 clock cycles). Therefore, you still want to avoid contention if you can figure out a way to do it.
I didn't really go into enough detail in my example, but pulling packets off the network is a good example. You can have one thread do it, and therefore need no contention. Then you can setup multiple single-producer/single-consumer ring-buffers to forward those packets to worker threads to complete the processing of the packet. Thus, you essentially get rid of all the atomic/lock-free contention you would otherwise have.
Right, and that's what I'm suggesting. So, if you have a pipelined architecture, keep the pipeline inside worker threads, instead of across them, except when you need to distribute work to the workers (i.e., the first stage, where you do something like take packets from the network). I think we agree on all that. I was just curious if there was ever a reason to do it the other way, i.e., having a separate thread for each stage of the entire pipeline. It seemed like you were suggesting that was useful in some cases, but perhaps I'm reading into things too much.
Basically, what we're saying is that, first, there is a stage that cannot be accessed concurrently, and second, multiple threads may be waiting for that stage to complete before they progress.
So the downstream, waiting threads are either going to have to wait to acquire the lock, or they're going to have to wait for the producer to produce the data.
Either way, they're waiting.
If the critical section is really short and highly contended, you may not want threads waiting on the lock to suspend; you want them to spin, or to keep retrying (which is what happens in lock-free algorithms). OK, fine. Why isn't that solution just as good as having an extra thread to be the producer, and then waiting on the producer to produce more data?
Multi-threaded techniques, like mutexes, don’t scale on multi-core. Conversely, as Snort demonstrates, you can split a problem across multiple processes instead of threads, and still have multi-core code.
Synchronization is synchronization. There are inter-process synchronization primitives, including mutexes. And you can use lock-free synchronization in a single-process multi-threaded scenario.