Synchronous Processors (2016)

(yodaiken.com)

30 points · palmtree3000 · 4y ago · 19 comments

19 comments

projektfu 4y ago

This article would make more sense if it included the results of a workload simulation showing how much time is lost to interrupt latency and how much processor time a different technique could save.

johndoe0815 4y ago

The Transputer and its successors, the XMOS embedded SoCs, already implement a lot of the features mentioned in this blog post by Yodaiken.

…and the subject should have “(2016)” added…

saagarjha 4y ago

Fixed.

dmitrygr 4y ago

And how would you do context switches if a CPU-bound task does not yield and you do not have interrupts to ... interrupt ... that?

freemint 4y ago

The article says this about it:

> We could have a simple cycle timer switch on each core so that after the timer expires there an interrupt-like jump to a function to see what to do next. That jump would be perfectly synchronous since predicting the next jump can be done with 100% accuracy (or nearly 100%).
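To illustrate the quoted idea, here is a minimal sketch of a core whose cycle timer forces a jump to a scheduling function after a fixed budget. It is a toy simulation, not anything from the article: `run_core`, `quantum`, and the task cycle counts are hypothetical names and numbers, and memory latency is ignored entirely. The point is only that the switch points are known in advance.

```python
from collections import deque

def run_core(tasks, quantum, total_cycles):
    """Simulate one core whose cycle timer expires every `quantum` cycles.

    `tasks` maps a task name to the number of cycles it still needs
    (hypothetical workload). Returns the (start_cycle, task_name) pairs
    at which the core jumps to the next task.
    """
    ready = deque(tasks.items())
    schedule = []
    cycle = 0
    while ready and cycle < total_cycles:
        name, remaining = ready.popleft()
        schedule.append((cycle, name))
        # Run until the timer expires, the task finishes, or time runs out.
        ran = min(quantum, remaining, total_cycles - cycle)
        cycle += ran
        remaining -= ran
        if remaining > 0:
            ready.append((name, remaining))  # not done: back of the queue
    return schedule

switches = run_core({"a": 250, "b": 100}, quantum=100, total_cycles=1000)
print(switches)
```

Because the timer is driven by a cycle count rather than an external event, every switch point in this toy model can be computed before the program runs.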

dmitrygr 4y ago

Cycle accurate? So now you can predict how long RAM latency is down to the cycle, refreshes and DMA be damned?

Interrupt-like? So what will you do? Save the context of this thread, load another... hm... sure sounds like what we already do.

Taniwha 4y ago

In other words, a timer interrupt, with saving of state and appropriate unwinding of pipeline state (abandoning half-done or out-of-order instructions, etc.).

Also the "do system calls by queuing requests to another CPU" is kind of at odds with "we don't need cache coherency"

pjc50 4y ago

Not necessarily; it's an interesting idea. Thanks to the branch predictor the CPU already has a virtual view of the instruction stream. If we tolerate a bit of latency, all we have to do is inject a "jump to ISR" magic instruction in the predicted stream. Rather like self-modifying code, except without modifying the code in memory, just at the instruction fetch point. State still has to be saved but that can be done with PUSH instructions in the ISR.

> Also the "do system calls by queuing requests to another CPU" is kind of at odds with "we don't need cache coherency"

Can be done with mailboxes/FIFOs, but yes this requires a dedicated design. And of course the CPU that does the call is then idle I think?
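A rough sketch of the mailbox idea, assuming a single-producer/single-consumer ring buffer in which each index is written by only one side. `Mailbox`, `service_core_step`, and the request format are hypothetical names made up for illustration. The sketch runs both "cores" in one Python thread, so it shows the data flow only, not real hardware behaviour.

```python
class Mailbox:
    """Fixed-size single-producer/single-consumer FIFO (illustrative only)."""

    def __init__(self, slots):
        self.buf = [None] * slots
        self.head = 0  # advanced only by the producer
        self.tail = 0  # advanced only by the consumer

    def post(self, msg):
        nxt = (self.head + 1) % len(self.buf)
        if nxt == self.tail:
            return False  # full: the caller has to retry or wait
        self.buf[self.head] = msg
        self.head = nxt
        return True

    def take(self):
        if self.tail == self.head:
            return None  # empty
        msg = self.buf[self.tail]
        self.tail = (self.tail + 1) % len(self.buf)
        return msg

# One mailbox per direction between the calling core and the service core.
requests, replies = Mailbox(4), Mailbox(4)

def service_core_step():
    """Service core: handle at most one queued request."""
    req = requests.take()
    if req is not None:
        op, arg = req
        replies.post((op, len(arg)))  # stand-in for the real system call

requests.post(("write", "hello"))  # calling core queues its request
service_core_step()                # service core picks it up
result = replies.take()            # calling core collects the reply
print(result)
```

One slot is always left empty so that "full" and "empty" can be told apart from the two indices alone; a mailbox created with `slots=4` therefore holds at most three messages.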

johndoe0815 4y ago

One approach is to use a barrel processor which switches threads after each cycle or instruction: https://en.m.wikipedia.org/wiki/Barrel_processor
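As a minimal sketch of that scheduling pattern: the toy model below rotates over a fixed set of hardware threads and issues one instruction per cycle. `barrel_run` and the instruction labels are hypothetical, and the model assumes a strict rotation in which a finished thread's slot simply goes unused.

```python
def barrel_run(programs, cycles):
    """Issue one instruction per cycle, rotating over hardware threads.

    `programs` is a list of instruction lists, one per hardware thread.
    Returns a trace of (cycle, thread_index, instruction), where the
    instruction is None if that thread had nothing left to issue.
    """
    pcs = [0] * len(programs)  # per-thread program counters
    trace = []
    for cycle in range(cycles):
        t = cycle % len(programs)  # strict round-robin slot ownership
        if pcs[t] < len(programs[t]):
            trace.append((cycle, t, programs[t][pcs[t]]))
            pcs[t] += 1
        else:
            trace.append((cycle, t, None))  # slot goes unused
    return trace

trace = barrel_run([["a0", "a1"], ["b0"], ["c0", "c1"]], cycles=6)
for entry in trace:
    print(entry)
```

In this model each thread's issue cycles depend only on its index and the thread count, which is what makes the timing predictable.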

freemint 4y ago

That does not solve the problem at all. It just increases the number of "hyper threads"; if a new process gets started and all cores are busy, that process might never run.

johndoe0815 4y ago

It solves the problem for environments where problems like interrupt latency and timing criticality usually show up: embedded and real-time systems. In many such systems, the set of running tasks is fixed. There are even some very simple real-time operating systems (such as some OSEK configurations in the automotive sector) which require the set of tasks to be defined statically at compile time. After all, you don't suddenly feel the urge to start a game of Doom on your car's ABS controller :) (though, of course, somebody will try to do this...).

The (early) XMOS chips, for example, run at 500 MHz with four threads or, if you needed more threads, you could also configure the system to run eight threads at half the speed IIRC. If you used, e.g., three threads, some execution time remained unused in the four-thread mode; there was no arbitrary division of time by the number of threads.

For real-time critical systems, you could then still run up to seven critical threads at guaranteed speed and reserve the remaining one for non-timing-critical tasks (which you could then schedule using cooperative multitasking).
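The arithmetic behind this can be sketched as follows. This is one reading of the description above, not taken from XMOS documentation: it assumes the 500 MHz figure is a core clock shared among fixed slots, with four slots for up to four threads and eight slots beyond that. `per_thread_mhz` and `unused_fraction` are hypothetical helper names.

```python
def slots_for(active_threads):
    """Assumed slot count: four slots up to four threads, else eight."""
    if not 1 <= active_threads <= 8:
        raise ValueError("this model only covers 1 to 8 threads")
    return 4 if active_threads <= 4 else 8

def per_thread_mhz(core_mhz, active_threads):
    """Guaranteed rate per thread: fixed by the mode, not the thread count."""
    return core_mhz / slots_for(active_threads)

def unused_fraction(active_threads):
    """Share of issue slots that no thread uses."""
    return 1 - active_threads / slots_for(active_threads)

print(per_thread_mhz(500, 3), unused_fraction(3))
print(per_thread_mhz(500, 8), unused_fraction(8))
```

Under these assumptions three threads each get the same rate as four would, and the fourth slot is simply wasted, which matches the "no arbitrary division of time" remark.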

The RAM was a fast on-chip SRAM, so there were no problems with refresh, access latencies etc. that you have with DRAM. However, you were constrained to 64 kB RAM per core (probably not enough to run Doom...).

The XMOS development toolchain even includes a real-time analyzer for the C/C++ code you throw at it. Unfortunately, most of the XMOS toolchain is closed source.

convolvatron 4y ago

thread state is just a bit of memory.

bob1029 4y ago

What if you have multiple CPUs and you don't want the task to yield (e.g. for performance or latency reasons)?

dmitrygr 4y ago

That limits your OS to numThreads < numCpus

bob1029 4y ago

What if we have multiple classes of threads with varying scheduling and interruption policies?
