Bresenham's Circle Drawing Algorithm (2021) (opens in new tab)

(funloop.org)

107 pointsRojerGS1y ago45 comments

45 comments

Do note that Bresenham’s family of algorithms (and much of the venerable Computer Graphics Principles and Practice) are from a bygone era where computers executed 1 instruction per cycle without pipelining or prediction.

These days processors prefer to draw shapes by having a coarse grained pass that conservatively selects tiles of interest then brute-force evaluates each pixel in each tile independently of all others.

Instead of minimizing total work, the goal is to maximize pipelined parallelism while skipping over unnecessary work in large blocks.

Someone1y ago

> a bygone era where computers executed 1 instruction per cycle without pipelining or prediction

1 instruction per cycle? What luxury bygone era did you grow up in?

Wikipedia tells me the algorithm is from 1962 on an IBM 1401 (https://en.wikipedia.org/wiki/Bresenham's_line_algorithm#His...

That definitely didn’t have many single cycle instructions. Skimming https://ibm-1401.info/A24-6447-0_1401_1460_Instruction_and_T..., I couldn’t find any.

Certainly, in the era of 8-bit CPUs like Z80 and 6502 programmers would have been lyric about “1 instruction per cycle”

Actually, did any CPU ever “execute 1 instruction per cycle without pipelining or prediction” (or, slightly looser “had fixed time instructions without pipelining or prediction”)?

RISC introduced fixed time instructions, but also pipelining.

chillingeffect1y ago

Adsp-218x family of harvard architecture DSPs. Each 25 ns clock cycle = 1 MAC, 1 data fetch, and one instruction fetch. All instructions 1 cycle long. And many other gizmos like reversible address bit ranges for DFT in place, separate register bank for IRQ handlers. And all laid out by hand.

kragen1y ago

no, you're right, you almost always need pipelining to get one instruction per clock cycle

but there are a lot of cpus out there—maybe the majority now—that are pipelined microarchitectures that get one instruction per cycle without much or any prediction. avrs, most of the cortex-m* (all?), most modern implementations of old slow isas like the z80 and 8051, etc. big processors like your laptop cpu and cellphone cpu are of course superscalar, but they are a tiny minority of all processors. even inside the cellphone case, they're outnumbered by in-order scalar microcontrollers

without prediction, of course, you have a pipeline bubble every time you have a branch, so you never quite hit 1 ipc with scalar execution. but it's usually pretty close, and even with prediction, sometimes you miss. and usually if you have branch prediction, you also have a cache, because ain't nobody got time to share one triflin memory bus between instructions and data

so pipelining gets you from, say, 3 clocks per instruction down to 1.2 or so. then prediction gets you from 1.2 down to, say, 1.02

buescher1y ago

I remember the bragging point on the RTX2000 in the very late eighties was "a MIP per megahertz".

kragen1y ago

my limited experience with stack machines is that a stack mip is about half a register-machine mip :-(

consider this simple assembly-language subroutine i wrote in october (http://canonical.org/~kragen/sw/dev3/tetris.S, screencast at https://asciinema.org/a/622461):

            @@ Set sleep for iwait to r0 milliseconds.
            @@ (r0 must be under 1000)
            .thumb_func
    waitis: ldr r2, =wait           @ struct timeval
            movs r3, #0             @ 0 sec
            str r3, [r2]            @ .tv_sec = 0
            ldr r1, =1000           @ multiplier for ms
            mul r0, r1
            str r0, [r2, #4]        @ set .tv_usec
            bx lr

            .bss
    wait:   .fill 8                 @ the struct timeval
            .text

these are all perfectly normal register-machine instructions; you could translate them one-to-one to almost any register machine. on a few of them you could drop one, writing something like str #0, [wait]. the whole function is straight-line execution, a single basic block, and seven instructions long. this is almost a best case for a stack machine; it looks like this:

    lit(0)      ; load immediate constant
    lea(wait)   ; load address of wait onto stack
    !           ; store 0 in wait
    lit(1000)   ; another constant
    *           ; multiply argument by constant
    lea(wait)   ; load address again
    lit(4)      ; offset of .tv_usec
    +           ; calculate address of wait.tv_usec
    !           ; store product in tv_usec
    ret         ; return from subroutine

that's 10 instructions, about 50% longer than the 6 or 7 of the two-operand register machine. but the typical case is worse. and basically the reason is that the average number of stack manipulations or constant pushes that you need to do to get the right operands on top of the stack for your memory accesses and computational operations is roughly 1. sometimes you'll have a * or ! (store) or + that's not preceded by a stack manipulation or a load-immediate or a load-address operation, and that time the stack machine wins, but other times it'll be preceded by two or three of them, and that time the stack machine loses

so it averages out to about two stack instructions per register-machine instruction. call and return is faster on the stack machine, but passing arguments and return values in registers on the register machine can take away most of that advantage too

the rtx-2000 did some tricks to sometimes do more than a single stack operation per cycle, but it didn't really escape from this

this doesn't necessarily mean that stack machines like the rtx-2000 are a bad design approach! the design rationale is that you get very short path lengths, so you can clock the design higher than you could clock a design that had register-file muxes in the critical path, and you also avoid the branching penalty that pipelines impose on you, and you use less silicon area. mainstream computation took a different path, but plausibly the two-stack designs like the rtx-2000, the novix nc4000, the shboom, the mup21, and the greenarrays f18a could have been competitive with the mainstream risc etc. approach. but you do need a higher clock rate because each instruction does less

i don't remember if dr. ting wrote a book about the rtx-2000, but he did write a very nice book about its predecessor the nc4000, going into some of these tricks: https://www.forth.org/OffeteStore/4001-footstepsFinal.pdf

fasa991y ago

"oh these new-fangled kids what with their superscalar processors. 100 instructions in a cycle, phooey! Back in my day, it was dang gummed 100 cycles for each instruction, and gosh darnit we liked it! Now that was just for the add instruction, a division, well some say they never figured out how many cycles that was because it took too long. I had an onion in my belt which was the style at the time"

owisd1y ago

Probably meant 'cycle' in the sense of instruction cycle, rather than clock cycle.

Someone1y ago

Even then, few CPUs execute all instructions in the same amount of time. Division, for example, still takes longer than a register move on most architectures. Multiplication, historically, did too.

As a sibling comment points out, DSPs can be an exception. I’m far from an expert on them, but CPUs that try to run all instructions in the same fixed amount of time used to accomplish that by avoiding complex addressing modes and by omitting the really slow instructions (division and instructions that push/pop multiple registers onto/from the stack being prime examples).

IIRC, some of them also accomplished it by running some of the simpler instructions slower than necessary (raw speed isn’t as important as being hard real time isn many applications)

1 more reply

chiph1y ago

I think it depends. I had Dr. Bresenham as my graphics instructor (he taught for a while after he retired from IBM) and the class used Borland Turbo Pascal and for it's time it was fast. Not as fast as raw assembly. But faster than Borlands Turbo C that had just come out.

So far as 1 instruction per cycle - Wikipedia says the 80286 (the top dog PC processor of the time) could execute 0.21 "typical" instructions per clock. Optimized code was up to 0.5 instructions per clock. And that agrees with my memories.

Today, I would try and use parallelism if I could. With lots of conditions though - is the process being applied to the image trivially parallelizable, will most/all of it fit in cache, etc. Trying to parallelize Bresenham's algorithms though would be futile - when drawing circles you can reflect it into the different quadrants (big savings) but the algorithm itself is going to be pretty serial because it has to keep up with the error coefficient as it draws.

ack_complete1y ago

It's actually not difficult to vectorize Bresenham algorithms, at least the line algorithm. You just have to preroll the algorithm for each of the lanes and then adjust the steps and error factors so each lane steps 4-8 pixels ahead at a time interleaved. I've done this for the floor type 1 render in Ogg Vorbis, which is defined in terms of a Bresenham-like algorithm.

kragen1y ago

oh wow, this is awesome, thanks! i never realized!

eru1y ago

It's also from an era when floats were rather expensive.

mabster1y ago

I still picture them as expensive. Things like trig functions are still very expensive.

dahart1y ago

I learned relatively recently that trig functions on the GPU are free if you don’t use too many of them; there’s a separate hardware pipe so they can execute in parallel with floats adds and muls. There’s still extra latency, but it’ll hide if there’s enough other stuff in the vicinity.

1 more reply

eru1y ago

Yes, but here it's about avoiding multiplication (and division).

I suspect on a modern processor the branches (ie "if"s) in Bresenham's algorithm are gonna be more expensive than the multiplications and divisions in the naive algorithm.

1 more reply

_0ffh1y ago

Nah, while cycles/instruction where indeed fixed in those days (and for some time yet to come), it was not necessarily 1 cycle but rather depended on the instruction.

tengwar21y ago

Indeed. I used this algorithm on OS/2 1.0 as part of a GUI (OS/2 did not have a GUI until 1.1). That was on an 80386. MASM came with a nice ring-bound A6 book which summarised the instruction set, including timings. I seem to remember that 3 cycles was normal for a short instruction, but many were considerably longer.

userbinator1y ago

In particular, his line-drawing algorithm is easily beaten by fixed-point methods, which also have the advantage of a very short and non-branchy inner loop:

https://news.ycombinator.com/item?id=9954975

mysterydip1y ago

"yes, but" there's still places that can take advantage of such algorithms today, namely microcontrollers. I think some scripting languages may even apply here, although many interpreters do some level of compilation/optimization instead of serial execution.

amelius1y ago

Sounds like your algorithm is from a bygone era where power was not important ;)

corysama1y ago

If you can get the work done fast and hurry the processor back into a low-power sleep state ASAP, it can actually be power efficient too!

phire1y ago

1 instruction per cycle? No, that's only possible with pipelining.

We are talking about instructions which took 8-12 cycles to complete.

kragen1y ago

usually this is correct, but there are some exceptions. most instructions on a non-pipelined but synchronous stack machine like the mup21 take a single cycle, for example

even with a register file, it isn't really inherent that you need to decode inputs, do alu operations, and write outputs in separate clock cycles; you can do all of that in combinational logic except writing the outputs, and you can even decode which register to write the output to. it just means your max clock rate is in the toilet

for that kind of thing a harvard architecture is pretty useful; it allows you to read an instruction in instruction memory at the same time you're reading or writing data in data memory, instead of in two separate cycles

brcmthrowaway1y ago

Got an exsmple?

possiblywrong1y ago

> Note that if F(x,y)=0, then the point (x,y) is exactly on the circle. If F(x,y)>0, then the point is outside of the circle, and if F(x,y)<0 then the point is inside of it. In other words, given any point (x,y), F(x,y) is the distance from the true circle line [my emphasis].

This last is not quite true. The exact distance from the circle, call it G(x,y), is the corresponding difference of square roots, i.e.,

  def G(x, y, r):
    return math.sqrt(x * x + y * y) - math.sqrt(r * r)

and G(x,y) isn't just the square root of F(x,y), and indeed doesn't behave monotonically with respect to F(x,y).

It's an interest property of Bresenham's algorithm, that I've never seen even stated let alone proved in the literature, that this doesn't matter, and the algorithm is indeed exact in the sense that it always chooses the next point based on which is truly closest to the circle... despite using an error function that is only an approximation.

bsenftner1y ago

My computer graphics professor back in the early 80's was Prof. Glen Bresenham, but not The Bresenham. It was a lot of fun being at SIGGRAPH back then and watching people freak upon reading his name badge. He'd let them believe for a bit, and then explain it's not him. Al Acorn was one that freaked, and that was fun.

kragen1y ago

alcorn?

bsenftner1y ago

Allan Alcorn, American electrical engineer and computer scientist, is an American pioneering engineer and computer scientist best known for creating Pong, one of the first video games

kragen1y ago

yeah, i know him. not acorn

KingOfCoders1y ago

What I found most amazing about Bresenham during my Amiga days, is how you can use it to zoom and shrink bitmaps.

nikolay1y ago

This is easy; anti-aliasing is thougher. I used this algorithm in the '90s to come up with one for drawing ellipses. I've never had to do a disc or filled ellipse with or without outloune, but that would be interesting, too. Line size > 1 would be interesting as well.

seanhunter1y ago

Feels like simply using the parametric form of the circle equation would be way easier

For t in 0 to 2 pi in whatever small steps you want, draw_pixel(x=a+rcos t, y=b+rsin t) where a and b are the x and y coordinates you want for the centre of the circle and r is the radius.

The derivation of this form is pretty simple if you know basic trig. https://publish.obsidian.md/uncarved/3+Resources/Public/Unit...

ericyd1y ago

A cool write up with an approachable explanation of algorithm optimization! I wonder how long it would take me to arrive at this algorithm left to my own devices...

j / k navigate · click thread line to collapse

45 comments

corysama1y ago

Instead of minimizing total work, the goal is to maximize pipelined parallelism while skipping over unnecessary work in large blocks.

Someone1y ago

> a bygone era where computers executed 1 instruction per cycle without pipelining or prediction

1 instruction per cycle? What luxury bygone era did you grow up in?

Wikipedia tells me the algorithm is from 1962 on an IBM 1401 (https://en.wikipedia.org/wiki/Bresenham's_line_algorithm#His...

That definitely didn’t have many single cycle instructions. Skimming https://ibm-1401.info/A24-6447-0_1401_1460_Instruction_and_T..., I couldn’t find any.

Certainly, in the era of 8-bit CPUs like Z80 and 6502 programmers would have been lyric about “1 instruction per cycle”

Actually, did any CPU ever “execute 1 instruction per cycle without pipelining or prediction” (or, slightly looser “had fixed time instructions without pipelining or prediction”)?

RISC introduced fixed time instructions, but also pipelining.

chillingeffect1y ago

kragen1y ago

no, you're right, you almost always need pipelining to get one instruction per clock cycle

so pipelining gets you from, say, 3 clocks per instruction down to 1.2 or so. then prediction gets you from 1.2 down to, say, 1.02

buescher1y ago

I remember the bragging point on the RTX2000 in the very late eighties was "a MIP per megahertz".

kragen1y ago

my limited experience with stack machines is that a stack mip is about half a register-machine mip :-(

consider this simple assembly-language subroutine i wrote in october (http://canonical.org/~kragen/sw/dev3/tetris.S, screencast at https://asciinema.org/a/622461):

            @@ Set sleep for iwait to r0 milliseconds.
            @@ (r0 must be under 1000)
            .thumb_func
    waitis: ldr r2, =wait           @ struct timeval
            movs r3, #0             @ 0 sec
            str r3, [r2]            @ .tv_sec = 0
            ldr r1, =1000           @ multiplier for ms
            mul r0, r1
            str r0, [r2, #4]        @ set .tv_usec
            bx lr

            .bss
    wait:   .fill 8                 @ the struct timeval
            .text

    lit(0)      ; load immediate constant
    lea(wait)   ; load address of wait onto stack
    !           ; store 0 in wait
    lit(1000)   ; another constant
    *           ; multiply argument by constant
    lea(wait)   ; load address again
    lit(4)      ; offset of .tv_usec
    +           ; calculate address of wait.tv_usec
    !           ; store product in tv_usec
    ret         ; return from subroutine

the rtx-2000 did some tricks to sometimes do more than a single stack operation per cycle, but it didn't really escape from this

fasa991y ago

owisd1y ago

Probably meant 'cycle' in the sense of instruction cycle, rather than clock cycle.

Someone1y ago

Even then, few CPUs execute all instructions in the same amount of time. Division, for example, still takes longer than a register move on most architectures. Multiplication, historically, did too.

IIRC, some of them also accomplished it by running some of the simpler instructions slower than necessary (raw speed isn’t as important as being hard real time isn many applications)

1 more reply

chiph1y ago

ack_complete1y ago

kragen1y ago

oh wow, this is awesome, thanks! i never realized!

eru1y ago

It's also from an era when floats were rather expensive.

mabster1y ago

I still picture them as expensive. Things like trig functions are still very expensive.

dahart1y ago

1 more reply

eru1y ago

Yes, but here it's about avoiding multiplication (and division).

I suspect on a modern processor the branches (ie "if"s) in Bresenham's algorithm are gonna be more expensive than the multiplications and divisions in the naive algorithm.

1 more reply

_0ffh1y ago

Nah, while cycles/instruction where indeed fixed in those days (and for some time yet to come), it was not necessarily 1 cycle but rather depended on the instruction.

tengwar21y ago

userbinator1y ago

In particular, his line-drawing algorithm is easily beaten by fixed-point methods, which also have the advantage of a very short and non-branchy inner loop:

https://news.ycombinator.com/item?id=9954975

mysterydip1y ago

amelius1y ago

Sounds like your algorithm is from a bygone era where power was not important ;)

corysama1y ago

If you can get the work done fast and hurry the processor back into a low-power sleep state ASAP, it can actually be power efficient too!

phire1y ago

1 instruction per cycle? No, that's only possible with pipelining.

We are talking about instructions which took 8-12 cycles to complete.

kragen1y ago

usually this is correct, but there are some exceptions. most instructions on a non-pipelined but synchronous stack machine like the mup21 take a single cycle, for example

brcmthrowaway1y ago

Got an exsmple?

possiblywrong1y ago

This last is not quite true. The exact distance from the circle, call it G(x,y), is the corresponding difference of square roots, i.e.,

  def G(x, y, r):
    return math.sqrt(x * x + y * y) - math.sqrt(r * r)

and G(x,y) isn't just the square root of F(x,y), and indeed doesn't behave monotonically with respect to F(x,y).

bsenftner1y ago

kragen1y ago

alcorn?

bsenftner1y ago

Allan Alcorn, American electrical engineer and computer scientist, is an American pioneering engineer and computer scientist best known for creating Pong, one of the first video games

kragen1y ago

yeah, i know him. not acorn

KingOfCoders1y ago

What I found most amazing about Bresenham during my Amiga days, is how you can use it to zoom and shrink bitmaps.

nikolay1y ago

seanhunter1y ago

Feels like simply using the parametric form of the circle equation would be way easier

For t in 0 to 2 pi in whatever small steps you want, draw_pixel(x=a+rcos t, y=b+rsin t) where a and b are the x and y coordinates you want for the centre of the circle and r is the radius.

The derivation of this form is pretty simple if you know basic trig. https://publish.obsidian.md/uncarved/3+Resources/Public/Unit...

ericyd1y ago

A cool write up with an approachable explanation of algorithm optimization! I wonder how long it would take me to arrive at this algorithm left to my own devices...

j / k navigate · click thread line to collapse