sharpneli 12y ago

I did watch the video. It is a neat idea. However, it is something OpenCL already makes explicitly possible.

If you write a kernel that uses neither local memory nor the local_id, you get a kernel that is effectively a Halide pipeline stage. The points can be evaluated in arbitrary order (the spec leaves that implementation defined).
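
As a rough illustration of that point, here is a small model in Python rather than OpenCL C. The names `brighten` and `run_kernel` and the sample data are made up for this sketch. A kernel body that reads only its global ID computes each output point independently, so any visiting order gives the same result:

```python
import random

def brighten(gid, src):
    """Hypothetical kernel body: depends only on the global ID and the input."""
    return min(src[gid] + 10, 255)

def run_kernel(kernel, src, order):
    """Evaluate the kernel once per global ID, in whatever order is given."""
    dst = [None] * len(src)
    for gid in order:
        dst[gid] = kernel(gid, src)
    return dst

src = [0, 50, 100, 150, 200, 250]

# Visit the points front to back.
in_order = run_kernel(brighten, src, range(len(src)))

# Visit the same points in a shuffled order (fixed seed for repeatability).
shuffled = list(range(len(src)))
random.Random(0).shuffle(shuffled)
out_of_order = run_kernel(brighten, src, shuffled)

assert in_order == out_of_order
```

Nothing in `brighten` refers to a neighbouring work-item or to shared scratch memory, which is what leaves the runtime free to pick the order.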

If we look at the blur example in the video, the OpenCL implementation is also free to effectively merge the stages like in Halide. That is because the spec only requires that whatever the previous kernel invocation has written be visible to the next kernel. Nothing more and nothing less.
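
To make "merging the stages" concrete, here is a minimal Python sketch, assuming a 3x3 box blur with clamped borders as a stand-in for the example in the video. The function names are hypothetical. The two-pass version materializes the intermediate image; the fused version recomputes the first stage on demand and produces the same output:

```python
def blur_x(img, x, y):
    """Horizontal 3-tap box blur at (x, y), clamping at the image border."""
    w = len(img[0])
    return sum(img[y][min(max(x + d, 0), w - 1)] for d in (-1, 0, 1)) / 3

def blur_two_pass(img):
    """Stage 1 writes a full intermediate image; stage 2 reads it back."""
    h, w = len(img), len(img[0])
    tmp = [[blur_x(img, x, y) for x in range(w)] for y in range(h)]
    return [[sum(tmp[min(max(y + d, 0), h - 1)][x] for d in (-1, 0, 1)) / 3
             for x in range(w)] for y in range(h)]

def blur_fused(img):
    """Stage 1 is recomputed where stage 2 needs it; no intermediate image."""
    h, w = len(img), len(img[0])
    return [[sum(blur_x(img, x, min(max(y + d, 0), h - 1))
                 for d in (-1, 0, 1)) / 3
             for x in range(w)] for y in range(h)]

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]

# Both strategies are observably identical from the outside.
assert blur_two_pass(img) == blur_fused(img)
```

The fused version trades extra arithmetic for less memory traffic. Whether that trade pays off depends on the hardware, which is exactly the kind of decision under discussion here.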

Sure, OpenCL allows you to fiddle around with low-level details, but it also allows you to write completely platform-neutral code, which the platform is then responsible for optimizing.

I do agree that Halide lets you easily explore the different scheduling options; that is something OpenCL is not capable of. In OpenCL you either do it manually or leave it entirely to the implementation.

zalman 12y ago

"OpenCL implementation is also free to effectively merge the stages like in Halide."

I gave up relying on magic compilers a long time ago. And having worked in this domain for a long time, I'm actually offended by people who write off the problem as simply a matter of a good enough optimizer. That attitude has significantly held back both performance and portability in imaging. (And likely other areas.)

Halide is not magic; it is just a better slicing of the problem, backed up by a good implementation. As always, there is no free lunch, but when it comes down to actually shipping this kind of code across a wide variety of platforms with great performance, it is a lot more productive than anything else out there.

sharpneli (OP) 12y ago

I do agree that relying on compiler optimizations is a waste of time. They never seem to appear. I simply wished to point out that you can also write Halide-like code in OpenCL. And I'd love to see an implementation that attempts the style of optimizations Halide allows.

Halide is extremely domain specific, which is a good thing: it allows them to focus on the problem at hand, namely how to easily write image processing filters that can be made performant with relative ease. However, I would not wish to write a bitonic sort or anything like that in Halide.

zalman 12y ago

Andrew Adams did write bitonic sort: https://github.com/halide/Halide/blob/master/test/performanc...

As I wrote in another comment, the domain of problems for which Halide works is broader than imaging. I usually present it as "data parallel problems." In fact, I'd say the difference in domain between what Halide is good at and what OpenCL and CUDA are good at is not that significant in practice because those languages are basically C/C++ outside of kernel parallelism. (They are each adding some task parallelism facilities as well.)
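
To show why a sort can count as a data parallel problem, here is a minimal Python sketch of the textbook bitonic sorting network. It is not the Halide code linked above, and the function names are made up for this illustration. Each stage computes every output index independently from the previous stage's array, the same shape as a pipeline stage over pixels:

```python
def bitonic_stage(data, k, j):
    """One data-parallel stage: every index i is computed independently."""
    def point(i):
        partner = i ^ j                  # element this index is compared with
        ascending = (i & k) == 0         # sort direction of the enclosing block
        lo = min(data[i], data[partner])
        hi = max(data[i], data[partner])
        # In an ascending block the lower index keeps the smaller value;
        # in a descending block it keeps the larger one.
        keep_small = (i < partner) == ascending
        return lo if keep_small else hi
    return [point(i) for i in range(len(data))]

def bitonic_sort(data):
    """Sort a list whose length is a power of two, stage by stage."""
    n = len(data)
    assert n & (n - 1) == 0, "length must be a power of two"
    k = 2
    while k <= n:
        j = k // 2
        while j >= 1:
            data = bitonic_stage(data, k, j)
            j //= 2
        k *= 2
    return data

assert bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]) == [1, 2, 3, 4, 5, 6, 7, 8]
```

The only sequential part is the loop over stages; inside a stage, the points can be evaluated in any order.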
