If you write a kernel that uses neither local memory nor the local ID, you get a kernel that is effectively a Halide pipeline stage. The points can be evaluated in arbitrary order (the spec leaves the execution order implementation-defined).
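To make the "arbitrary order" point concrete, here is a rough sketch in plain C rather than OpenCL. The names `scale_point`, `run_ascending` and `run_descending` are made up for illustration; the two loops just stand in for two different orders an implementation might pick.

```c
#include <stddef.h>

/* Hypothetical kernel body: it depends only on the global index, the way
   an OpenCL kernel would if it touched neither local memory nor the
   local ID. */
void scale_point(const float *in, float *out, size_t gid)
{
    out[gid] = 2.0f * in[gid] + 1.0f;
}

/* Stand-in for an implementation that walks the points front to back. */
void run_ascending(const float *in, float *out, size_t n)
{
    for (size_t gid = 0; gid < n; ++gid)
        scale_point(in, out, gid);
}

/* Stand-in for an implementation that walks the points back to front. */
void run_descending(const float *in, float *out, size_t n)
{
    for (size_t gid = n; gid-- > 0; )
        scale_point(in, out, gid);
}
```

Since each point reads and writes only its own index, both drivers fill `out` with the same values, which is the property that lets the platform choose the order.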
If we look at the blur example in the video, the OpenCL implementation is also free to effectively merge the stages, as Halide does. That's because the spec only requires that whatever the previous kernel invocation has written be visible to the next kernel. Nothing more and nothing less.
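As a rough illustration of what merging the stages means, here is a plain C sketch (not OpenCL API code) of a two-pass 3x3 box blur written both ways. The image size, the integer pixel type and the names `blur_x_at`, `blur_staged` and `blur_fused` are my own assumptions, not taken from the video.

```c
#include <stddef.h>
#include <stdint.h>

#define BLUR_W 6               /* hypothetical tiny image width  */
#define BLUR_H 5               /* hypothetical tiny image height */
#define BLUR_OW (BLUR_W - 2)   /* output width after a 3-tap blur */

/* Horizontal 3-tap average at (x, y); integer math keeps results exact. */
uint16_t blur_x_at(const uint16_t *in, size_t x, size_t y)
{
    const uint16_t *p = in + y * BLUR_W + x;
    return (uint16_t)((p[0] + p[1] + p[2]) / 3);
}

/* Staged: "kernel 1" writes the whole intermediate, "kernel 2" reads it. */
void blur_staged(const uint16_t *in, uint16_t *out)
{
    uint16_t tmp[BLUR_H * BLUR_OW];
    for (size_t y = 0; y < BLUR_H; ++y)
        for (size_t x = 0; x < BLUR_OW; ++x)
            tmp[y * BLUR_OW + x] = blur_x_at(in, x, y);

    for (size_t y = 0; y < BLUR_H - 2; ++y)
        for (size_t x = 0; x < BLUR_OW; ++x) {
            const uint16_t *t = tmp + y * BLUR_OW + x;
            out[y * BLUR_OW + x] =
                (uint16_t)((t[0] + t[BLUR_OW] + t[2 * BLUR_OW]) / 3);
        }
}

/* Fused: no intermediate buffer; the horizontal averages are recomputed
   for every output pixel instead of being stored. */
void blur_fused(const uint16_t *in, uint16_t *out)
{
    for (size_t y = 0; y < BLUR_H - 2; ++y)
        for (size_t x = 0; x < BLUR_OW; ++x) {
            unsigned sum = blur_x_at(in, x, y)
                         + blur_x_at(in, x, y + 1)
                         + blur_x_at(in, x, y + 2);
            out[y * BLUR_OW + x] = (uint16_t)(sum / 3);
        }
}
```

Both functions write the same output; they differ only in whether the intermediate is stored or recomputed. That is the kind of trade-off an implementation could make on its own, as long as the second stage sees what the first one would have written.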
Sure, OpenCL allows you to fiddle with low-level details, but it also allows you to write completely platform-neutral code that the platform is then responsible for optimizing.
I do agree that Halide lets you easily explore the different scheduling options; that's something OpenCL is not capable of. In OpenCL you either do the scheduling manually or leave it entirely up to the implementation.