ONNX Runtime and CoreML May Silently Convert Your Model to FP16

(ym2132.github.io)

98 points · Two_hands · 5mo ago · 17 comments

17 comments

smcleod · 5mo ago

This was an interesting read, thanks for sharing. I've recently been building something that uses the Parakeet v2/v3 models. I'm using the parakeet-rs package (https://github.com/altunenes/parakeet-rs), which has had a few issues running models with CoreML (unrelated to the linked post), e.g. https://github.com/microsoft/onnxruntime/issues/26355

Two_hands (OP) · 5mo ago

Thank you for reading.

Also, generally I think CoreML isn't the best. The best solution for ORT would probably be to introduce a pure MPS provider (https://github.com/microsoft/onnxruntime/issues/21271), but given they've already bought into CoreML, the effort may not be worth the reward for the core team. Which is fair enough, as it's a pretty mammoth task.

pzo · 5mo ago

However, CoreML has one benefit: it is the only way for third parties to execute on the ANE (Apple Neural Engine, aka NPU). For some models the ANE can execute even faster than the GPU/MPS and consume even less battery.

But I agree that CoreML in ONNX Runtime is not perfect. Most of the time when I tested models, the graph was split into too many partitions and the whole thing ran slower than the same model in plain CoreML format.

Two_hands (OP) · 5mo ago

To be honest it's a shame the whole thing is closed up. I guess it's to be expected from Apple, but I reckon CoreML would benefit a lot from at least exposing the internals or allowing users to define new ops.

Also, the ANE only allows some operators to be run on it, right? There's very little transparency or control over what can and cannot be offloaded to it, which makes using it difficult.
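For the ONNX Runtime side of this, a minimal sketch of the coarse control that is available, assuming a recent ONNX Runtime release whose CoreML execution provider accepts the `ModelFormat` and `MLComputeUnits` provider options. The helper name `coreml_providers` is hypothetical, and the option values should be checked against the documentation for your installed version. It only selects which compute units CoreML may use; it does not give per-operator control.

```python
# Hypothetical helper: builds a `providers` list for an ONNX Runtime session.
# Assumption: the installed ONNX Runtime CoreML execution provider accepts the
# "ModelFormat" and "MLComputeUnits" option keys with the values listed below.

ALLOWED_FORMATS = {"MLProgram", "NeuralNetwork"}
ALLOWED_UNITS = {"ALL", "CPUOnly", "CPUAndGPU", "CPUAndNeuralEngine"}


def coreml_providers(model_format="MLProgram", compute_units="ALL"):
    """Return a providers list with CoreML first and CPU as the fallback."""
    if model_format not in ALLOWED_FORMATS:
        raise ValueError(f"unknown model format: {model_format!r}")
    if compute_units not in ALLOWED_UNITS:
        raise ValueError(f"unknown compute units: {compute_units!r}")
    coreml_options = {
        "ModelFormat": model_format,
        "MLComputeUnits": compute_units,
    }
    # Nodes the CoreML provider does not take fall back to the CPU provider.
    return [("CoreMLExecutionProvider", coreml_options), "CPUExecutionProvider"]


if __name__ == "__main__":
    # Restrict CoreML to CPU + GPU, i.e. keep the ANE out of the picture.
    print(coreml_providers(compute_units="CPUAndGPU"))
```

The returned list would be passed as the `providers` argument of `onnxruntime.InferenceSession`, along with the path to your own model file.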

trashtensor · 5mo ago

If you double-click the CoreML file on a Mac and open it in Xcode, there is a profiler you can run. The profiler will show you the operations it's using and what the bit depth is.

Two_hands (OP) · 5mo ago

cheers for the tip, I'll give it a go

nuc1e0n · 5mo ago

My experiences with ONNX have not been pleasant. Conversions from models written with TensorFlow and PyTorch often fail. I recommend using TFLite or ExecuTorch for deployment to edge devices instead.

Two_hands (OP) · 4mo ago

Agreed. I have seen some speedups with ONNX, if I'm being honest, but the process is a bit messy, especially on macOS. I'll try out ExecuTorch and see how it compares, cheers for the recommendation.

yousifa · 5mo ago

On the CoreML side, this is likely because the Neural Engine supports fp16, and offloading some or all layers to the ANE significantly reduces inference time and power usage when running models. You can use the Xcode profiler to see what is running on each part of the device and at what precision.

Two_hands (OP) · 5mo ago

Yeah, I can see why they let it be that way, but the fact that the behaviour is pretty undefined is what bugged me. I suppose it depends on what your goals are: efficiency vs reproducibility.

Also, I did run a test of FP16 vs FP32 for a large matmul on the Apple GPU, and the FP16 calculation was 1.28x faster, so it makes sense that they'd go for FP16 as a default.
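To make the reproducibility side of that trade-off concrete, here is a minimal sketch of how much precision a value loses when it is rounded to FP16 rather than FP32. It uses only the Python standard library (`struct` with the `e` half-precision format) and does not touch CoreML or the GPU; the helper names and sample values are illustrative, not taken from the linked post.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision (FP32) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def relative_error(exact: float, approx: float) -> float:
    """Relative difference between a reference value and its rounded form."""
    return abs(exact - approx) / abs(exact)


if __name__ == "__main__":
    # Illustrative sample values; all are within the finite FP16 range.
    for v in (0.1, 1.0001, 1234.567):
        err32 = relative_error(v, to_fp32(v))
        err16 = relative_error(v, to_fp16(v))
        print(f"{v!r}: fp32 rel. error {err32:.2e}, fp16 rel. error {err16:.2e}")
```

FP16 has a 10-bit mantissa, so the spacing between representable values just above 1.0 is about 0.001. That means `to_fp16(1.0001)` comes back as exactly `1.0`, the kind of small difference that can accumulate across layers when a model is run at lower precision than expected.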
