Both models currently use around 3GB (converted to FP16 for speed). But I checked that the (slower) FP32 version uses only 2.3GB, so we are probably doing something suboptimal here — FP16 weights should take half the memory of FP32, not more.
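For reference, the expected weight footprint is just parameter count times bytes per element, so FP16 should come out to roughly half of FP32. A quick back-of-the-envelope sketch (the parameter count below is illustrative, not the actual models'):

```python
def model_weight_gb(n_params: int, bytes_per_param: int) -> float:
    """Expected memory for dense model weights, in GiB."""
    return n_params * bytes_per_param / 1024**3

# Hypothetical ~1.5B-parameter model for illustration:
n = 1_500_000_000
fp32_gb = model_weight_gb(n, 4)  # ~5.6 GiB
fp16_gb = model_weight_gb(n, 2)  # ~2.8 GiB, exactly half of FP32
```

Any gap beyond that ratio usually points to weights kept around in both precisions, or activation/KV-cache buffers being counted in.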
We only support CUDA right now, although it should not be too hard to port to whisper.cpp/llama.cpp or Apple's MLX, since it's a pretty straightforward transformer architecture.