I'd be very surprised if Apple can put something on the level of GPT4 on a handheld. Remember, GPT4 is estimated to be around 1.7 trillion parameters. That's 3.4TB at 16-bit, and it would still be ~340GB at 1.58 bits. The best we can hope for is a model of a few billion parameters. Which would still be cool on a phone, but as of today those models are nowhere near GPT4.
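The back-of-envelope math works out like this (the parameter count is a rumored estimate, not an official figure):

```python
def model_size_gb(params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB: params * bits / 8 bits-per-byte."""
    return params * bits_per_param / 8 / 1e9

gpt4_params = 1.7e12  # rumored estimate only, never confirmed by OpenAI

print(model_size_gb(gpt4_params, 16))    # ~3400 GB, i.e. 3.4 TB at fp16
print(model_size_gb(gpt4_params, 1.58))  # ~336 GB at 1.58-bit ternary
```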
You don't need "GPT4" though. Mixtral 8x7B is robust and can be run in 36 GB, or 24 GB if you're willing to compromise. A 1.5-bit quantization should bring it down to 16. That's still a lot compared to the iPhone 15's 6 GB, but it's close enough to imagine it happening soon. With some kind of streaming-from-flash architecture you might be in the realm already.
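For what it's worth, the raw-weights arithmetic, assuming Mixtral 8x7B's commonly cited ~46.7B total parameters (the RAM figures above are higher because a running model also holds the KV cache, activations, and quantization scales):

```python
MIXTRAL_PARAMS = 46.7e9  # assumed total parameter count across all experts

for bits in (16, 6, 4, 1.58):
    gb = MIXTRAL_PARAMS * bits / 8 / 1e9
    print(f"{bits:>5}-bit: ~{gb:.1f} GB of weights")
```

At 4 bits that's ~23 GB of weights alone, which lines up with the 24 GB figure.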
> With some kind of streaming-from-flash architecture you might be in the realm already.
I thought mmap'ing models so that only the currently needed pieces are kept in RAM was something that was figured out ~6 months ago? Performance wasn't terribly great IIRC, but with how much faster 1.58-bit inference is, it should still be okay-ish.
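The idea is just to memory-map the weight file and let the OS page in only the tensors you actually read. A minimal sketch, assuming a hypothetical flat file layout (not any real model format):

```python
import mmap

def load_layer(path: str, offset: int, nbytes: int) -> bytes:
    """Read one layer's weights out of a memory-mapped file.

    Only the pages backing this slice get faulted in from flash,
    and the OS is free to evict them again under memory pressure.
    """
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            return mm[offset:offset + nbytes]
        finally:
            mm.close()
```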
They won't have something at that size because, as you pointed out, it is still huge. But depending on how they are used, smaller models may be better for specific on-phone tasks, which starts to make model size a non-problem. GPT4 is so large because it is very general purpose, with the goal seemingly being to answer anything. You could have a smaller model focused solely on Siri or something that wouldn't require the parameter count of GPT4.
The thing about GPT4 that matters so much is not just raw knowledge retention, but complex, abstract reasoning, and even knowing what it doesn't know. We haven't seen that yet in smaller models, and it's unclear whether it's even possible. The best we can hope for right now is a better natural language interface than Siri for calling OS functions.