undefined | Better HN

0 pointsnowittyusername6mo ago0 comments

My man, we now have llms that are anywhere between 130 million to 1 trillion parameters available for us to run locally, I can guarantee there is a model for you there that even your toaster can run. I have a RTX 4090 but for most of my fiddling i use small models like Qwen 3 4b and they work amazing so there's no excuse :P.

0 comments

8note6mo ago

well, i got some gemini models running on my phone, but if i switch apps, android kills it, so the call to the server always hangs... and then the screen goes black

the new laptop only has 16GB of memory total, with another 7 dedicated to the NPU.

i tried pulling up Qwen 3 4B on it, but the max context i can get loaded is about 12k before the laptop crashes.

my next attempt is gonna be a 0.5B one, but i think ill still end up having to compress the context every call, which is my real challenge

nowittyusernameOP6mo ago

I recommend use low quantized models first. for example anywhere between q4 and q8 gguf models. Also dont need high context to fiddle around and learn the ins and outs. for example 4k context is more then enough to figure out what you need in agentic solutions. In fact thats a good limit to impose on yourself and start developing decent automatic context management systems internally as that will be very important when making robus agentic solutions. with all that you should be able to load an llm no issues on many devices.

tmzt6mo ago

If it helps, you can disable some of those limitations on Android:

https://www.reddit.com/r/AndroidQuestions/comments/16r1cfq/p...

j / k navigate · click thread line to collapse